date:20171117

misc/021-image-multi-devices fails on latest btrfs-devel/misc-next

2017-11-17 Thread Lakshmipathi.G

Hi.

A few days back (Nov 14th) this test script (misc/021-image-multi-devices)
used to pass, and it now seems to be failing,
maybe due to recent commits? Logs and screencasts are available here:
https://lakshmipathi.github.io/btrfsqa/


Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/4] btrfs: make open_ctree error injectable

2017-11-17 Thread Ingo Molnar


* Josef Bacik  wrote:

> From: Josef Bacik 
> 
> This allows us to do error injection with BPF for open_ctree.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index dfdab849037b..c6b4e1f07072 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "hash.h"
> @@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
>   goto fail_block_groups;
>   goto retry_root_backup;
>  }
> +BPF_ALLOW_ERROR_INJECTION(open_ctree);

Ok, this looks a lot better - except the random header inclusion dependency: if a
facility is in the BPF_*() namespace then it should include  and not a random
asm/* header...

With that detail fixed:

  Acked-by: Ingo Molnar 

for the whole series.

Thanks,

Ingo

Re: Unrecoverable scrub errors

2017-11-17 Thread Adam Borowski

On Fri, Nov 17, 2017 at 08:19:11PM -0700, Chris Murphy wrote:
> On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi  
> wrote:
> 
> >> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038723] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 1, gen 0
> >> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039637] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 2, gen 0
> >> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error 
> >> at logical 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> 
> These are metadata errors. Are there any other storage stack related
> errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
> link reset messages?
> 
> >Maybe I can find snapshot that contains file with wrong checksum and
> > remove corresponding snapshot or something like that?
> 
> It's not a file. It's metadata leaf.

Just for the record: had this been a data block (i.e., a non-inline file
extent), the dmesg message would include one of the filenames that refer to
that extent.  To clear the error, you'd need to remove all such files.
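
For anyone triaging similar scrub output, the metadata case can be picked out
mechanically. The following is a minimal Python sketch, not an official tool.
It assumes the wrapped dmesg entries have been joined back onto single lines,
and it only recognizes the metadata form of the warning exactly as quoted in
this thread; the function names are made up for illustration.

```python
import re

# Matches only the metadata form of the scrub warning quoted in this thread.
# Assumption: each dmesg entry has been re-joined onto a single line.
METADATA_CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<fsdev>[^)]+)\): checksum error at logical "
    r"(?P<logical>\d+) on dev (?P<dev>\S+), sector (?P<sector>\d+): "
    r"metadata (?P<kind>\w+) \(level (?P<level>\d+)\) in tree (?P<tree>\d+)"
)


def parse_metadata_csum_error(line):
    """Return the fields of a metadata checksum warning, or None if no match."""
    m = METADATA_CSUM_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("logical", "sector", "level", "tree"):
        fields[key] = int(fields[key])
    return fields


def unique_bad_blocks(lines):
    """Collapse repeated warnings into a sorted list of distinct logical addresses."""
    seen = set()
    for line in lines:
        hit = parse_metadata_csum_error(line)
        if hit is not None:
            seen.add(hit["logical"])
    return sorted(seen)
```

Fed the warnings quoted above, this should report two distinct leaves in tree
985 (logical 470069460992 and 470069526528), since each warning is logged
twice. Lines that do not match, such as the "bdev ... errs:" counters, are
simply skipped.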

> >> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
> >> Data, single: total=879.01GiB, used=877.24GiB
> >> System, DUP: total=40.00MiB, used=128.00KiB
> >> Metadata, DUP: total=20.50GiB, used=18.96GiB
> >> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Metadata is DUP, but both copies have corruption. Kinda strange. But I
> don't know how close the DUP copies are to each other, if possibly a
> big enough media defect can explain this.

The original post mentioned SSD (but was unclear if _this_ filesystem is
backed by one).  If so, DUP is nearly worthless as both copies will be
written to physical cells next to each other, no matter what positions the
FTL shows them at.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...

Re: Unrecoverable scrub errors

2017-11-17 Thread Chris Murphy

On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi  wrote:

>> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 
>> 470069460992 on dev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
>> 942238048: metadata leaf (level 0) in tree 985
>> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 
>> 470069460992 on dev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
>> 942238048: metadata leaf (level 0) in tree 985
>> [551049.038723] BTRFS error (device dm-2): bdev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
>> 0, flush 0, corrupt 1, gen 0
>> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 
>> 470069526528 on dev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
>> 942238176: metadata leaf (level 0) in tree 985
>> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 
>> 470069526528 on dev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
>> 942238176: metadata leaf (level 0) in tree 985
>> [551049.039637] BTRFS error (device dm-2): bdev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
>> 0, flush 0, corrupt 2, gen 0
>> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error 
>> at logical 470069460992 on dev 
>> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1


These are metadata errors. Are there any other storage stack related
errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
link reset messages?

> Are there any better options before resorting to `btrfsck --repair`?

I wouldn't try it just yet. What do you get for btrfs check without
repair? This will check the metadata and it should run into the same
problem, but if it craps out then chances are --repair will too.


>Maybe I can find snapshot that contains file with wrong checksum and remove 
>corresponding snapshot or something like that?

It's not a file. It's metadata leaf.


>> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
>> Data, single: total=879.01GiB, used=877.24GiB
>> System, DUP: total=40.00MiB, used=128.00KiB
>> Metadata, DUP: total=20.50GiB, used=18.96GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B

Metadata is DUP, but both copies have corruption. Kinda strange. But I
don't know how close the DUP copies are to each other, if possibly a
big enough media defect can explain this.

What do you get for smartctl -l scterc /dev/ (whole physical device,
not the dm device)

In the meantime, take the drive offline (umount it), and run smartctl
-t long, and after that finishes, smartctl -x. Attach that as a plain
text file, it should be small enough for the list to handle it, and
avoids reformatting problems.
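
The steps requested above can be written down as an ordered plan. This is only
a sketch in Python under stated assumptions: the device paths are placeholders
you must fill in, the function names are hypothetical, and the position of
"btrfs check" relative to the SMART self-test is my own choice (the message
only asks that the filesystem be unmounted first). The code builds and prints
the commands; it does not run anything.

```python
import shlex


def diagnostic_plan(disk, fs_device, mountpoint):
    """Return the commands suggested in this message, in order, as argv lists.

    disk:       whole physical device (placeholder, not the dm device)
    fs_device:  device holding the filesystem (the dm device in this thread)
    mountpoint: where the filesystem is currently mounted
    """
    return [
        ["smartctl", "-l", "scterc", disk],   # SCT error recovery control setting
        ["umount", mountpoint],               # take the filesystem offline first
        ["btrfs", "check", fs_device],        # no --repair: check only
        ["smartctl", "-t", "long", disk],     # starts the long self-test
        ["smartctl", "-x", disk],             # only once the long test has finished
    ]


def format_plan(plan):
    """Render each argv list as a shell-quoted line for copy and paste."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in plan]


if __name__ == "__main__":
    # All three paths below are placeholders.
    for line in format_plan(
        diagnostic_plan("/dev/sdX", "/dev/mapper/backup", "/media/Backup")
    ):
        print(line)
```

Note that the long self-test runs in the background on the drive, so the last
command has to wait until the test has completed.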


-- 
Chris Murphy

Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-17 Thread Tomasz Chmielewski


On 2017-11-18 10:08, Hans van Kranenburg wrote:
> On 11/18/2017 01:49 AM, Tomasz Chmielewski wrote:
>> I'm getting the following BUG when running balance on one of my
>> systems:
>>
>> [ 3458.698704] BTRFS info (device sdb3): relocating block group
>> 306045779968 flags data|raid1
>> [ 3466.892933] BTRFS info (device sdb3): found 2405 extents
>> [ 3495.408630] BTRFS info (device sdb3): found 2405 extents
>> [ 3498.161144] ------------[ cut here ]------------
>> [ 3498.161150] kernel BUG at
>> /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
>> [ 3498.161264] invalid opcode:  [#1] SMP
>>
>> (...)
>>
>> [ 3498.164523] Call Trace:
>> [ 3498.164694]  tree_advance+0x16e/0x1d0 [btrfs]
>> [ 3498.164874]  btrfs_compare_trees+0x2da/0x6a0 [btrfs]
>> [ 3498.165078]  ? process_extent+0x1580/0x1580 [btrfs]
>> [ 3498.165264]  btrfs_ioctl_send+0xe94/0x1120 [btrfs]
>
> It's using send + balance at the same time. There's something that makes
> btrfs explode when you do that.
>
> It's not new in 4.14, I have seen it in 4.7 and 4.9 also, various
> different explosions in kernel log. Since that happened, I made sure I
> never did those two things at the same time.


Indeed, send was started when balance was running.

Thanks for the hint.
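
One way to follow that advice (never run send and balance at the same time) is
to have both jobs take the same lock before they start. A minimal sketch in
Python, assuming both jobs are launched from scripts you control on Linux; the
lock path and the function name are hypothetical, and nothing here is part of
btrfs-progs:

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def exclusive_maintenance(lock_path):
    """Hold an exclusive advisory lock for the duration of the with-block.

    Raises BlockingIOError right away if another job already holds the lock,
    instead of waiting and then running concurrently by accident.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

The balance script and the send script would each wrap their btrfs invocation
in "with exclusive_maintenance(...)" using the same lock file; whichever starts
second fails immediately and can be retried later. This only serializes jobs
that cooperate by taking the lock; it does not stop someone running either
command by hand.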


Tomasz Chmielewski
https://lxadm.com

Re: 4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-17 Thread Hans van Kranenburg

On 11/18/2017 01:49 AM, Tomasz Chmielewski wrote:
> I'm getting the following BUG when running balance on one of my systems:
> 
> 
> [ 3458.698704] BTRFS info (device sdb3): relocating block group
> 306045779968 flags data|raid1
> [ 3466.892933] BTRFS info (device sdb3): found 2405 extents
> [ 3495.408630] BTRFS info (device sdb3): found 2405 extents
> [ 3498.161144] ------------[ cut here ]------------
> [ 3498.161150] kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!
> [ 3498.161264] invalid opcode:  [#1] SMP
> [ 3498.161363] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common
> xt_LOG xt_multiport xt_conntrack xt_nat binfmt_misc veth ip6table_filter
> xt_CHECKSUM iptable_mangle xt_tcpudp ip6t_MASQUERADE
> nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> nf_nat_ipv6 ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack iptable_filter ip_tables x_tables bridge stp llc intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd glue_helper cryptd intel_cstate hci_uart
> intel_rapl_perf btbcm input_leds serdev serio_raw btqca btintel
> bluetooth intel_pch_thermal intel_lpss_acpi intel_lpss mac_hid acpi_pad
> [ 3498.162060]  ecdh_generic acpi_als kfifo_buf industrialio autofs4
> btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy
> async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath
> linear raid1 e1000e psmouse ptp ahci pps_core libahci wmi
> pinctrl_sunrisepoint i2c_hid video pinctrl_intel hid
> [ 3498.162386] CPU: 7 PID: 29041 Comm: btrfs Not tainted
> 4.14.0-041400-generic #201711122031
> [ 3498.162545] Hardware name: FUJITSU  /D3401-H2, BIOS V5.0.0.12 R1.5.0
> for D3401-H2x 02/27/2017
> [ 3498.162723] task: 8d7858e82f00 task.stack: b4ee47d5c000
> [ 3498.162890] RIP: 0010:read_node_slot+0xd7/0xe0 [btrfs]
> [ 3498.163027] RSP: 0018:b4ee47d5fb88 EFLAGS: 00010246
> [ 3498.163156] RAX: 8d78c8bb7000 RBX: 8d8124abd380 RCX:
> 0001
> [ 3498.163290] RDX: 0048 RSI: 8d7ae1fef6f8 RDI:
> 8d8124aa
> [ 3498.163422] RBP: b4ee47d5fba8 R08: 0001 R09:
> 8d8124abd384
> [ 3498.163555] R10: 0001 R11: 00114000 R12:
> 0002
> [ 3498.163689] R13: b4ee47d5fc66 R14: b4ee47d5fc50 R15:
> 
> [ 3498.163825] FS:  7fa4c9a998c0() GS:8d816e5c()
> knlGS:
> [ 3498.163990] CS:  0010 DS:  ES:  CR0: 80050033
> [ 3498.164120] CR2: 56410155a028 CR3: 0009c194c002 CR4:
> 003606e0
> [ 3498.164255] DR0:  DR1:  DR2:
> 
> [ 3498.164390] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [ 3498.164523] Call Trace:
> [ 3498.164694]  tree_advance+0x16e/0x1d0 [btrfs]
> [ 3498.164874]  btrfs_compare_trees+0x2da/0x6a0 [btrfs]
> [ 3498.165078]  ? process_extent+0x1580/0x1580 [btrfs]
> [ 3498.165264]  btrfs_ioctl_send+0xe94/0x1120 [btrfs]

It's using send + balance at the same time. There's something that makes
btrfs explode when you do that.

It's not new in 4.14, I have seen it in 4.7 and 4.9 also, various
different explosions in kernel log. Since that happened, I made sure I
never did those two things at the same time.

> [ 3498.165450]  btrfs_ioctl+0x93c/0x1f00 [btrfs]
> [ 3498.165587]  ? enqueue_task_fair+0xa8/0x6c0
> [ 3498.165724]  do_vfs_ioctl+0xa5/0x600
> [ 3498.165854]  ? do_vfs_ioctl+0xa5/0x600
> [ 3498.165979]  ? _do_fork+0x144/0x3a0
> [ 3498.166103]  SyS_ioctl+0x79/0x90
> [ 3498.166234]  entry_SYSCALL_64_fastpath+0x1e/0xa9
> [ 3498.166368] RIP: 0033:0x7fa4c8b17f07
> [ 3498.166488] RSP: 002b:7ffd33644e38 EFLAGS: 0202 ORIG_RAX:
> 0010
> [ 3498.166653] RAX: ffda RBX: 7fa4c8a1a700 RCX:
> 7fa4c8b17f07
> [ 3498.166787] RDX: 7ffd33644f30 RSI: 40489426 RDI:
> 0004
> [ 3498.166921] RBP: 7ffd33644dc0 R08:  R09:
> 7fa4c8a1a700
> [ 3498.167055] R10: 7fa4c8a1a9d0 R11: 0202 R12:
> 
> [ 3498.167190] R13: 7ffd33644dbf R14: 7fa4c8a1a9c0 R15:
> 0129f020
> [ 3498.167326] Code: 48 c7 c3 fb ff ff ff e8 f8 5c 05 00 48 89 d8 5b 41
> 5c 41 5d 41 5e 5d c3 48 c7 c3 fe ff ff ff 48 89 d8 5b 41 5c 41 5d 41 5e
> 5d c3 <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41
> [ 3498.167690] RIP: read_node_slot+0xd7/0xe0 [btrfs] RSP: b4ee47d5fb88
> [ 3498.167892] ---[ end trace 6a751a3020dd3086 ]---
> [ 3499.572729] BTRFS info (device sdb3): relocating block group
> 304972038144 flags data|raid1
> [ 3504.068432] BTRFS info (device sdb3): found 2037 extents
> [ 3538.281808] BTRFS info (device sdb3): found 2037 extents

-- 
Hans van Kranenburg

4.14 balance: kernel BUG at /home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

2017-11-17 Thread Tomasz Chmielewski


I'm getting the following BUG when running balance on one of my systems:


[ 3458.698704] BTRFS info (device sdb3): relocating block group 
306045779968 flags data|raid1

[ 3466.892933] BTRFS info (device sdb3): found 2405 extents
[ 3495.408630] BTRFS info (device sdb3): found 2405 extents
[ 3498.161144] ------------[ cut here ]------------
[ 3498.161150] kernel BUG at 
/home/kernel/COD/linux/fs/btrfs/ctree.c:1856!

[ 3498.161264] invalid opcode:  [#1] SMP
[ 3498.161363] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common 
xt_LOG xt_multiport xt_conntrack xt_nat binfmt_misc veth ip6table_filter 
xt_CHECKSUM iptable_mangle xt_tcpudp ip6t_MASQUERADE 
nf_nat_masquerade_ipv6 ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 
nf_nat_ipv6 ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_filter ip_tables x_tables bridge stp llc intel_rapl 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
aes_x86_64 crypto_simd glue_helper cryptd intel_cstate hci_uart 
intel_rapl_perf btbcm input_leds serdev serio_raw btqca btintel 
bluetooth intel_pch_thermal intel_lpss_acpi intel_lpss mac_hid acpi_pad
[ 3498.162060]  ecdh_generic acpi_als kfifo_buf industrialio autofs4 
btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath 
linear raid1 e1000e psmouse ptp ahci pps_core libahci wmi 
pinctrl_sunrisepoint i2c_hid video pinctrl_intel hid
[ 3498.162386] CPU: 7 PID: 29041 Comm: btrfs Not tainted 
4.14.0-041400-generic #201711122031
[ 3498.162545] Hardware name: FUJITSU  /D3401-H2, BIOS V5.0.0.12 R1.5.0 
for D3401-H2x 02/27/2017

[ 3498.162723] task: 8d7858e82f00 task.stack: b4ee47d5c000
[ 3498.162890] RIP: 0010:read_node_slot+0xd7/0xe0 [btrfs]
[ 3498.163027] RSP: 0018:b4ee47d5fb88 EFLAGS: 00010246
[ 3498.163156] RAX: 8d78c8bb7000 RBX: 8d8124abd380 RCX: 
0001
[ 3498.163290] RDX: 0048 RSI: 8d7ae1fef6f8 RDI: 
8d8124aa
[ 3498.163422] RBP: b4ee47d5fba8 R08: 0001 R09: 
8d8124abd384
[ 3498.163555] R10: 0001 R11: 00114000 R12: 
0002
[ 3498.163689] R13: b4ee47d5fc66 R14: b4ee47d5fc50 R15: 

[ 3498.163825] FS:  7fa4c9a998c0() GS:8d816e5c() 
knlGS:

[ 3498.163990] CS:  0010 DS:  ES:  CR0: 80050033
[ 3498.164120] CR2: 56410155a028 CR3: 0009c194c002 CR4: 
003606e0
[ 3498.164255] DR0:  DR1:  DR2: 

[ 3498.164390] DR3:  DR6: fffe0ff0 DR7: 
0400

[ 3498.164523] Call Trace:
[ 3498.164694]  tree_advance+0x16e/0x1d0 [btrfs]
[ 3498.164874]  btrfs_compare_trees+0x2da/0x6a0 [btrfs]
[ 3498.165078]  ? process_extent+0x1580/0x1580 [btrfs]
[ 3498.165264]  btrfs_ioctl_send+0xe94/0x1120 [btrfs]
[ 3498.165450]  btrfs_ioctl+0x93c/0x1f00 [btrfs]
[ 3498.165587]  ? enqueue_task_fair+0xa8/0x6c0
[ 3498.165724]  do_vfs_ioctl+0xa5/0x600
[ 3498.165854]  ? do_vfs_ioctl+0xa5/0x600
[ 3498.165979]  ? _do_fork+0x144/0x3a0
[ 3498.166103]  SyS_ioctl+0x79/0x90
[ 3498.166234]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[ 3498.166368] RIP: 0033:0x7fa4c8b17f07
[ 3498.166488] RSP: 002b:7ffd33644e38 EFLAGS: 0202 ORIG_RAX: 
0010
[ 3498.166653] RAX: ffda RBX: 7fa4c8a1a700 RCX: 
7fa4c8b17f07
[ 3498.166787] RDX: 7ffd33644f30 RSI: 40489426 RDI: 
0004
[ 3498.166921] RBP: 7ffd33644dc0 R08:  R09: 
7fa4c8a1a700
[ 3498.167055] R10: 7fa4c8a1a9d0 R11: 0202 R12: 

[ 3498.167190] R13: 7ffd33644dbf R14: 7fa4c8a1a9c0 R15: 
0129f020
[ 3498.167326] Code: 48 c7 c3 fb ff ff ff e8 f8 5c 05 00 48 89 d8 5b 41 
5c 41 5d 41 5e 5d c3 48 c7 c3 fe ff ff ff 48 89 d8 5b 41 5c 41 5d 41 5e 
5d c3 <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41
[ 3498.167690] RIP: read_node_slot+0xd7/0xe0 [btrfs] RSP: 
b4ee47d5fb88

[ 3498.167892] ---[ end trace 6a751a3020dd3086 ]---
[ 3499.572729] BTRFS info (device sdb3): relocating block group 
304972038144 flags data|raid1

[ 3504.068432] BTRFS info (device sdb3): found 2037 extents
[ 3538.281808] BTRFS info (device sdb3): found 2037 extents



Tomasz Chmielewski
https://lxadm.com

Re: btrfs check: add_missing_dir_index: BUG_ON `ret` triggered, value -17

2017-11-17 Thread Qu Wenruo



On 2017-11-17 23:50, Marc MERLIN wrote:
> On Fri, Nov 17, 2017 at 04:12:07PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2017-11-17 15:30, Marc MERLIN wrote:
>>> Here's the whole output:
>>> gargamel:~# btrfs-debug-tree -t 258 /dev/mapper/raid0d1 | grep 1919805647
>>
>> Sorry, I missed "-C10" parameter for grep.
>  
> generation 2231977 transid 2237084 size 64 nbytes 0
> block group 0 mode 40755 links 1 uid 33 gid 33 rdev 0
> sequence 0 flags 0x1710(none)
> atime 1510290002.516060162 (2017-11-09 21:00:02)
> ctime 1510477350.88506455 (2017-11-12 01:02:30)
> mtime 1510477350.88506455 (2017-11-12 01:02:30)
> otime 1510290002.516060162 (2017-11-09 21:00:02)
> item 26 key (1919785864 INODE_REF 1919785862) itemoff 14683 itemsize 
> 12
> index 2 namelen 2 name: 00
> item 27 key (1919785864 DIR_ITEM 2591417872) itemoff 14637 itemsize 46
> location key (1919805647 INODE_ITEM 0) type FILE
> transid 2231988 data_len 0 name_len 16
> name: 1955-capture.jpg

OK, this DIR_ITEM matches with INODE_REF.
So btrfs-check should only need to insert DIR_INDEX for it.

> item 28 key (1919785864 DIR_ITEM 3406016191) itemoff 14591 itemsize 46
> location key (1919805657 INODE_ITEM 0) type FILE
> transid 2231988 data_len 0 name_len 16
> name: 1956-capture.jpg
> item 29 key (1919785864 DIR_INDEX 1957) itemoff 14575 itemsize 16
> location key (7383370114097217536 UNKNOWN.211 
> 15651972432879681580) type DIR_ITEM.0
> transid 72057594045427176 data_len 0 name_len 0
> name: 
> item 30 key (1919805647 INODE_ITEM 0) itemoff 14415 itemsize 160
> generation 2231988 transid 2231989 size 81701 nbytes 81920
> block group 0 mode 100644 links 1 uid 33 gid 33 rdev 0
> sequence 8 flags 0x14(NOCOMPRESS)
> atime 1510290392.703320623 (2017-11-09 21:06:32)
> ctime 1510290392.715320477 (2017-11-09 21:06:32)
> mtime 1510290392.715320477 (2017-11-09 21:06:32)
> otime 1510290392.703320623 (2017-11-09 21:06:32)
> item 31 key (1919805647 INODE_REF 1919785864) itemoff 14389 itemsize 
> 26
> index 1957 namelen 16 name: 1955-capture.jpg
> item 32 key (1919805647 EXTENT_DATA 0) itemoff 14336 itemsize 53
> generation 2231989 type 1 (regular)
> extent data disk byte 2381649588224 nr 81920
> extent data offset 0 nr 81920 ram 81920
> extent compression 0 (none)
> item 33 key (1919805657 INODE_ITEM 0) itemoff 14176 itemsize 160
> generation 2231988 transid 2231989 size 81856 nbytes 81920
> block group 0 mode 100644 links 1 uid 33 gid 33 rdev 0
> sequence 8 flags 0x14(NOCOMPRESS)
> atime 1510290392.919317997 (2017-11-09 21:06:32)
> ctime 1510290392.931317852 (2017-11-09 21:06:32)

No extra interesting things here.

> 
> 
>> Although what we could try is to avoid BUG_ON(), but I'm afraid the
>> problem is more severe than my expectation.
>  
> How does it look now?

At least we know what btrfs check should do.
I could dig it a little deeper to see if we could fix it.
(Or something strange happened again)
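
For readers decoding the value in the subject line: -17 is -EEXIST on Linux.
The sketch below is a toy model in Python, not btrfs-progs code. It shows one
possible reading of the dump, which nobody in this thread has confirmed: item
29 already occupies the key (1919785864 DIR_INDEX 1957) with a garbage
payload, so an attempt to insert the missing DIR_INDEX at that same key would
be refused. The helper names and the dictionary "tree" are hypothetical.

```python
import errno
import os


def describe_ret(ret):
    """Map a negative kernel-style return value to (errno name, message)."""
    code = -ret
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def insert_item(tree, key, item):
    """Toy insert: refuse to overwrite an existing key, like a -EEXIST return."""
    if key in tree:
        return -errno.EEXIST
    tree[key] = item
    return 0


# Toy stand-in for the directory items in the dump above. The key is the one
# that the INODE_REF of 1955-capture.jpg (index 1957) points at.
tree = {(1919785864, "DIR_INDEX", 1957): "corrupted payload"}
ret = insert_item(tree, (1919785864, "DIR_INDEX", 1957), "1955-capture.jpg")
print(ret, describe_ret(ret))
```

If that reading is right, the repair path would have to replace the corrupted
item rather than insert next to it, but that is a guess on top of a guess.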

Thanks
Qu

> 
>> Any idea how such corruption happened?
> 
> Sigh, I wish I knew.
> 
> It feels like every btrfs filesystem I've had between my 3 systems has
> gotten inexplicably corrupted at least once.
> This one is not even using bcache, just dmcrypt underneath.
> 
> It's my only one using btrfs raid (1):
> gargamel:~# btrfs fi show /dev/mapper/raid0d1 
> Label: 'btrfs_space'  uuid: 01334b81-c0db-4e80-92e4-cac4da867651
> Total devices 2 FS bytes used 1.12TiB
> devid1 size 836.13GiB used 722.03GiB path /dev/mapper/raid0d1
> devid2 size 836.13GiB used 722.03GiB path /dev/mapper/raid0d2
> 
> Data, RAID0: total=38TiB, used=1.11TiB
> System, RAID1: total2.00MiB, used8.00KiB
> Metadata, RAID1: total.00GiB, used=8.54GiB
> GlobalReserve, single: totalQ2.00MiB, used=0.00B
> 
> Now, I didn't get errors or warnings, or even scrub warnings on it, I just 
> ran 
> btrfs check to see what would happen.
> 
> Marc
> 




Re: [RFC PATCH 0/8] btrfs iomap support

2017-11-17 Thread Goldwyn Rodrigues



On 11/17/2017 12:45 PM, Nikolay Borisov wrote:
> 
> 
> On 17.11.2017 19:44, Goldwyn Rodrigues wrote:
>> This patch series attempts to use kernels iomap for btrfs. Currently,
>> it covers buffered writes only, but I intend to add some other iomap
>> uses once this gets through. I am sending this as an RFC because I
>> would like to find ways to improve the solution since some changes
>> require adding more functions to the iomap infrastructure which I
>> would try to avoid. I still have to remove some kinks as well such
>> as -o compress. I have posted some questions in the individual
>> patches and would appreciate some input to those.
>>
>> Some of the problems I faced are:
>>
>> 1. extent locking: While we perform the extent locking for writes,
>> we need to perform any reads because of non-page-aligned calls before
>> locking can be done. This requires reading the page, increasing their
>> pagecount and "letting it go". The iomap infrastructure uses
>> buffer_heads whereas btrfs uses bio and hence needs to call readpage
>> exclusively. The "letting it go" part makes me somewhat nervous of
>> conflicting reads/writes, even though we are protected under i_rwsem.
>> Is readpage_nolock() a good idea? The extent locking sequence is a
>> bit weird, with locks and unlock happening in different functions.
> 
> Is there some inherent requirement in iomap's design that necessitates
> the usage of buffer heads? I thought the trend is for buffer_head to
> eventually die out. Given that iomap is fairly recent (2-3 years?) I
> find it odd it's relying on buffer heads.
> 

No, there is no inherent reason that I can see, other than legacy. iomap was
carved out of existing filesystems such as xfs, which traditionally use
buffer_heads. In any case, buffer heads perform I/O to individual pages
independently. iomap calls existing functions which use buffer heads.

>>
>> 2. btrfs pages use PagePrivate to store EXTENT_PAGE_PRIVATE which is not 
>> used anywhere.
>> However, a PagePrivate flag is used for try_to_release_buffers(). Can
>> we do away with PagePrivate for data pages? The same with PageChecked.
>> How and why is it used (I guess -o compress)
>>
>> 3. I had to stick information which will be required from iomap_begin()
>> to iomap_end() in btrfs_iomap which is a pointer in btrfs_inode. Is
>> there any other place/way we can transmit this information. XFS only
>> performs allocations and deallocations so it just relies on bmap code
>> for it.
>>
>> Suggestions/Criticism welcome.
>>

-- 
Goldwyn

[PATCH 3/8] fs: btrfs: remove unused hardirq.h

2017-11-17 Thread Yang Shi

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/extent_map.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 2e348fb..cced7f1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -2,7 +2,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "ctree.h"
 #include "extent_map.h"
 #include "compression.h"
-- 
1.8.3.1


Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-17 Thread Liu Bo

On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> If two programs simultaneously try to write to the same part of a file
> via direct IO and buffered IO, there's a chance that the post-diowrite
> pagecache invalidation will fail on the dirty page.  When this happens,
> the dio write succeeded, which means that the page cache is no longer
> coherent with the disk!
> 
> Programs are not supposed to mix IO types and this is a clear case of
> data corruption, so store an EIO which will be reflected to userspace
> during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
> so that the developers have /some/ kind of breadcrumb to track down the
> offending program(s) and file(s) involved.
>

Looks good to me, thanks for addressing the warning.

Reviewed-by: Liu Bo 

Thanks,

-liubo
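
For readers unfamiliar with the ratelimited pr_crit in the patch quoted in this
message, the idea is that at most a fixed burst of messages gets through per
interval and the rest are counted as suppressed. Below is a simplified model in
Python, not the kernel implementation; the class name is made up and the burst
value is whatever you pass in.

```python
import time


class RateLimit:
    """Allow at most `burst` events per `interval` seconds, then suppress."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock   # injectable so the behaviour can be tested
        self.begin = None    # start of the current window
        self.printed = 0     # events let through in this window
        self.missed = 0      # events suppressed so far (never reset here)

    def allow(self):
        now = self.clock()
        if self.begin is None or now - self.begin >= self.interval:
            # A new window starts: the burst budget is available again.
            self.begin = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False
```

With a 30 second interval, as in the patch, a program that keeps colliding
with buffered writes would produce a short burst of messages and then stay
quiet until the next window opens, instead of flooding the log.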

> Signed-off-by: Darrick J. Wong 
> ---
>  fs/direct-io.c |   24 +++-
>  fs/iomap.c |   12 ++--
>  include/linux/fs.h |1 +
>  3 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 98fe132..ef5d12a 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -219,6 +219,27 @@ static inline struct page *dio_get_page(struct dio *dio,
>   return dio->pages[sdio->head];
>  }
>  
> +/*
> + * Warn about a page cache invalidation failure during a direct io write.
> + */
> +void dio_warn_stale_pagecache(struct file *filp)
> +{
> + static DEFINE_RATELIMIT_STATE(_rs, 30 * HZ, DEFAULT_RATELIMIT_BURST);
> + char pathname[128];
> + struct inode *inode = file_inode(filp);
> + char *path;
> +
> + errseq_set(&inode->i_mapping->wb_err, -EIO);
> + if (__ratelimit(&_rs)) {
> + path = file_path(filp, pathname, sizeof(pathname));
> + if (IS_ERR(path))
> + path = "(unknown)";
> + pr_crit("Page cache invalidation failure on direct I/O.  
> Possible data corruption due to collision with buffered I/O!\n");
> + pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
> + current->comm);
> + }
> +}
> +
>  /**
>   * dio_complete() - called when all DIO BIO I/O has been completed
>   * @offset: the byte offset in the file of the completed operation
> @@ -290,7 +311,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, 
> unsigned int flags)
>   err = invalidate_inode_pages2_range(dio->inode->i_mapping,
>   offset >> PAGE_SHIFT,
>   (offset + ret - 1) >> PAGE_SHIFT);
> - WARN_ON_ONCE(err);
> + if (err)
> + dio_warn_stale_pagecache(dio->iocb->ki_filp);
>   }
>  
>   if (!(dio->flags & DIO_SKIP_DIO_COUNT))
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 5011a96..028f329 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -753,7 +753,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   err = invalidate_inode_pages2_range(inode->i_mapping,
>   offset >> PAGE_SHIFT,
>   (offset + dio->size - 1) >> PAGE_SHIFT);
> - WARN_ON_ONCE(err);
> + if (err)
> + dio_warn_stale_pagecache(iocb->ki_filp);
>   }
>  
>   inode_dio_end(file_inode(iocb->ki_filp));
> @@ -1012,9 +1013,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   if (ret)
>   goto out_free_dio;
>  
> + /*
> +  * Try to invalidate cache pages for the range we're direct
> +  * writing.  If this invalidation fails, tough, the write will
> +  * still work, but racing two incompatible write paths is a
> +  * pretty crazy thing to do, so we don't support it 100%.
> +  */
>   ret = invalidate_inode_pages2_range(mapping,
>   start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> - WARN_ON_ONCE(ret);
> + if (ret)
> + dio_warn_stale_pagecache(iocb->ki_filp);
>   ret = 0;
>  
>   if (iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb) &&
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2690864..0e5f060 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2976,6 +2976,7 @@ enum {
>  };
>  
>  void dio_end_io(struct bio *bio);
> +void dio_warn_stale_pagecache(struct file *filp);
>  
>  ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>struct block_device *bdev, struct iov_iter *iter,
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at

Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-17 Thread Kai Krakow

On Fri, 17 Nov 2017 06:51:52 +0300,
Andrei Borzenkov wrote:

> 16.11.2017 19:13, Kai Krakow wrote:
> ...
> > BTW: From user API perspective, btrfs snapshots do not guarantee
> > perfect granular consistent backups.
> 
> Is it documented somewhere? I was relying on crash-consistent
> write-order-preserving snapshots in NetApp for as long as I remember.
> And I was sure btrfs offers it as it is something obvious for
> redirect-on-write idea.

I think it has ordering guarantees, but it is not as atomic in time as
one might think. That's the point. But devs may tell better.


> > A user-level file transaction may
> > still end up only partially in the snapshot. If you are running
> > transaction sensitive applications, those usually do provide some
> > means of preparing a freeze and a thaw of transactions.
> >   
> 
> Is snapshot creation synchronous to know when to thaw?

I think you could do "btrfs snap create", then "btrfs fs sync", and
everything should be fine.


> > I think the user transactions API which could've been used for this
> > will even be removed during the next kernel cycles. I remember
> > reiserfs4 tried to deploy something similar. But there's no
> > consistent layer in the VFS for subscribing applications to
> > filesystem snapshots so they could prepare and notify the kernel
> > when they are ready. 
> 
> I do not see what VFS has to do with it. NetApp works by simply
> preserving previous consistency point instead of throwing it away.
> I.e. snapshot is always last committed image on stable storage. Would
> something like this be possible on btrfs level by duplicating current
> on-disk root (sorry if I use wrong term)?

I think btrfs gives the same consistency. But the moment you issue
"btrfs snap create" may delay snapshot creation a little bit. So if
your application relies on exact point in time snapshots, you need to
ensure synchronizing your application to the filesystem. I think the
same is true for NetApp.

I just wanted to point that out because it may not be obvious, given
that btrfs snapshot creation is built right into the tool chain of
the filesystem itself, unlike e.g. NetApp or LVM, or other storage layers.

Background: A good while back I was told that btrfs snapshots during
ongoing IO may result in some of the later IO carried over to before
the snapshot. Transactional ordering of IO operations is still
guaranteed but it may overlap with snapshot creation. So you can still
lose a transaction you didn't expect to lose at that point in time.

So I understood this as:

If you just want to ensure transactional integrity of your database,
you are all fine with btrfs snapshots.

But if you want to ensure that a just finished transaction makes it
into the snapshot completely, you have to sync the processes.

However, things may have changed since then.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs progs pre-release 4.14-rc1

2017-11-17 Thread Nick Terrell

> On Nov 17, 2017, at 8:48 AM, Uli Heller  wrote:
> 
> I tried to compile these on ubuntu-14.04 running 4.4.0-98-generic
> - I had to install libzstd-dev
> - I had to disable the zstd tests
> 
> Do you think this ends up in a working binary? Or am I doing something 
> strange?

You'll need to install libzstd-dev to compile with zstd enabled, since you
need the libraries and headers. What version of libzstd do you have? You'll
need a version >= 0.8.1 in order to read the compressed data.

On the zstd side, we're working on getting the version of zstd in Ubuntu
16.04 upgraded from 0.5.1 to a version >= 1.0.0 [1]. zstd-0.5.1 was
released before the format stabilized, and can't read data produced by zstd
>= 0.6.

I would expect the zstd tests to fail if the zstd version is < 0.8.1.
Otherwise, could you tell me which version of zstd you have, and I'll
attempt to reproduce the issue.

[1] https://bugs.launchpad.net/xenial-backports/+bug/1717040

-Nick
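
The version comparison Nick describes can be sketched in C. This is only an illustration: the packing (major * 10000 + minor * 100 + release) is the one zstd uses for ZSTD_VERSION_NUMBER and reports at run time through ZSTD_versionNumber(), the 0.8.1 threshold is taken from the message above, and the helper names are hypothetical. The sketch deliberately does not link against libzstd.

```c
#include <assert.h>

/* Pack a version the way zstd packs ZSTD_VERSION_NUMBER:
 * major * 10000 + minor * 100 + release. */
unsigned zstd_version_number(unsigned major, unsigned minor, unsigned release)
{
    assert(minor < 100 && release < 100);
    return major * 10000u + minor * 100u + release;
}

/* Hypothetical helper: is this libzstd new enough for the compressed data?
 * The 0.8.1 threshold comes from the message above, not from any header. */
int zstd_new_enough(unsigned version_number)
{
    return version_number >= zstd_version_number(0, 8, 1);
}
```

In a real build one would pass the result of ZSTD_versionNumber() to the check; with the 0.5.1 library mentioned above the check fails, which matches the expectation that the zstd tests fail there.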

WARN_ON in __writeback_inodes_sb_nr

2017-11-17 Thread David Ahern

I see a backtrace booting 4.14+ on a mellanox switch. The trace is due
to the WARN_ON in __writeback_inodes_sb_nr.

[   40.958590] WARNING: CPU: 0 PID: 183 at
/home/dsa/kernel-2.git/fs/fs-writeback.c:2339
__writeback_inodes_sb_nr+0x8a/0x90
[   40.958593] Modules linked in: ebtable_filter ebtables ip6table_raw
ip6table_mangle ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support
coretemp x86_pkg_temp_thermal kvm_intel kvm irqbypass lpc_ich mfd_core
battery mei_me mei shpchp lm75 regmap_i2c pmbus pmbus_core i2c_dev
i2c_mux k10temp mpls_iptunnel mpls_router ip_tunnel at24 tun
br_netfilter bonding loop autofs4 dm_mod dax crc32c_intel i2c_i801
i2c_core thermal mlxsw_spectrum psample parman bridge stp llc mlxsw_pci
mlxsw_core mlxfw devlink e1000e hwmon
[   40.958664] CPU: 0 PID: 183 Comm: btrfs-transacti Not tainted 4.14.0+ #38
[   40.958666] Hardware name: Mellanox Technologies Ltd. Mellanox
switch/Mellanox switch, BIOS 4.6.5 05/21/2015
[   40.958669] task: 8803db266540 task.stack: c9344000
[   40.958675] RIP: 0010:__writeback_inodes_sb_nr+0x8a/0x90
[   40.958677] RSP: 0018:c9347dc8 EFLAGS: 00010246
[   40.958681] RAX:  RBX: 88040a641800 RCX:

[   40.958683] RDX: 0002 RSI: 45a7 RDI:
c9347e10
[   40.958685] RBP: c9347e20 R08: 0003 R09:
c9347dd0
[   40.958687] R10: 8803db287000 R11:  R12:
c9347dcc
[   40.958689] R13: 8804015e7a00 R14: 8803dade2bc8 R15:
880400a68000
[   40.958692] FS:  () GS:88041dc0()
knlGS:
[   40.958695] CS:  0010 DS:  ES:  CR0: 80050033
[   40.958697] CR2: 5602d78aad50 CR3: 01e09002 CR4:
001606f0
[   40.958699] Call Trace:
[   40.958710]  writeback_inodes_sb+0x27/0x30
[   40.958719]  btrfs_commit_transaction+0x7b2/0x8e0
[   40.958723]  ? start_transaction+0x9e/0x450
[   40.958728]  transaction_kthread+0x177/0x1b0
[   40.958735]  kthread+0x11d/0x150
[   40.958740]  ? btrfs_cleanup_transaction+0x4f0/0x4f0
[   40.958744]  ? kthread_associate_blkcg+0xb0/0xb0
[   40.958751]  ret_from_fork+0x24/0x30
[   40.958754] Code: 8b 42 70 48 85 c0 74 23 4c 89 ce 48 89 df 41 0f b6
d3 e8 fa fc ff ff 4c 89 e6 48 89 df e8 8f de ff ff 48 83 c4 48 5b 41 5c
5d c3 <0f> ff eb d9 66 90 0f 1f 44 00 00 55 31 c9 48 89 e5 e8 60 ff ff
[   40.958826] ---[ end trace defabeb7afdfd414 ]---


Tree is DaveM's net-next, but it was recently merged with Linus' tree at:

commit 6363b3f3ac5be096d08c8c504128befa0c033529
Merge: 1b6115fbe3b3 6297fabd93f9
Author: Linus Torvalds 
Date:   Wed Nov 15 15:12:28 2017 -0800

Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi


[PATCH] btrfs: clear space cache inode generation always

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

We discovered a box that had double allocations, and suspected the space
cache may be to blame.  While auditing the write out path I noticed that
if we've already setup the space cache we will just carry on.  This
means that any error we hit after cache_save_setup before we go to
actually write the cache out we won't reset the inode generation, so
whatever was already written will be considered correct, except it'll be
stale.  Fix this by _always_ resetting the generation on the block group
inode, this way we only ever have valid or invalid cache.

With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e2d7e86b51d1..67b26b209e23 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3526,13 +3526,6 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
goto again;
}
 
-   /* We've already setup this transaction, go ahead and exit */
-   if (block_group->cache_generation == trans->transid &&
-   i_size_read(inode)) {
-   dcs = BTRFS_DC_SETUP;
-   goto out_put;
-   }
-
/*
 * We want to set the generation to 0, that way if anything goes wrong
 * from here on out we know not to trust this cache when we load up next
@@ -3556,6 +3549,13 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
}
WARN_ON(ret);
 
+   /* We've already setup this transaction, go ahead and exit */
+   if (block_group->cache_generation == trans->transid &&
+   i_size_read(inode)) {
+   dcs = BTRFS_DC_SETUP;
+   goto out_put;
+   }
+
if (i_size_read(inode) > 0) {
ret = btrfs_check_trunc_cache_free_space(fs_info,
&fs_info->global_block_rsv);
-- 
2.7.5

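
The ordering change in this patch can be illustrated with a small userspace model. It is a sketch under stated assumptions, not btrfs code: the struct, its fields and the model_* functions are hypothetical, and the only thing it tries to show is that the generation is reset before the "already set up" shortcut, so a failed write-out can never leave a stale cache looking valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one block group's space cache bookkeeping (not btrfs code). */
struct cache_model {
    uint64_t inode_generation; /* generation on the cache inode; 0 = do not trust */
    uint64_t cache_generation; /* transaction that last ran the setup step */
    uint64_t size;             /* bytes of cache content */
};

/* Setup step: invalidate first, only then take the "already set up" shortcut. */
int model_cache_save_setup(struct cache_model *c, uint64_t transid)
{
    assert(c != NULL);
    c->inode_generation = 0;        /* always reset, even on the shortcut path */
    if (c->cache_generation == transid && c->size > 0)
        return 1;                   /* already set up in this transaction */
    c->cache_generation = transid;
    c->size = 4096;                 /* arbitrary placeholder size */
    return 0;
}

/* Write-out step: stamp the generation only if the write succeeded. */
int model_cache_write_out(struct cache_model *c, uint64_t transid, int io_error)
{
    assert(c != NULL);
    if (io_error)
        return -1;                  /* generation stays 0: cache is invalid */
    c->inode_generation = transid;
    return 0;
}

/* Load step: trust the cache only if the expected transaction stamped it. */
int model_cache_is_valid(const struct cache_model *c, uint64_t transid)
{
    assert(c != NULL);
    return c->inode_generation != 0 && c->inode_generation == transid;
}
```

In this model a second setup in the same transaction followed by a failed write-out leaves the cache invalid, which is the "only ever valid or invalid" property the commit message asks for.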

[PATCH v2] iomap: report collisions between directio and buffered writes to userspace

2017-11-17 Thread Darrick J. Wong

From: Darrick J. Wong 

If two programs simultaneously try to write to the same part of a file
via direct IO and buffered IO, there's a chance that the post-diowrite
pagecache invalidation will fail on the dirty page.  When this happens,
the dio write succeeded, which means that the page cache is no longer
coherent with the disk!

Programs are not supposed to mix IO types and this is a clear case of
data corruption, so store an EIO which will be reflected to userspace
during the next fsync.  Replace the WARN_ON with a ratelimited pr_crit
so that the developers have /some/ kind of breadcrumb to track down the
offending program(s) and file(s) involved.

Signed-off-by: Darrick J. Wong 
---
 fs/direct-io.c |   24 +++-
 fs/iomap.c |   12 ++--
 include/linux/fs.h |1 +
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 98fe132..ef5d12a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -219,6 +219,27 @@ static inline struct page *dio_get_page(struct dio *dio,
return dio->pages[sdio->head];
 }
 
+/*
+ * Warn about a page cache invalidation failure during a direct io write.
+ */
+void dio_warn_stale_pagecache(struct file *filp)
+{
+   static DEFINE_RATELIMIT_STATE(_rs, 30 * HZ, DEFAULT_RATELIMIT_BURST);
+   char pathname[128];
+   struct inode *inode = file_inode(filp);
+   char *path;
+
+   errseq_set(&inode->i_mapping->wb_err, -EIO);
+   if (__ratelimit(&_rs)) {
+   path = file_path(filp, pathname, sizeof(pathname));
+   if (IS_ERR(path))
+   path = "(unknown)";
+   pr_crit("Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!\n");
+   pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid,
+   current->comm);
+   }
+}
+
 /**
  * dio_complete() - called when all DIO BIO I/O has been completed
  * @offset: the byte offset in the file of the completed operation
@@ -290,7 +311,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
err = invalidate_inode_pages2_range(dio->inode->i_mapping,
offset >> PAGE_SHIFT,
(offset + ret - 1) >> PAGE_SHIFT);
-   WARN_ON_ONCE(err);
+   if (err)
+   dio_warn_stale_pagecache(dio->iocb->ki_filp);
}
 
if (!(dio->flags & DIO_SKIP_DIO_COUNT))
diff --git a/fs/iomap.c b/fs/iomap.c
index 5011a96..028f329 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -753,7 +753,8 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
err = invalidate_inode_pages2_range(inode->i_mapping,
offset >> PAGE_SHIFT,
(offset + dio->size - 1) >> PAGE_SHIFT);
-   WARN_ON_ONCE(err);
+   if (err)
+   dio_warn_stale_pagecache(iocb->ki_filp);
}
 
inode_dio_end(file_inode(iocb->ki_filp));
@@ -1012,9 +1013,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
 
+   /*
+* Try to invalidate cache pages for the range we're direct
+* writing.  If this invalidation fails, tough, the write will
+* still work, but racing two incompatible write paths is a
+* pretty crazy thing to do, so we don't support it 100%.
+*/
ret = invalidate_inode_pages2_range(mapping,
start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-   WARN_ON_ONCE(ret);
+   if (ret)
+   dio_warn_stale_pagecache(iocb->ki_filp);
ret = 0;
 
if (iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb) &&
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2690864..0e5f060 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2976,6 +2976,7 @@ enum {
 };
 
 void dio_end_io(struct bio *bio);
+void dio_warn_stale_pagecache(struct file *filp);
 
 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 struct block_device *bdev, struct iov_iter *iter,
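
The patch limits its pr_crit output with DEFINE_RATELIMIT_STATE(_rs, 30 * HZ, DEFAULT_RATELIMIT_BURST) and __ratelimit(&_rs). The interval/burst idea behind that can be sketched in userspace C. This is a simplified model with hypothetical names (struct rl_state, rl_allow) and an explicit clock argument; the kernel's real implementation in lib/ratelimit.c differs in detail (locking, reporting of suppressed messages).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of an interval/burst rate limiter (not the kernel's code). */
struct rl_state {
    unsigned long interval; /* window length, in caller-defined ticks */
    unsigned int burst;     /* messages allowed per window */
    unsigned long begin;    /* tick at which the current window started */
    unsigned int printed;   /* messages let through in this window */
    unsigned int missed;    /* messages suppressed in this window */
    int started;            /* has the first window been opened yet? */
};

/* Returns 1 if the caller may print at tick `now`, 0 if it should stay quiet. */
int rl_allow(struct rl_state *rs, unsigned long now)
{
    assert(rs != NULL);
    if (!rs->started) {
        rs->begin = now;
        rs->started = 1;
    }
    if (now - rs->begin > rs->interval) { /* window expired: open a new one */
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

With an interval of 30 ticks and a burst of 2, the third call inside the window is suppressed and printing resumes once the window has passed, so a program that keeps colliding with buffered writes cannot flood the log.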

Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-17 Thread Andreas Dilger

On Nov 15, 2017, at 7:18 PM, Qu Wenruo  wrote:
> 
> [Background]
> Recently I'm considering the possibility to use checksum from filesystem
> to enhance device-mapper raid.
> 
> The idea behind it is quite simple, since most modern filesystems have
> checksum for their metadata, and even some (btrfs) have checksum for data.
> 
> And for btrfs RAID1/10 (just ignore the RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct so it can
> return the correct data even one copy get corrupted.
> 
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
> 
> If only for verification, it's not much different from current endio
> hook method used by most of the fs.
> However if we can move the repair part from filesystem (well, only btrfs
> supports it yet), it would benefit all fs.

I recall Darrick was looking into a mechanism to do this.  Rather than
changing the whole block layer to take a callback to do a checksum, what
we looked at was to allow the upper-layer read to specify a "retry count"
to the lower-layer block device.  If the lower layer is able to retry the
read then it will read a different device (or combination of devices for
e.g. RAID-6) based on the retry count, until the upper layer gets a good
read (based on checksum, or whatever).  If there are no more devices (or
combinations) to try then a final error is returned.

Darrick can probably point at the original thread/patch.

Cheers, Andreas
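
The retry-count idea can be sketched as a small userspace model. Everything here is an assumption made for illustration: struct mirror_dev, dev_read() and fs_read_verified() are hypothetical names, the lower layer is just an array of copies, and the checksum is a placeholder rotating XOR rather than a real one such as crc32c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 8

/* Hypothetical lower layer: a device holding `ncopies` copies of one block. */
struct mirror_dev {
    const uint8_t (*copies)[DEMO_BLOCK_SIZE];
    unsigned ncopies;
};

/* Placeholder checksum (rotate and XOR); a real filesystem would use crc32c or similar. */
uint32_t block_csum(const uint8_t *buf)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < DEMO_BLOCK_SIZE; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ buf[i];
    return sum;
}

/* Lower layer: the retry count selects which copy to return; -1 once exhausted. */
int dev_read(const struct mirror_dev *dev, unsigned retry, uint8_t *out)
{
    assert(dev != NULL && out != NULL);
    if (retry >= dev->ncopies)
        return -1;
    memcpy(out, dev->copies[retry], DEMO_BLOCK_SIZE);
    return 0;
}

/* Upper layer: bump the retry count until the checksum matches or copies run out.
 * Returns the number of retries that were needed, or -1 as the final error. */
int fs_read_verified(const struct mirror_dev *dev, uint32_t want_csum, uint8_t *out)
{
    for (unsigned retry = 0; ; retry++) {
        if (dev_read(dev, retry, out) != 0)
            return -1;
        if (block_csum(out) == want_csum)
            return (int)retry;
    }
}
```

The point of the design is that the lower layer never needs to know what a checksum is; it only maps a retry count to a different copy (or, for RAID-6, a different combination of devices).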


Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-17 Thread Holger Hoffstätte

On 11/17/17 18:48, Marc MERLIN wrote:
> On Thu, Nov 16, 2017 at 09:53:15PM -0800, Marc MERLIN wrote:
>>> I suggest that you try lvmcache instead. It's much more flexible than 
>>> bcache,
>>> does pretty much the same job, and has much less of the "hacky" feel to it.
>>
>> I can read up on it, it's going to be a big pain to convert from one to
>> the other, but I can look at it for new filesystems.
> 
> I had a quick read. As expected, it's slower since it goes through all
> the LVM overhead that I got rid of recently
> https://github.com/stec-inc/EnhanceIO/wiki/PERFORMANCE-COMPARISON-AMONG-dm-cache,-bcache-and-EnhanceIO
> 
> Given the pain it would be for me to switch, I'm going to stick with
> bcache and hope it improves.

My understanding is that bcache until recently was more or less unmaintained,
but got quite a few patches for 4.14 and now has a new maintainer as well.
So things are looking up.

-h

Re: [RFC PATCH 0/8] btrfs iomap support

2017-11-17 Thread Nikolay Borisov



On 17.11.2017 19:44, Goldwyn Rodrigues wrote:
> This patch series attempts to use kernels iomap for btrfs. Currently,
> it covers buffered writes only, but I intend to add some other iomap
> uses once this gets through. I am sending this as an RFC because I
> would like to find ways to improve the solution since some changes
> require adding more functions to the iomap infrastructure which I
> would try to avoid. I still have to remove some kinks as well such
> as -o compress. I have posted some questions in the individual
> patches and would appreciate some input to those.
> 
> Some of the problems I faced is:
> 
> 1. extent locking: While we perform the extent locking for writes,
> we need to perform any reads because of non-page-aligned calls before
> locking can be done. This requires reading the page, increasing their
> pagecount and "letting it go". The iomap infrastructure uses
> buffer_heads whereas btrfs uses bio and hence needs to call readpage
> exclusively. The "letting it go" part makes me somewhat nervous of
> conflicting reads/writes, even though we are protected under i_rwsem.
> Is readpage_nolock() a good idea? The extent locking sequence is a
> bit weird, with locks and unlock happening in different functions.

Is there some inherent requirement in iomap's design that necessitates
the usage of buffer heads? I thought the trend is for buffer_head to
eventually die out. Given that iomap is fairly recent (2-3 years?) I
find it odd it's relying on buffer heads.

> 
> 2. btrfs pages use PagePrivate to store EXTENT_PAGE_PRIVATE which is not used anywhere.
> However, a PagePrivate flag is used for try_to_release_buffers(). Can
> we do away with PagePrivate for data pages? The same with PageChecked.
> How and why is it used (I guess -o compress)
> 
> 3. I had to stick information which will be required from iomap_begin()
> to iomap_end() in btrfs_iomap which is a pointer in btrfs_inode. Is
> there any other place/way we can transmit this information. XFS only
> performs allocations and deallocations so it just relies on bmap code
> for it.
> 
> Suggestions/Criticism welcome.
> 

[PATCH 3/4] bpf: add a bpf_override_function helper

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed caller's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h|  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h   |  3 ++-
 include/linux/trace_events.h |  1 +
 include/uapi/linux/bpf.h |  7 -
 kernel/bpf/core.c|  3 +++
 kernel/bpf/verifier.c|  2 ++
 kernel/events/core.c |  7 +
 kernel/trace/Kconfig | 11 
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c  | 55 +++-
 kernel/trace/trace_probe.h   | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs)
return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? */
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? */
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
---

[PATCH 2/4] btrfs: make open_ctree error injectable

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..c6b4e1f07072 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "hash.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-- 
2.7.5


[PATCH 1/4] add infrastructure for tagging functions as error injectable

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

Using BPF we can override kprobed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
for functions.  Simply tag a function with BPF_ALLOW_ERROR_INJECTION in
order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik 
---
 arch/x86/include/asm/asm.h|   6 ++
 include/asm-generic/kprobes.h |   9 +++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 7 files changed, 199 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
_ASM_PTR (entry);   \
.popsection
 
+# define _ASM_KPROBE_ERROR_INJECT(entry)   \
+   .pushsection "_kprobe_error_inject_list","aw" ; \
+   _ASM_ALIGN ;\
+   _ASM_PTR (entry);   \
+   .popsection
+
 .macro ALIGN_DESTINATION
/* check for bad alignment of destination */
movl %edi,%ecx
diff --git a/include/asm-generic/kprobes.h b/include/asm-generic/kprobes.h
index 57af9f21d148..f96c4de5d7b0 100644
--- a/include/asm-generic/kprobes.h
+++ b/include/asm-generic/kprobes.h
@@ -22,4 +22,13 @@ static unsigned long __used  
\
 #endif
 #endif /* defined(__KERNEL__) && !defined(__ASSEMBLY__) */
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list")))   \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+
 #endif /* _ASM_GENERIC_KPROBES_H */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST()    . = ALIGN(8);                           \
+       VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
+       KEEP(*(_kprobe_error_inject_list))                      \
+       VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS(). = ALIGN(8);   
\
VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
@@ -560,6 +569,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+   unsigned int num_kprobe_ei_funcs;
+   unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..7afadf07b34e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash)
return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of symbols that can be overridden for error injection. */
+static

[PATCH 4/4] samples/bpf: add a test for bpf_override_return

2017-11-17 Thread Josef Bacik

From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov 
Signed-off-by: Josef Bacik 
---
 samples/bpf/Makefile  |  4 
 samples/bpf/test_override_return.sh   | 15 +++
 samples/bpf/tracex7_kern.c| 16 
 samples/bpf/tracex7_user.c| 28 
 tools/include/uapi/linux/bpf.h|  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+   FILE *f;
+   char filename[256];
+   char command[256];
+   int ret;
+
+   snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+
+   snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+   f = popen(command, "r");
+   ret = pclose(f);
+
+   return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
FN(xdp_adjust_meta),\
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
-   FN(getsockopt),
+   FN(getsockopt), \
+   FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  *

[PATCH 0/4][v6] Add the ability to do BPF directed error injection

2017-11-17 Thread Josef Bacik

I've reworked this to be opt-in only as per Ingo and Alexei.  Still needs to go
through Dave because of the bpf bits, but I need tracing guys to weigh in and
sign off on my approach please.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniel's suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subsystems (block, memory) have ways to
inject errors, but they are random, so it's hard to get reproducible results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actually do the error injection part.
It is very simple: we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef
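
The opt-in rule from the v5->v6 notes (only functions tagged with
BPF_ALLOW_ERROR_INJECTION() may be overridden) can be modelled in plain
userspace C. This is only a sketch under stated assumptions: fake_regs,
allow_list, injection_allowed() and try_override_return() are hypothetical
names invented for illustration, not kernel or BPF APIs, and the real
mechanism works on pt_regs from a kprobe rather than on a struct like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the registers a probe handler would see. */
struct fake_regs {
	long ret;       /* models the return-value register */
	int  skip_body; /* models pointing the PC at a dummy function that just returns */
};

/* Hypothetical list of functions that opted in to error injection. */
static const char *const allow_list[] = { "open_ctree" };

/* Returns 1 if func opted in, 0 otherwise. */
int injection_allowed(const char *func)
{
	size_t i;

	assert(func != NULL);
	for (i = 0; i < sizeof(allow_list) / sizeof(allow_list[0]); i++)
		if (strcmp(allow_list[i], func) == 0)
			return 1;
	return 0;
}

/* Returns 0 if the override was applied, -1 if func did not opt in. */
int try_override_return(struct fake_regs *regs, const char *func, long rc)
{
	if (!injection_allowed(func))
		return -1;
	regs->ret = rc;      /* the caller will see rc as the return value */
	regs->skip_body = 1; /* the real function body is never executed */
	return 0;
}
```

Calling try_override_return() for "open_ctree" with rc = -12 updates the
model registers, while any function missing from allow_list leaves them
untouched.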

Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!

2017-11-17 Thread Marc MERLIN

On Thu, Nov 16, 2017 at 09:53:15PM -0800, Marc MERLIN wrote:
> > I suggest that you try lvmcache instead. It's much more flexible than bcache,
> > does pretty much the same job, and has much less of the "hacky" feel to it.
> 
> I can read up on it, it's going to be a big pain to convert from one to
> the other, but I can look at it for new filesystems.

I had a quick read. As expected, it's slower since it goes through all
the LVM overhead that I got rid of recently:
https://github.com/stec-inc/EnhanceIO/wiki/PERFORMANCE-COMPARISON-AMONG-dm-cache,-bcache-and-EnhanceIO

Given the pain it would be for me to switch, I'm going to stick with
bcache and hope it improves.
But just to be safe, I'm going to stick to this:
echo writearound > /sys/block/bcache0/bcache/cache_mode

My issues probably came from having writes go through bcache, with
writeback enabled on one drive even.
I'll go back to the safest setting and hope for the best.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

[RFC PATCH 1/8] btrfs: use iocb for __btrfs_buffered_write

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

Preparatory patch. It reduces the arguments to __btrfs_buffered_write
to follow buffered_write() style.

Signed-off-by: Goldwyn Rodrigues 

---
 fs/btrfs/file.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index aafcc785f840..9bceb0e61361 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1572,10 +1572,11 @@ static noinline int check_can_nocow(struct btrfs_inode 
*inode, loff_t pos,
return ret;
 }
 
-static noinline ssize_t __btrfs_buffered_write(struct file *file,
-  struct iov_iter *i,
-  loff_t pos)
+static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
+  struct iov_iter *i)
 {
+   struct file *file = iocb->ki_filp;
+   loff_t pos = iocb->ki_pos;
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -1815,7 +1816,6 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, 
struct iov_iter *from)
 {
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
-   loff_t pos = iocb->ki_pos;
ssize_t written;
ssize_t written_buffered;
loff_t endbyte;
@@ -1826,8 +1826,8 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, 
struct iov_iter *from)
if (written < 0 || !iov_iter_count(from))
return written;
 
-   pos += written;
-   written_buffered = __btrfs_buffered_write(file, from, pos);
+   iocb->ki_pos += written;
+   written_buffered = __btrfs_buffered_write(iocb, from);
if (written_buffered < 0) {
err = written_buffered;
goto out;
@@ -1836,16 +1836,16 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, 
struct iov_iter *from)
 * Ensure all data is persisted. We want the next direct IO read to be
 * able to read what was just written.
 */
-   endbyte = pos + written_buffered - 1;
-   err = btrfs_fdatawrite_range(inode, pos, endbyte);
+   endbyte = iocb->ki_pos + written_buffered - 1;
+   err = btrfs_fdatawrite_range(inode, iocb->ki_pos, endbyte);
if (err)
goto out;
-   err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
+   err = filemap_fdatawait_range(inode->i_mapping, iocb->ki_pos, endbyte);
if (err)
goto out;
+   iocb->ki_pos += written_buffered;
written += written_buffered;
-   iocb->ki_pos = pos + written_buffered;
-   invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
+   invalidate_mapping_pages(file->f_mapping, iocb->ki_pos >> PAGE_SHIFT,
 endbyte >> PAGE_SHIFT);
 out:
return written ? written : err;
@@ -1964,7 +1964,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_DIRECT) {
num_written = __btrfs_direct_write(iocb, from);
} else {
-   num_written = __btrfs_buffered_write(file, from, pos);
+   num_written = __btrfs_buffered_write(iocb, from);
if (num_written > 0)
iocb->ki_pos = pos + num_written;
if (clean_page)
-- 
2.14.2
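
The parameter change in this patch can be illustrated with a small userspace
model. It is a sketch only: mini_file, mini_iocb, mini_buffered_write() and
mini_write_iter() are hypothetical names, and the "write" merely grows a size
counter. What it is meant to show is the shape of the refactor: the callee
derives the file and position from the iocb, and the caller advances the
position after a successful write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a file: only tracks its size. */
struct mini_file {
	size_t size;
};

/* Hypothetical miniature of a kiocb: bundles the file and the position. */
struct mini_iocb {
	struct mini_file *filp;
	long long pos;
};

/* Takes the iocb instead of separate (file, pos) arguments. */
long long mini_buffered_write(struct mini_iocb *iocb, size_t len)
{
	struct mini_file *file = iocb->filp;
	long long pos = iocb->pos;

	assert(pos >= 0);
	if ((size_t)pos + len > file->size)
		file->size = (size_t)pos + len; /* extend the file if needed */
	return (long long)len;              /* bytes "written" */
}

/* The caller advances the position only after a successful write. */
long long mini_write_iter(struct mini_iocb *iocb, size_t len)
{
	long long written = mini_buffered_write(iocb, len);

	if (written > 0)
		iocb->pos += written;
	return written;
}
```

Keeping the position update in the caller means the callee can be reused from
both the buffered path and the direct-IO fallback path without either one
double-counting.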


[RFC PATCH 7/8] fs: iomap->prepare_pages() to set directives specific for the page

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

This adds prepare_pages() to iomap (the hook is named iomap_process_page in
the diff below) in order to set per-page directives, so that a filesystem such
as btrfs can perform post-write operations after the write completes.

Can we do away with this? EXTENT_PAGE_PRIVATE is only set and never used.
However, we want PG_private set on the page with SetPagePrivate()
for try_to_release_buffers(). Can we work around it?

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c   |  8 
 fs/dax.c  |  2 +-
 fs/internal.h |  2 +-
 fs/iomap.c| 23 ++-
 include/linux/iomap.h |  3 +++
 5 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b34ec493fe4b..b5cc5c0a0cf5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1641,9 +1641,17 @@ int btrfs_file_iomap_end(struct inode *inode, loff_t 
pos, loff_t length,
return ret;
 }
 
+static void btrfs_file_process_page(struct inode *inode, struct page *page)
+{
+   SetPagePrivate(page);
+   set_page_private(page, EXTENT_PAGE_PRIVATE);
+   get_page(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
+   .iomap_process_page = btrfs_file_process_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..51d07b24b3a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -943,7 +943,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
loff_t pos)
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/internal.h b/fs/internal.h
index 48cee21b4f14..bd9d5a37bd23 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,7 @@ extern long vfs_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg);
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-   void *data, struct iomap *iomap);
+   void *data, const struct iomap_ops *ops, struct iomap *iomap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 9ec9cc3077b3..a32660b1b6c5 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -78,7 +78,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
unsigned flags,
 * we can do the copy-in page by page without having to worry about
 * failures exposing transient data.
 */
-   written = actor(inode, pos, length, data, &iomap);
+   written = actor(inode, pos, length, data, ops, &iomap);
 
/*
 * Now the data has been copied, commit the range we've copied.  This
@@ -155,7 +155,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned 
len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct iov_iter *i = data;
long status = 0;
@@ -195,6 +195,9 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
if (unlikely(status))
break;
 
+   if (ops->iomap_process_page)
+   ops->iomap_process_page(inode, page);
+
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
@@ -271,7 +274,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
long status = 0;
ssize_t written = 0;
@@ -363,7 +366,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, 
unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)
 {
bool *did_zero = data;
loff_t written = 0;
@@ -432,7 +435,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)
 {
struct page *page = data;
int ret;
@@ -523,7 +526,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length,

[RFC PATCH 7/8] fs: iomap->prepare_pages() to set directives specific for the page

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

This adds prepare_pages() to iomap (the hook is named iomap_process_page in
the diff below) in order to set per-page directives, so that a filesystem such
as btrfs can perform post-write operations after the write completes.

Can we do away with this? EXTENT_PAGE_PRIVATE is only set and never used.
However, we want PG_private set on the page with SetPagePrivate()
for try_to_release_buffers(). Can we work around it?

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c   | 12 ++--
 fs/dax.c  |  2 +-
 fs/internal.h |  2 +-
 fs/iomap.c| 23 ++-
 include/linux/iomap.h |  3 +++
 5 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f5f34e199709..1c459c9001b2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1261,8 +1261,8 @@ static int prepare_uptodate_page(struct inode *inode, u64 
pos, struct page **pag
if (!(pos & (PAGE_SIZE - 1)))
goto out;
 
-   page = find_or_create_page(inode->i_mapping, index,
-   btrfs_alloc_write_mask(inode->i_mapping) | __GFP_WRITE);
+   page = grab_cache_page_write_begin(inode->i_mapping, index,
+   AOP_FLAG_NOFS);
 
if (!PageUptodate(page)) {
int ret = btrfs_readpage(NULL, page);
@@ -1641,9 +1641,17 @@ int btrfs_file_iomap_end(struct inode *inode, loff_t 
pos, loff_t length,
return ret;
 }
 
+static void btrfs_file_process_page(struct inode *inode, struct page *page)
+{
+   SetPagePrivate(page);
+   set_page_private(page, EXTENT_PAGE_PRIVATE);
+   get_page(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
+   .iomap_process_page = btrfs_file_process_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..51d07b24b3a1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -943,7 +943,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
loff_t pos)
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct block_device *bdev = iomap->bdev;
struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/internal.h b/fs/internal.h
index 48cee21b4f14..bd9d5a37bd23 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,7 @@ extern long vfs_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg);
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-   void *data, struct iomap *iomap);
+   void *data, const struct iomap_ops *ops, struct iomap *iomap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 9ec9cc3077b3..a32660b1b6c5 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -78,7 +78,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
unsigned flags,
 * we can do the copy-in page by page without having to worry about
 * failures exposing transient data.
 */
-   written = actor(inode, pos, length, data, &iomap);
+   written = actor(inode, pos, length, data, ops, &iomap);
 
/*
 * Now the data has been copied, commit the range we've copied.  This
@@ -155,7 +155,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned 
len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
struct iov_iter *i = data;
long status = 0;
@@ -195,6 +195,9 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
if (unlikely(status))
break;
 
+   if (ops->iomap_process_page)
+   ops->iomap_process_page(inode, page);
+
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
@@ -271,7 +274,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-   struct iomap *iomap)
+   const struct iomap_ops *ops, struct iomap *iomap)
 {
long status = 0;
ssize_t written = 0;
@@ -363,7 +366,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, 
unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap)
+   void *data, const struct iomap_ops *ops, struct iomap *iomap)
 {
bool *did_zero = data;

[RFC PATCH 8/8] iomap: Introduce iomap->dirty_page()

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

In dirty_page(), we are clearing PageChecked, though I don't see it set.
Is this used for compression only?
Can we call __set_page_dirty_nobuffers instead?

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c   | 8 
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c459c9001b2..ba304e782098 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1648,10 +1648,18 @@ static void btrfs_file_process_page(struct inode 
*inode, struct page *page)
get_page(page);
 }
 
+static void btrfs_file_dirty_page(struct page *page)
+{
+   SetPageUptodate(page);
+   ClearPageChecked(page);
+   set_page_dirty(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
.iomap_process_page = btrfs_file_process_page,
+   .iomap_dirty_page   = btrfs_file_dirty_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/iomap.c b/fs/iomap.c
index a32660b1b6c5..0907790c76c0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -208,6 +208,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
status = iomap_write_end(inode, pos, bytes, copied, page, 
iomap);
if (unlikely(status < 0))
break;
+   if (ops->iomap_dirty_page)
+   ops->iomap_dirty_page(page);
copied = status;
 
cond_resched();
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index fbb0194d56d6..7fbf6889dc54 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -76,6 +76,7 @@ struct iomap_ops {
ssize_t written, unsigned flags, struct iomap *iomap);
 
void (*iomap_process_page)(struct inode *inode, struct page *page);
+   void (*iomap_dirty_page)(struct page *page);
 };
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.14.2
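
Patches 7/8 and 8/8 both add an optional callback to the ops table and call it
only when the filesystem has set it. The pattern can be sketched in userspace
C as follows. All names here (mini_page, mini_ops, fs_process_page(),
fs_dirty_page(), mini_write_one_page()) are hypothetical and only model the
shape of the change, not the real iomap or page-cache code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the page state the hooks touch. */
struct mini_page {
	int private_set;
	int dirty;
	int refs;
};

/* Hypothetical ops table: both hooks are optional and may be NULL. */
struct mini_ops {
	void (*process_page)(struct mini_page *page);
	void (*dirty_page)(struct mini_page *page);
};

/* Models a filesystem hook that marks the page private and takes a reference. */
void fs_process_page(struct mini_page *page)
{
	page->private_set = 1;
	page->refs++;
}

/* Models a filesystem hook that dirties the page after the copy. */
void fs_dirty_page(struct mini_page *page)
{
	page->dirty = 1;
}

/* Models one iteration of a write loop; returns how many hooks ran. */
int mini_write_one_page(const struct mini_ops *ops, struct mini_page *page)
{
	int called = 0;

	assert(ops != NULL && page != NULL);
	if (ops->process_page) {
		ops->process_page(page);
		called++;
	}
	/* ... the copy of user data into the page would happen here ... */
	if (ops->dirty_page) {
		ops->dirty_page(page);
		called++;
	}
	return called;
}
```

A filesystem that leaves both pointers NULL gets the generic behaviour
unchanged, which is why the NULL checks sit in the generic loop rather than in
each filesystem.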


[RFC PATCH 3/8] fs: Introduce IOMAP_F_NOBH

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

IOMAP_F_NOBH tells iomap functions not to use or attach buffer heads
to the page. Page flush and writeback are the responsibility of the
filesystem code (such as btrfs), which uses bios to perform them.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/iomap.c| 20 
 include/linux/iomap.h |  1 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index d4801f8dd4fd..9ec9cc3077b3 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -123,7 +123,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned 
len, unsigned flags,
if (!page)
return -ENOMEM;
 
-   status = __block_write_begin_int(page, pos, len, NULL, iomap);
+   if (!(iomap->flags & IOMAP_F_NOBH))
+   status = __block_write_begin_int(page, pos, len, NULL, iomap);
if (unlikely(status)) {
unlock_page(page);
put_page(page);
@@ -138,12 +139,15 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
unsigned len, unsigned flags,
 
 static int
 iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-   unsigned copied, struct page *page)
+   unsigned copied, struct page *page, struct iomap *iomap)
 {
-   int ret;
+   int ret = len;
 
-   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
-   copied, page, NULL);
+   if (iomap->flags & IOMAP_F_NOBH)
+   ret = inode_extend_page(inode, pos, copied, page);
+   else
+   ret = generic_write_end(NULL, inode->i_mapping, pos, len,
+   copied, page, NULL);
if (ret < len)
iomap_write_failed(inode, pos, len);
return ret;
@@ -198,7 +202,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
 
flush_dcache_page(page);
 
-   status = iomap_write_end(inode, pos, bytes, copied, page);
+   status = iomap_write_end(inode, pos, bytes, copied, page, 
iomap);
if (unlikely(status < 0))
break;
copied = status;
@@ -292,7 +296,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
 
WARN_ON_ONCE(!PageUptodate(page));
 
-   status = iomap_write_end(inode, pos, bytes, bytes, page);
+   status = iomap_write_end(inode, pos, bytes, bytes, page, iomap);
if (unlikely(status <= 0)) {
if (WARN_ON_ONCE(status == 0))
return -EIO;
@@ -344,7 +348,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
unsigned offset,
zero_user(page, offset, bytes);
mark_page_accessed(page);
 
-   return iomap_write_end(inode, pos, bytes, bytes, page);
+   return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
 }
 
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8a7c6d26b147..61af7b1bd0fc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -29,6 +29,7 @@ struct vm_fault;
  */
 #define IOMAP_F_MERGED 0x10/* contains multiple blocks/extents */
 #define IOMAP_F_SHARED 0x20/* block shared with another file */
+#define IOMAP_F_NOBH   0x40/* Do not assign buffer heads */
 
 /*
  * Magic value for blkno:
-- 
2.14.2
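
The flag test this patch adds can be modelled in a few lines of userspace C.
This is a sketch under stated assumptions: MINI_F_NOBH reuses the 0x40 bit
value from the patch, but mini_path, mini_pick_write_end() and
mini_write_failed() are hypothetical names that only mirror the branching, not
the real iomap functions.

```c
#include <assert.h>

#define MINI_F_NOBH 0x40 /* same bit value as IOMAP_F_NOBH in the patch */

enum mini_path {
	PATH_BUFFER_HEADS, /* generic path that attaches buffer heads */
	PATH_FS_OWNED      /* filesystem handles flush and writeback itself */
};

/* Picks the write-end path from the mapping flags. */
enum mini_path mini_pick_write_end(unsigned int flags)
{
	return (flags & MINI_F_NOBH) ? PATH_FS_OWNED : PATH_BUFFER_HEADS;
}

/* Mirrors the short-write check: anything below len counts as a failure. */
int mini_write_failed(int ret, int len)
{
	assert(len >= 0);
	return ret < len;
}
```

Other flag bits do not affect the choice, so a mapping with 0x40 | 0x10 set
still takes the filesystem-owned path.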


[RFC PATCH 5/8] btrfs: use iomap to perform buffered writes

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

This eliminates all page-related code.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/btrfs_inode.h |   4 +-
 fs/btrfs/file.c| 488 ++---
 2 files changed, 185 insertions(+), 307 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eccadb5f62a5..2c2bc5fd5cc9 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -21,7 +21,7 @@
 
 #include 
 #include "extent_map.h"
-#include "extent_io.h"
+#include "iomap.h"
 #include "ordered-data.h"
 #include "delayed-inode.h"
 
@@ -207,6 +207,8 @@ struct btrfs_inode {
 */
struct rw_semaphore dio_sem;
 
+   struct btrfs_iomap *b_iomap;
+
struct inode vfs_inode;
 };
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 876c2acc2a71..b7390214ef3a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -405,79 +405,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
return 0;
 }
 
-/* simple helper to fault in pages and copy.  This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
-struct page **prepared_pages,
-struct iov_iter *i)
-{
-   size_t copied = 0;
-   size_t total_copied = 0;
-   int pg = 0;
-   int offset = pos & (PAGE_SIZE - 1);
-
-   while (write_bytes > 0) {
-   size_t count = min_t(size_t,
-PAGE_SIZE - offset, write_bytes);
-   struct page *page = prepared_pages[pg];
-   /*
-* Copy data from userspace to the current page
-*/
-   copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
-   /* Flush processor's dcache for this page */
-   flush_dcache_page(page);
-
-   /*
-* if we get a partial write, we can end up with
-* partially up to date pages.  These add
-* a lot of complexity, so make sure they don't
-* happen by forcing this copy to be retried.
-*
-* The rest of the btrfs_file_write code will fall
-* back to page at a time copies after we return 0.
-*/
-   if (!PageUptodate(page) && copied < count)
-   copied = 0;
-
-   iov_iter_advance(i, copied);
-   write_bytes -= copied;
-   total_copied += copied;
-
-   /* Return to btrfs_file_write_iter to fault page */
-   if (unlikely(copied == 0))
-   break;
-
-   if (copied < PAGE_SIZE - offset) {
-   offset += copied;
-   } else {
-   pg++;
-   offset = 0;
-   }
-   }
-   return total_copied;
-}
-
-/*
- * unlocks pages after btrfs_file_write is done with them
- */
-static void btrfs_drop_pages(struct page **pages, size_t num_pages)
-{
-   size_t i;
-   for (i = 0; i < num_pages; i++) {
-   /* page checked is some magic around finding pages that
-* have been modified without going through btrfs_set_page_dirty
-* clear it here. There should be no need to mark the pages
-* accessed as prepare_pages should have marked them accessed
-* in prepare_pages via find_or_create_page()
-*/
-   ClearPageChecked(pages[i]);
-   unlock_page(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /*
  * after copy_from_user, pages need to be dirtied and we need to make
  * sure holes are created between the current EOF and the start of
@@ -1457,8 +1384,7 @@ static int btrfs_find_new_delalloc_bytes(struct 
btrfs_inode *inode,
  * the other < 0 number - Something wrong happens
  */
 static noinline int
-lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
-   size_t num_pages, loff_t pos,
+lock_and_cleanup_extent(struct btrfs_inode *inode, loff_t pos,
size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
@@ -1466,7 +1392,6 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode 
*inode, struct page **pages,
struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
u64 start_pos;
u64 last_pos;
-   int i;
int ret = 0;
 
start_pos = round_down(pos, fs_info->sectorsize);
@@ -1488,10 +1413,6 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode 
*inode, struct page **pages,
ordered->file_offset <= last_pos) {
unlock_extent_cached(&inode->io_tree,

[RFC PATCH 6/8] btrfs: read the first/last page of the write

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

We cannot perform a readpage in iomap_apply after
iomap_begin() because we have our extents locked. So
we perform a readpage and make sure we unlock the page afterwards,
while keeping the page count raised.

Question: How do we deal with an -EAGAIN return from
prepare_uptodate_page()? Under what scenarios would this occur?

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c  | 116 ++-
 fs/btrfs/iomap.h |   1 +
 2 files changed, 47 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b7390214ef3a..b34ec493fe4b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1252,84 +1252,36 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle 
*trans,
return 0;
 }
 
-/*
- * on error we return an unlocked page and the error value
- * on success we return a locked page and 0
- */
-static int prepare_uptodate_page(struct inode *inode,
-struct page *page, u64 pos,
-bool force_uptodate)
+static int prepare_uptodate_page(struct inode *inode, u64 pos, struct page 
**pagep)
 {
+   struct page *page = NULL;
int ret = 0;
+   int index = pos >> PAGE_SHIFT;
+
+   if (!(pos & (PAGE_SIZE - 1)))
+   goto out;
+
+   page = grab_cache_page_write_begin(inode->i_mapping, index,
+   AOP_FLAG_NOFS);
 
-   if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
-   !PageUptodate(page)) {
+   if (!PageUptodate(page)) {
ret = btrfs_readpage(NULL, page);
if (ret)
-   return ret;
-   lock_page(page);
+   goto out;
if (!PageUptodate(page)) {
-   unlock_page(page);
-   return -EIO;
+   ret = -EIO;
+   goto out;
}
if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   return -EAGAIN;
-   }
-   }
-   return 0;
-}
-
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
- size_t num_pages, loff_t pos,
- size_t write_bytes, bool force_uptodate)
-{
-   int i;
-   unsigned long index = pos >> PAGE_SHIFT;
-   gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-   int err = 0;
-   int faili;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = find_or_create_page(inode->i_mapping, index + i,
-  mask | __GFP_WRITE);
-   if (!pages[i]) {
-   faili = i - 1;
-   err = -ENOMEM;
-   goto fail;
-   }
-
-   if (i == 0)
-   err = prepare_uptodate_page(inode, pages[i], pos,
-   force_uptodate);
-   if (!err && i == num_pages - 1)
-   err = prepare_uptodate_page(inode, pages[i],
-   pos + write_bytes, false);
-   if (err) {
-   put_page(pages[i]);
-   if (err == -EAGAIN) {
-   err = 0;
-   goto again;
-   }
-   faili = i - 1;
-   goto fail;
+   ret = -EAGAIN;
+   goto out;
}
-   wait_on_page_writeback(pages[i]);
}
-
-   return 0;
-fail:
-   while (faili >= 0) {
-   unlock_page(pages[faili]);
-   put_page(pages[faili]);
-   faili--;
-   }
-   return err;
-
+out:
+   if (page)
+   unlock_page(page);
+   *pagep = page;
+   return ret;
 }
 
 static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
@@ -1502,7 +1454,7 @@ int btrfs_file_iomap_begin(struct inode *inode, loff_t 
pos, loff_t length,
 fs_info->sectorsize);
 bim->extent_locked = false;
 iomap->type = IOMAP_DELALLOC;
-iomap->flags = IOMAP_F_NEW;
+iomap->flags = IOMAP_F_NEW | IOMAP_F_NOBH;
 
extent_changeset_release(bim->data_reserved);
 /* Reserve data/quota space */
@@ -1526,7 +1478,7 @@ int btrfs_file_iomap_begin(struct inode *inode, loff_t 
pos, loff_t length,
 sector_offset,
 fs_info->sectorsize);
 iomap->type = IOMAP_UNWRITTEN;
-iomap->flags = 0;
+iomap->flags &= ~IOMAP_F_NEW;
 } else {
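
The alignment test and index computation in the new prepare_uptodate_page()
can be tried out in userspace. This is a sketch with assumptions spelled out:
a 4 KiB page size is assumed, all mini_* names are hypothetical, and
mini_edge_reads() is one plausible way to count the edge pages of a write (the
merge of both edges when they share a page is my assumption, not something the
patch shows).

```c
#include <assert.h>
#include <stdint.h>

#define MINI_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define MINI_PAGE_SIZE  (1ULL << MINI_PAGE_SHIFT)

/* Nonzero when pos falls inside a page, i.e. that edge of the write is partial. */
int mini_needs_read(uint64_t pos)
{
	return (pos & (MINI_PAGE_SIZE - 1)) != 0;
}

/* Index of the page that contains pos. */
uint64_t mini_page_index(uint64_t pos)
{
	return pos >> MINI_PAGE_SHIFT;
}

/* Number of edge pages (0, 1 or 2) that must be read before the write. */
int mini_edge_reads(uint64_t pos, uint64_t len)
{
	int reads;

	assert(len > 0);
	reads = mini_needs_read(pos) + mini_needs_read(pos + len);
	/* Both edges inside the same page: a single read covers them. */
	if (reads == 2 && mini_page_index(pos) == mini_page_index(pos + len))
		reads = 1;
	return reads;
}
```

A page-aligned, page-sized write needs no read at all, which corresponds to
the early "goto out" taken when pos has no offset within the page.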

[RFC PATCH 8/8] fs: Introduce iomap->dirty_page()

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

In dirty_page(), we are clearing PageChecked, though I don't see it set.
Is this used for compression only?
Can we call __set_page_dirty_nobuffers instead?

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c   | 8 
 fs/iomap.c| 2 ++
 include/linux/iomap.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b5cc5c0a0cf5..049ed1d8ce1f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1648,10 +1648,18 @@ static void btrfs_file_process_page(struct inode 
*inode, struct page *page)
get_page(page);
 }
 
+static void btrfs_file_dirty_page(struct page *page)
+{
+   SetPageUptodate(page);
+   ClearPageChecked(page);
+   set_page_dirty(page);
+}
+
 const struct iomap_ops btrfs_iomap_ops = {
 .iomap_begin= btrfs_file_iomap_begin,
 .iomap_end  = btrfs_file_iomap_end,
.iomap_process_page = btrfs_file_process_page,
+   .iomap_dirty_page   = btrfs_file_dirty_page,
 };
 
 static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
diff --git a/fs/iomap.c b/fs/iomap.c
index a32660b1b6c5..0907790c76c0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -208,6 +208,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
status = iomap_write_end(inode, pos, bytes, copied, page, 
iomap);
if (unlikely(status < 0))
break;
+   if (ops->iomap_dirty_page)
+   ops->iomap_dirty_page(page);
copied = status;
 
cond_resched();
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index fbb0194d56d6..7fbf6889dc54 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -76,6 +76,7 @@ struct iomap_ops {
ssize_t written, unsigned flags, struct iomap *iomap);
 
void (*iomap_process_page)(struct inode *inode, struct page *page);
+   void (*iomap_dirty_page)(struct page *page);
 };
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.14.2


[RFC PATCH 4/8] btrfs: Introduce btrfs_iomap

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

Preparatory patch. The btrfs_iomap structure carries extent/page
state from iomap_begin() to iomap_end().

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c  | 68 ++--
 fs/btrfs/iomap.h | 21 +
 2 files changed, 53 insertions(+), 36 deletions(-)
 create mode 100644 fs/btrfs/iomap.h

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9bceb0e61361..876c2acc2a71 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -41,6 +41,7 @@
 #include "volumes.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "iomap.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -1580,18 +1581,14 @@ static noinline ssize_t __btrfs_buffered_write(struct 
kiocb *iocb,
struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_iomap btrfs_iomap = {0};
+   struct btrfs_iomap *bim = &btrfs_iomap;
struct page **pages = NULL;
-   struct extent_state *cached_state = NULL;
-   struct extent_changeset *data_reserved = NULL;
u64 release_bytes = 0;
-   u64 lockstart;
-   u64 lockend;
size_t num_written = 0;
int nrptrs;
int ret = 0;
-   bool only_release_metadata = false;
bool force_page_uptodate = false;
-   bool need_unlock;
 
nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
PAGE_SIZE / (sizeof(struct page *)));
@@ -1609,7 +1606,6 @@ static noinline ssize_t __btrfs_buffered_write(struct 
kiocb *iocb,
 offset);
size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
PAGE_SIZE);
-   size_t reserve_bytes;
size_t dirty_pages;
size_t copied;
size_t dirty_sectors;
@@ -1627,11 +1623,11 @@ static noinline ssize_t __btrfs_buffered_write(struct 
kiocb *iocb,
}
 
sector_offset = pos & (fs_info->sectorsize - 1);
-   reserve_bytes = round_up(write_bytes + sector_offset,
+   bim->reserve_bytes = round_up(write_bytes + sector_offset,
fs_info->sectorsize);
 
-   extent_changeset_release(data_reserved);
-   ret = btrfs_check_data_free_space(inode, &data_reserved, pos,
+   extent_changeset_release(bim->data_reserved);
+   ret = btrfs_check_data_free_space(inode, &bim->data_reserved, pos,
  write_bytes);
if (ret < 0) {
if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1642,14 +1638,14 @@ static noinline ssize_t __btrfs_buffered_write(struct 
kiocb *iocb,
 * For nodata cow case, no need to reserve
 * data space.
 */
-   only_release_metadata = true;
+   bim->only_release_metadata = true;
/*
 * our prealloc extent may be smaller than
 * write_bytes, so scale down.
 */
num_pages = DIV_ROUND_UP(write_bytes + offset,
 PAGE_SIZE);
-   reserve_bytes = round_up(write_bytes +
+   bim->reserve_bytes = round_up(write_bytes +
 sector_offset,
 fs_info->sectorsize);
} else {
@@ -1658,19 +1654,19 @@ static noinline ssize_t __btrfs_buffered_write(struct 
kiocb *iocb,
}
 
ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-   reserve_bytes);
+   bim->reserve_bytes);
if (ret) {
-   if (!only_release_metadata)
+   if (!bim->only_release_metadata)
btrfs_free_reserved_data_space(inode,
-   data_reserved, pos,
+   bim->data_reserved, pos,
write_bytes);
else
btrfs_end_write_no_snapshotting(root);
break;
}
 
-   release_bytes = reserve_bytes;
-   need_unlock = false;
+   release_bytes = bim->reserve_bytes;
+   bim->extent_locked = 0;
 again:
/*
 * This is going to setup the pages array

[RFC PATCH 2/8] fs: Add inode_extend_page()

2017-11-17 Thread Goldwyn Rodrigues

From: Goldwyn Rodrigues 

This splits generic_write_end() into two steps: block_write_end()
and the new inode_extend_page().

inode_extend_page() increases i_size (if required) and extends
the pagecache.

The split is done so that we don't use buffer_heads while ending file I/O.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/buffer.c | 20 +---
 include/linux/buffer_head.h |  1 +
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..266daa85b80e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2180,16 +2180,11 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
 }
 EXPORT_SYMBOL(block_write_end);
 
-int generic_write_end(struct file *file, struct address_space *mapping,
-   loff_t pos, unsigned len, unsigned copied,
-   struct page *page, void *fsdata)
+int inode_extend_page(struct inode *inode, loff_t pos,
+   unsigned copied, struct page *page)
 {
-   struct inode *inode = mapping->host;
loff_t old_size = inode->i_size;
int i_size_changed = 0;
-
-   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
/*
 * No need to use i_size_read() here, the i_size
 * cannot change under us because we hold i_mutex.
@@ -2218,6 +2213,17 @@ int generic_write_end(struct file *file, struct 
address_space *mapping,
 
return copied;
 }
+EXPORT_SYMBOL(inode_extend_page);
+
+int generic_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   struct inode *inode = mapping->host;
+   copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+   return inode_extend_page(inode, pos, copied, page);
+
+}
 EXPORT_SYMBOL(generic_write_end);
 
 /*
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index afa37f807f12..16cf994be178 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -229,6 +229,7 @@ int __block_write_begin(struct page *page, loff_t pos, 
unsigned len,
 int block_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
+int inode_extend_page(struct inode *, loff_t, unsigned, struct page*);
 int generic_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
-- 
2.14.2


[RFC PATCH 0/8] btrfs iomap support

2017-11-17 Thread Goldwyn Rodrigues

This patch series attempts to use the kernel's iomap for btrfs. Currently,
it covers buffered writes only, but I intend to add some other iomap
uses once this gets through. I am sending this as an RFC because I
would like to find ways to improve the solution, since some changes
require adding more functions to the iomap infrastructure, which I
would like to avoid. I still have to iron out some kinks as well, such
as -o compress. I have posted some questions in the individual
patches and would appreciate some input on those.

Some of the problems I faced are:

1. extent locking: While we perform the extent locking for writes,
we need to perform any reads required by non-page-aligned calls before
locking can be done. This requires reading the pages, increasing their
page count and "letting them go". The iomap infrastructure uses
buffer_heads whereas btrfs uses bios and hence needs to call readpage
exclusively. The "letting them go" part makes me somewhat nervous about
conflicting reads/writes, even though we are protected under i_rwsem.
Is readpage_nolock() a good idea? The extent locking sequence is a
bit weird, with locks and unlocks happening in different functions.

2. btrfs pages use PagePrivate to store EXTENT_PAGE_PRIVATE, which is not
used anywhere. However, the PagePrivate flag is used for
try_to_release_buffers(). Can we do away with PagePrivate for data pages?
The same goes for PageChecked: how and why is it used (I guess -o compress)?

3. I had to stick the information which is required from iomap_begin()
to iomap_end() in btrfs_iomap, which is a pointer in btrfs_inode. Is
there any other place/way we can transmit this information? XFS only
performs allocations and deallocations, so it just relies on the bmap code
for it.

Suggestions/Criticism welcome.

-- 
Goldwyn



Re: Btrfs progs pre-release 4.14-rc1

2017-11-17 Thread Uli Heller


I tried to compile these on ubuntu-14.04 running 4.4.0-98-generic
- I had to install libzstd-dev
- I had to disable the zstd tests

Do you think this ends up in a working binary? Or am I doing something 
strange?


Best regards, Uli

--
Uli Heller


Re: btrfs check: add_missing_dir_index: BUG_ON `ret` triggered, value -17

2017-11-17 Thread Marc MERLIN

On Fri, Nov 17, 2017 at 04:12:07PM +0800, Qu Wenruo wrote:
> 
> 
> > On 2017-11-17 15:30, Marc MERLIN wrote:
> > Here's the whole output:
> > gargamel:~# btrfs-debug-tree -t 258 /dev/mapper/raid0d1 | grep 1919805647
> 
> Sorry, I missed "-C10" parameter for grep.
 
generation 2231977 transid 2237084 size 64 nbytes 0
block group 0 mode 40755 links 1 uid 33 gid 33 rdev 0
sequence 0 flags 0x1710(none)
atime 1510290002.516060162 (2017-11-09 21:00:02)
ctime 1510477350.88506455 (2017-11-12 01:02:30)
mtime 1510477350.88506455 (2017-11-12 01:02:30)
otime 1510290002.516060162 (2017-11-09 21:00:02)
item 26 key (1919785864 INODE_REF 1919785862) itemoff 14683 itemsize 12
index 2 namelen 2 name: 00
item 27 key (1919785864 DIR_ITEM 2591417872) itemoff 14637 itemsize 46
location key (1919805647 INODE_ITEM 0) type FILE
transid 2231988 data_len 0 name_len 16
name: 1955-capture.jpg
item 28 key (1919785864 DIR_ITEM 3406016191) itemoff 14591 itemsize 46
location key (1919805657 INODE_ITEM 0) type FILE
transid 2231988 data_len 0 name_len 16
name: 1956-capture.jpg
item 29 key (1919785864 DIR_INDEX 1957) itemoff 14575 itemsize 16
location key (7383370114097217536 UNKNOWN.211 
15651972432879681580) type DIR_ITEM.0
transid 72057594045427176 data_len 0 name_len 0
name: 
item 30 key (1919805647 INODE_ITEM 0) itemoff 14415 itemsize 160
generation 2231988 transid 2231989 size 81701 nbytes 81920
block group 0 mode 100644 links 1 uid 33 gid 33 rdev 0
sequence 8 flags 0x14(NOCOMPRESS)
atime 1510290392.703320623 (2017-11-09 21:06:32)
ctime 1510290392.715320477 (2017-11-09 21:06:32)
mtime 1510290392.715320477 (2017-11-09 21:06:32)
otime 1510290392.703320623 (2017-11-09 21:06:32)
item 31 key (1919805647 INODE_REF 1919785864) itemoff 14389 itemsize 26
index 1957 namelen 16 name: 1955-capture.jpg
item 32 key (1919805647 EXTENT_DATA 0) itemoff 14336 itemsize 53
generation 2231989 type 1 (regular)
extent data disk byte 2381649588224 nr 81920
extent data offset 0 nr 81920 ram 81920
extent compression 0 (none)
item 33 key (1919805657 INODE_ITEM 0) itemoff 14176 itemsize 160
generation 2231988 transid 2231989 size 81856 nbytes 81920
block group 0 mode 100644 links 1 uid 33 gid 33 rdev 0
sequence 8 flags 0x14(NOCOMPRESS)
atime 1510290392.919317997 (2017-11-09 21:06:32)
ctime 1510290392.931317852 (2017-11-09 21:06:32)


> Although what we could try is to avoid BUG_ON(), but I'm afraid the
> problem is more severe than my expectation.
 
How does it look now?

> Any idea how such corruption happened?

Sigh, I wish I knew.

It feels like every btrfs filesystem I've had between my 3 systems has
gotten inexplicably corrupted at least once.
This one is not even using bcache, just dmcrypt underneath.

It's my only one using btrfs raid (1):
gargamel:~# btrfs fi show /dev/mapper/raid0d1 
Label: 'btrfs_space'  uuid: 01334b81-c0db-4e80-92e4-cac4da867651
Total devices 2 FS bytes used 1.12TiB
devid1 size 836.13GiB used 722.03GiB path /dev/mapper/raid0d1
devid2 size 836.13GiB used 722.03GiB path /dev/mapper/raid0d2

Data, RAID0: total=1.38TiB, used=1.11TiB
System, RAID1: total=32.00MiB, used=128.00KiB
Metadata, RAID1: total=13.00GiB, used=8.54GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Now, I didn't get errors or warnings, or even scrub warnings, on it; I just ran
btrfs check to see what would happen.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

Unrecoverable scrub errors

2017-11-17 Thread Nazar Mokrynskyi

Hi folks,

I'm a long-term btrfs user (permanently for my root partition and other stuff
for ~3 years now, with compression, most of the way with RAID0 on various SSDs,
etc).

Put simply, my setup consists of a root partition and a backup partition. There
are automated snapshots on the root partition, which are then copied to an online
backup partition (send/receive, handled by "Just backup btrfs") and
occasionally to an offline backup partition (handled by "Btrfs sync subvolumes").

I've recently found that my online backup partition has some unrecoverable 
errors as reported after running scrub:

> scrub status for 82cfcb0f-0b80-4764-bed6-f529f2030ac5
>     scrub started at Fri Nov 17 15:05:12 2017 and finished after 02:07:30
>     total bytes scrubbed: 915.16GiB with 12 errors
>     error details: csum=12
>     corrected errors: 0, uncorrectable errors: 12, unverified errors: 0
dmesg (this is all related to mentioned errors):

> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 
> 470069460992 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238048: metadata leaf (level 0) in tree 985
> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 
> 470069460992 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238048: metadata leaf (level 0) in tree 985
> [551049.038723] BTRFS error (device dm-2): bdev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, 
> flush 0, corrupt 1, gen 0
> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 
> 470069526528 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238176: metadata leaf (level 0) in tree 985
> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 
> 470069526528 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238176: metadata leaf (level 0) in tree 985
> [551049.039637] BTRFS error (device dm-2): bdev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, 
> flush 0, corrupt 2, gen 0
> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error at 
> logical 470069460992 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.413473] BTRFS warning (device dm-2): checksum error at logical 
> 470069477376 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238080: metadata leaf (level 0) in tree 985
> [551049.413473] BTRFS warning (device dm-2): checksum error at logical 
> 470069477376 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238080: metadata leaf (level 0) in tree 985
> [551049.413475] BTRFS error (device dm-2): bdev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, 
> flush 0, corrupt 3, gen 0
> [551049.413685] BTRFS error (device dm-2): unable to fixup (regular) error at 
> logical 470069477376 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.413910] BTRFS warning (device dm-2): checksum error at logical 
> 470069493760 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238112: metadata leaf (level 0) in tree 985
> [551049.413911] BTRFS warning (device dm-2): checksum error at logical 
> 470069493760 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238112: metadata leaf (level 0) in tree 985
> [551049.413912] BTRFS error (device dm-2): bdev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, 
> flush 0, corrupt 4, gen 0
> [551049.414121] BTRFS error (device dm-2): unable to fixup (regular) error at 
> logical 470069493760 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.414354] BTRFS warning (device dm-2): checksum error at logical 
> 470069510144 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238144: metadata leaf (level 0) in tree 985
> [551049.414355] BTRFS warning (device dm-2): checksum error at logical 
> 470069510144 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238144: metadata leaf (level 0) in tree 985
> [551049.414356] BTRFS error (device dm-2): bdev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 0, 
> flush 0, corrupt 5, gen 0
> [551049.414567] BTRFS error (device dm-2): unable to fixup (regular) error at 
> logical 470069510144 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.479023] BTRFS error (device dm-2): unable to fixup (regular) error at 
> logical 470069526528 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> [551049.479989] BTRFS warning (device dm-2): checksum error at logical 
> 470069542912 on dev 
> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> 942238208: metadata leaf (level 0)

Re: Btrfs restore error

2017-11-17 Thread Jay




On 2017-11-16 10:38 PM, Qu Wenruo wrote:



On 2017-11-17 11:56, Jay wrote:

Hello,

I thought I should report something since there was little information
on this error. The situation is I have 2 external hard drives on
Xubuntu. One is not working and I need to move the data over to the
other.


"btrfs replace" should be your first option, not "btrfs restore", unless
it's totally damaged and you want to salvage as much as possible.


OK, thank you.


I used 'sudo btrfs restore -v /dev/sde1 /mnt/Old4TB' and
received 'Error mkdiring /mnt/Old4TB/Jayda TV:2'.


No extra info, like something being restored successfully? Just 'Error mkdiring
/mnt/Old4TB/Jayda TV:2'?


Correct, the program just exited.


At least it's ENOENT; checking mkdir(3p) should give you the reason:
---
ENOENT A component of the path prefix specified by path does not name
       an existing directory or path is an empty string.

---

Did the dir "/mnt/Old4TB" exist in the first place?


I see, errno was set to 2; ENOENT.  Thank you. Yes, Old4TB is a drive 
that is auto mounted on boot to that permanent mount point.  I was able 
to create a folder on that drive using both mkdir and sudo mkdir.  Maybe 
the btrfs-progs Ubuntu package is not configured correctly... although 
the forum post shows at least one other person has had this same error.
I am not sure what distro they were using.
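
For reference, the ENOENT case quoted from mkdir(3p) above can be reproduced outside of btrfs-progs. The following is only a minimal sketch, not code from btrfs restore: try_mkdir() is a hypothetical helper name, and it simply hands back the errno value that mkdir() sets, which on Linux is 2 for ENOENT, the same number that follows the colon in the error message.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper for illustration only (not part of btrfs-progs).
 * Returns 0 if the directory was created, otherwise the errno value
 * set by mkdir().
 */
int try_mkdir(const char *path)
{
	if (mkdir(path, 0755) == 0)
		return 0;
	return errno;
}
```

Calling try_mkdir() with a path whose parent directory does not exist, or with an empty string, is expected to return ENOENT, which matches the two conditions listed in the man page excerpt.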



Thanks,
Qu


I found one forum
post that said I needed to make the destination folder manually, then
restore. That did not work. Looking at your code, is the 2 (%d) a kernel
message? I'm not sure what to make of it. I decided to enter a root
environment with 'sudo su' and the restore worked (the folder still
existed from the previous troubleshooting step). The console is showing files
being restored. I tried a dry run first, which did not show an error.
Just some feedback and reference.

uname -a

Linux emb 4.10.0-38-generic #42-16.04.1-Ubuntu SMP Tue Oct 10 16:21:20
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version

btrfs-progs v4.4

Thank you,

Jayotis

[PATCH v2] btrfs: handle dynamically reappearing missing device

2017-11-17 Thread Anand Jain

If a device is not present at the time of a degraded (-o degraded) mount,
the mount context creates a dummy missing struct btrfs_device.
Later this device may reappear after the FS is mounted, and
the device is then included in the device list, but it has missed the
open_device part. This patch handles that case by going
through the open_device steps which the device missed, and finally
adds it to the device alloc list.

So now with this patch, to bring back the missing device the user can run,

   btrfs dev scan 

Without this kernel patch, 'btrfs fi show' and 'btrfs dev ready'
would tell you that the missing device has reappeared successfully,
but in the kernel FS layer it actually has not.

Signed-off-by: Anand Jain 
---
This patch needs:
 [PATCH 0/4]  factor __btrfs_open_devices()

v2:
Add more comments.
Add more change log.
Also check whether device->missing is set, to handle the case where
the device open fails and the user reruns the dev scan.
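
The state handling described in these notes can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: the toy_* types and toy_device_reappeared() are hypothetical names, and the open_ok flag stands in for the result of btrfs_open_one_device(). It shows the one rule the v2 changes rely on: the missing flag and the missing-device counter are only touched when the open succeeds, so a later rescan can retry.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration; these are not the kernel structures. */
struct toy_device {
	bool missing;
	bool bdev_open;
};

struct toy_fs_devices {
	int missing_devices;
};

/*
 * Model of a scan finding a device that was missing at mount time.
 * open_ok stands in for the result of opening the block device.
 * Returns 1 if the device joined, 0 if nothing changed.
 */
int toy_device_reappeared(struct toy_fs_devices *fs, struct toy_device *dev,
			  bool open_ok)
{
	if (!dev->missing)
		return 0;	/* nothing to do for a present device */
	if (!open_ok)
		return 0;	/* stays missing; a later rescan can retry */

	dev->bdev_open = true;
	dev->missing = false;
	fs->missing_devices--;
	return 1;
}
```

With one missing device, a failed open leaves the counter at 1, a later successful open drops it to 0, and any further scan is a no-op.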

 fs/btrfs/volumes.c | 63 +++---
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd73edc2602..b3224baa6db4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -725,7 +725,9 @@ static noinline int device_list_add(const char *path,
 
ret = 1;
device->fs_devices = fs_devices;
-   } else if (!device->name || strcmp(device->name->str, path)) {
+   } else if (!device->name ||
+   strcmp(device->name->str, path) ||
+   device->missing) {
/*
 * When FS is already mounted.
 * 1. If you are here and if the device->name is NULL that
@@ -769,8 +771,63 @@ static noinline int device_list_add(const char *path,
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (device->missing) {
-   fs_devices->missing_devices--;
-   device->missing = 0;
+   int ret;
+   struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+   fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+   if (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_SEEDING)
+   fmode &= ~FMODE_WRITE;
+
+   /*
+* Missing can be set only when FS is mounted.
+* So here its always fs_devices->opened > 0 and most
+* of the struct device members are already updated by
+* the mount process even if this device was missing, so
+* now follow the normal open device procedure for this
+* device. The scrub will take care of filling the
+* missing stripes for raid56 and balance for raid1 and
+* raid10.
+*/
+   ASSERT(fs_devices->opened);
+   mutex_lock(&fs_devices->device_list_mutex);
+   mutex_lock(&fs_info->chunk_mutex);
+   /*
+* By not failing for the reason that
+* btrfs_open_one_device() could fail, it will
+* keep the original behaviors as it is for now.
+* That's fix me for later.
+*/
+   ret = btrfs_open_one_device(fs_devices, device, fmode,
+   fs_info->bdev_holder);
+   if (!ret) {
+   /*
+* Making sure missing is set to 0
+* only when bdev != NULL is as expected
+* at most of the places in the code.
+* Further, what if user fixes the
+* dev and reruns the dev scan, then again
+* we need to come here to reset missing.
+*/
+   fs_devices->missing_devices--;
+   device->missing = 0;
+   btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
+   btrfs_warn(fs_info,
+   "BTRFS: device %s devid %llu uuid %pU joined\n",
+   path, devid, device->uuid);
+   }
+
+   if (device->writeable &&
+   !device->is_tgtdev_for_dev_replace) {
+   fs_devices->total_rw_bytes +=
+   device->total_bytes;
+   atomic64_add(device->total_bytes -
+

Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-17 Thread Austin S. Hemmelgarn


On 2017-11-16 20:30, Qu Wenruo wrote:



On 2017-11-17 00:47, Austin S. Hemmelgarn wrote:



This is at least less complicated than dm-integrity.

Just a new hook for READ bio. And it can start from easy part.
Like starting from dm-raid1 and other fs support.

It's less complicated for end users (in theory, but cryptsetup devs are
working on that for dm-integrity), but significantly more complicated
for developers.

It also brings up the question of what happens when you want some other
layer between the filesystem and the MD/DM RAID layer (say, running
bcache or dm-cache on top of the RAID array).  In the case of
dm-integrity, that's not an issue because dm-integrity is entirely
self-contained, it doesn't depend on other layers beyond the standard
block interface.


Each layer can choose to drop the support for extra verification.

If the layer is not modifying the data, it can pass it to the lower layer,
just like the integrity payload.
Which then makes things a bit more complicated in every other layer as 
well, in turn making things more complicated for all developers.




As I mentioned in my other reply on this thread, running with
dm-integrity _below_ the RAID layer instead of on top of it will provide
the same net effect, and in fact provide a stronger guarantee than what
you are proposing (because dm-integrity does real cryptographic
integrity verification, as opposed to just checking for bit-rot).


Although with more CPU usage for each device, even though they contain the
same data.

I never said it wasn't higher resource usage.






If your checksum is calculated and checked at FS level there is no added
value when you spread this logic to other layers.


That's why I'm moving the checking part to lower level, to make more
value from the checksum.



dm-integrity adds basic 'check-summing' to any filesystem without the
need to modify fs itself


Well, despite the fact that modern filesystems have already implemented
their metadata csum.

   - the price paid is that if there is a bug between
passing data from 'fs' to 'dm-integrity', it cannot be captured.

Advantage of having separated 'fs' and 'block' layer is in its
separation and simplicity at each level.


Totally agreed on this.

But the idea here should not bring that large impact (compared to big
things like ZFS/Btrfs).

1) It only affects READ bios
2) Every dm target can choose whether to support the hook or pass it down.
     There is no point in supporting it for RAID0, for example.
     And for complex RAID levels like RAID5/6 there is no need to support it
     from the very beginning.
3) The main part of the functionality is already implemented
     The core complexity consists of 2 parts:
     a) checksum calculation and checking
    Modern filesystems are already doing this, at least for metadata.
     b) recovery
    dm targets already have this implemented for the supported RAID
    profiles.
     All of this is already implemented; just moving it to a different
     point in time should not require such a big modification IIRC.


If you want integrated solution - you are simply looking for btrfs where
multiple layers are integrated together.


With such a verification hook (along with something extra to handle
scrub), btrfs chunk mapping could be re-implemented with device-mapper:

In fact btrfs logical space is just a dm-linear device, and each chunk
can be implemented by its corresponding dm-* module like:

dm-linear:   | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
and
btrfs chunk 1: metadata, using dm-raid1 on diskA and diskB
btrfs chunk 2: data, using dm-raid0 on disk A B C D
...
btrfs chunk n: system, using dm-raid1 on disk A B

At least btrfs can take the advantage of the simplicity of separate
layers.
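
The layout drawn above (a linear logical space cut into chunks, each backed by its own RAID profile) can be sketched as a plain lookup table. This is only an illustration of the idea, assuming toy types and names (toy_chunk, toy_lookup_chunk) that are neither btrfs nor device-mapper code: given a logical address, find the chunk that covers it and therefore the profile that would handle the I/O.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy profiles; a real mapping would carry per-chunk device lists too. */
enum toy_profile { TOY_RAID0, TOY_RAID1 };

struct toy_chunk {
	uint64_t logical;	/* start of the chunk in the linear space */
	uint64_t length;	/* size of the chunk in bytes */
	enum toy_profile profile;
};

/*
 * Linear scan over the chunk table, the way a dm-linear style table
 * would be consulted. Returns NULL for an unmapped logical address.
 */
const struct toy_chunk *toy_lookup_chunk(const struct toy_chunk *chunks,
					 size_t nr, uint64_t logical)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (logical >= chunks[i].logical &&
		    logical - chunks[i].logical < chunks[i].length)
			return &chunks[i];
	}
	return NULL;
}
```

A metadata chunk on RAID1 followed by a data chunk on RAID0 would be two table entries; an address past the last chunk simply has no mapping.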

And other filesystems would get a somewhat higher chance of recovering their
metadata if built on dm-raid.

Again, just put dm-integrity below dm-raid.  The other filesystems
primarily have metadata checksums to catch data corruption, not repair
it,


Because they have no extra copy.
If they have, they will definitely use the extra copy to repair.
But they don't have those extra copies now, so that really becomes 
irrelevant as an argument (especially since it's not likely they will 
add data or metadata replication in the filesystem any time in the near 
future).



and I severely doubt that you will manage to convince developers to
add support in their filesystem (especially XFS) because:
1. It's a layering violation (yes, I know BTRFS is too, but that's a bit
less of an issue because it's a completely self-contained layering
violation, while this isn't).


If passing something along with bio is violating layers, then integrity
payload is already doing this for a long time.
The block integrity layer is also interfacing directly with hardware and 
_needs_ to pass that data down.  Unless I'm mistaken, it also doesn't do 
any verification except in the filesystem layer, and doesn't pass down 
any complaints about the integrity of the data (it may try to re-read 
it, but that's not the same as what

Re: [PATCH] btrfs: handle dynamically reappearing missing device

2017-11-17 Thread Anand Jain




On 11/17/2017 07:28 AM, Liu Bo wrote:

On Sun, Nov 12, 2017 at 06:56:50PM +0800, Anand Jain wrote:

If the device is not present at the time of (-o degrade) mount
the mount context will create a dummy missing struct btrfs_device.
Later this device may reappear after the FS is mounted.



This commit log doesn't explain what would happen when the missing
device re-appears.


 Hm. You mean in the current design, without this patch?
 Nothing happens; it just gives a false impression to the user, via
'btrfs dev ready' and 'btrfs fi show', that the missing device
 has been included. Will update the change log.


So this
patch handles that case by going through the open_device steps



which this device missed and finally adds to the device alloc list.

So now with this patch, to bring back the missing device user can run,

btrfs dev scan 


Most common distros already have some udev rules of btrfs to process a
reappeared disk.


 Right. Do you see anything that will break ?
 Planning to add comments [1] in v2 for more clarity around this.



Signed-off-by: Anand Jain 
---
This patch needs:
  [PATCH 0/4]  factor __btrfs_open_devices()

  fs/btrfs/volumes.c | 43 +--
  1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d24e966ee29f..e7dd996831f2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -769,8 +769,47 @@ static noinline int device_list_add(const char *path,
rcu_string_free(device->name);
rcu_assign_pointer(device->name, name);
if (device->missing) {
-   fs_devices->missing_devices--;
-   device->missing = 0;
+   int ret;
+   struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+   fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+   if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING)
+   fmode &= ~FMODE_WRITE;
+
+   /*
+* Missing can be set only when FS is mounted.
+* So here its always fs_devices->opened > 0 and most
+* of the struct device members are already updated by
+* the mount process even if this device was missing, so
+* now follow the normal open device procedure for this
+* device. The scrub will take care of filling the
+* missing stripes for raid56 and balance for raid1 and
+* raid10.
+*/



The idea looks good to me.

What if users skip balance/scrub given both could take a while?

Then btrfs takes the disks back and reads content from it, and it's OK
for raid1/10 as they may have a good copy, 


 Yes, except for the split brain situation for which patches are
 in the ML, review comments appreciated.


but in case of raid56, it
might hit the reconstruction bug if two disks are missing and added
back, which I tried to address recently.



At least ->writable should not be set until balance/scrub completes
the re-sync job.


 Hm. Why ?


Thanks,

-liubo


+   ASSERT(fs_devices->opened);
+   mutex_lock(&fs_devices->device_list_mutex);
+   mutex_lock(&fs_info->chunk_mutex);



[1]
/*
 * By not failing for the
 * reason that btrfs_open_one_device() could
 * fail, it will keep the original behaviors as
 * it is for now. That's fix me for later.
 */

+   ret = btrfs_open_one_device(fs_devices, device, fmode,
+   fs_info->bdev_holder);
+   if (!ret) {


/*
 * Making sure missing is set to 0
 * only when bdev != NULL is as expected
 * at most of the places in the code.
 * Further, what if user fixes the
 * dev and reruns dev scan, then again
 * we need to come here.
 */


+   fs_devices->missing_devices--;
+   device->missing = 0;
+   btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
+   btrfs_warn(fs_info,
+   "BTRFS: device %s devid %llu uuid %pU joined\n",
+   path, devid, device->uuid);
+   }


 Also added check for missing as below in v2.

--
@@ -725,7 +725,9 @@ static noinline int

Re: btrfs check: add_missing_dir_index: BUG_ON `ret` triggered, value -17

2017-11-17 Thread Qu Wenruo



On 2017-11-17 15:30, Marc MERLIN wrote:
> Here's the whole output:
> gargamel:~# btrfs-debug-tree -t 258 /dev/mapper/raid0d1 | grep 1919805647

Sorry, I missed the "-C10" parameter for grep.

> location key (1919805647 INODE_ITEM 0) type FILE

This should be the DIR_ITEM/DIR_INDEX pointing to that inode.

However, there is only one hit, which means one DIR_ITEM/DIR_INDEX is missing.
Maybe btrfs-progs re-inserting the wrong DIR_INDEX/DIR_ITEM is what
leads to -EEXIST.

This needs extra info from "-C10" to determine.

> item 30 key (1919805647 INODE_ITEM 0) itemoff 14415 itemsize 160
> item 31 key (1919805647 INODE_REF 1919785864) itemoff 14389 itemsize 26

Here we know the parent inode number is 1919785864.
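
For an INODE_REF item the key offset, the third field of the key, is the
inode number of the parent directory, which is how the number above is read
off the dump. A small sketch of pulling both numbers out of such a line
follows; toy_parse_inode_ref() is a hypothetical helper written for this
note and is not part of btrfs-progs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the inode number and the parent directory inode number from one
 * line of dump output such as
 *   item 31 key (1919805647 INODE_REF 1919785864) itemoff 14389 itemsize 26
 * Returns 0 on success, -1 if the line holds no INODE_REF key.
 * On failure the output values must not be relied on.
 */
static int toy_parse_inode_ref(const char *line, unsigned long long *ino,
			       unsigned long long *parent)
{
	/* The key is the first parenthesised group on the line. */
	const char *key = strchr(line, '(');

	if (!key)
		return -1;
	/* Both numbers must be converted for the line to count as a match. */
	if (sscanf(key, "(%llu INODE_REF %llu)", ino, parent) != 2)
		return -1;
	return 0;
}
```

Lines with a different key type, such as the INODE_ITEM line above, are
rejected with -1.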

> item 32 key (1919805647 EXTENT_DATA 0) itemoff 14336 itemsize 53
> parent transid verify failed on 1173964603392 wanted 2244945 found 2247404
> parent transid verify failed on 1173964603392 wanted 2244945 found 2247404
> parent transid verify failed on 1173964603392 wanted 2244945 found 2247404
> parent transid verify failed on 1173964603392 wanted 2244945 found 2247404
> Ignoring transid failure

I think the transid failure is the root cause.
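
For readers unfamiliar with the message: the parent's pointer to a tree block
records the generation the child is expected to have ("wanted"), and the
child's header carries the generation it was actually written with ("found").
A toy model of that comparison, with a hypothetical function name and none of
the real locking or retry logic, could look like this:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model of the check behind "parent transid verify failed".
 * bytenr: logical address of the child block
 * wanted: generation recorded in the parent's pointer
 * found:  generation stored in the child's own header
 * Returns 0 when they match, -1 otherwise.
 */
static int toy_verify_parent_transid(uint64_t bytenr, uint64_t wanted,
				     uint64_t found)
{
	if (wanted == found)
		return 0;

	fprintf(stderr,
		"parent transid verify failed on %llu wanted %llu found %llu\n",
		(unsigned long long)bytenr, (unsigned long long)wanted,
		(unsigned long long)found);
	return -1;
}
```

In the log above, found is larger than wanted, that is, the block on disk is
newer than what the parent points to.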

What we could try is to avoid the BUG_ON(), but I'm afraid the
problem is more severe than I expected.
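
As a rough illustration of what avoiding the BUG_ON() could mean, the sketch
below reports the failed insertion and hands the error back to the caller
instead of aborting. It is only a sketch under assumptions:
toy_handle_dir_index_insert() is a hypothetical name, and how the real
add_missing_dir_index() callers would deal with the returned error is not
shown.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical wrapper around the result of a dir index insertion.
 * insert_ret models the value that currently feeds BUG_ON(ret):
 * 0 on success, a negative errno such as -EEXIST on failure.
 * Instead of aborting, report the failure and return the error.
 */
static int toy_handle_dir_index_insert(unsigned long long ino, int insert_ret)
{
	if (insert_ret == 0)
		return 0;

	fprintf(stderr,
		"failed to repair dir index for inode %llu: %s\n",
		ino, strerror(-insert_ret));
	return insert_ret;
}
```

This does not fix the underlying corruption; it only keeps the tool from
aborting in the middle of a repair.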

Any idea how such corruption happened?

Thanks,
Qu

> parent transid verify failed on 1652248100864 wanted 2245186 found 2247494
> parent transid verify failed on 1652248100864 wanted 2245186 found 2247494
> parent transid verify failed on 1652248100864 wanted 2245186 found 2247494
> parent transid verify failed on 1652248100864 wanted 2245186 found 2247494
> Ignoring transid failure
> parent transid verify failed on 1174605512704 wanted 2245171 found 2247435
> parent transid verify failed on 1174605512704 wanted 2245171 found 2247435
> parent transid verify failed on 1174605512704 wanted 2245171 found 2247435
> parent transid verify failed on 1174605512704 wanted 2245171 found 2247435
> Ignoring transid failure
> WARNING: eb corrupted: item 130 eb level 0 next level 2, skipping the rest
> 
> 
> On Thu, Nov 16, 2017 at 10:17:07PM -0800, Marc MERLIN wrote:
>> On Fri, Nov 17, 2017 at 01:17:19PM +0800, Qu Wenruo wrote:
>>>
>>>
>>> On 2017-11-17 10:26, Marc MERLIN wrote:
>>>> Howdy,
>>>>
>>>> Up to date git pull from btrfs-progs:
>>>>
>>>> gargamel:~# btrfs check --repair /dev/mapper/raid0d1
>>>> enabling repair mode
>>>> Checking filesystem on /dev/mapper/raid0d1
>>>> UUID: 01334b81-c0db-4e80-92e4-cac4da867651
>>>> checking extents
>>>> corrupt extent record: key 203003699200 168 40960
>>>> corrupt extent record: key 203003764736 168 172032
>>>> ref mismatch on [203003699200 40960] extent item 0, found 1
>>>> Data backref 203003699200 root 258 owner 1933897829 offset 0 num_refs 0 not found in extent tree
>>>> Incorrect local backref count on 203003699200 root 258 owner 1933897829 offset 0 found 1 wanted 0 back 0x5596988c2130
>>>> backpointer mismatch on [203003699200 40960]
>>>> repair deleting extent record: key 203003699200 168 40960
>>>> adding new data backref on 203003699200 root 258 owner 1933897829 offset 0 found 1
>>>> Repaired extent references for 203003699200
>>>> Data backref 203003764736 root 258 owner 1932315368 offset 0 num_refs 0 not found in extent tree
>>>> Incorrect local backref count on 203003764736 root 258 owner 1932315368 offset 0 found 1 wanted 0 back 0x5596dde358f0
>>>> backpointer mismatch on [203003764736 172032]
>>>> repair deleting extent record: key 203003764736 168 172032
>>>> adding new data backref on 203003764736 root 258 owner 1932315368 offset 0 found 1
>>>> Repaired extent references for 203003764736
>>>> Fixed 0 roots.
>>>> checking free space cache
>>>> cache and super generation don't match, space cache will be invalidated
>>>> checking fs roots
>>>> invalid location in dir item 0
>>>> Deleting bad dir index [1919785864,96,1958] root 258
>>>> Deleting bad dir index [1919785864,96,1957] root 258
>>>> repairing missing dir index item for inode 1919805647
>>>> cmds-check.c:2614: add_missing_dir_index: BUG_ON `ret` triggered, value -17
>>>
>>> -EEXIST. Btrfs check --repair is trying to re-insert some key which
>>> exists already.
>>>
>>> Would you please provide the output of "btrfs-debug-tree -t 258 | grep
>>> 1919805647" to help the debugging?
>>  
>> Sure. It may run all night and I'm going to bed now, but the output I
>> got so far, is:
>> gargamel:~# btrfs-debug-tree -t 258 /dev/mapper/raid0d1 | grep 1919805647
>> location key (1919805647 INODE_ITEM 0) type FILE
>> item 30 key (1919805647 INODE_ITEM 0) itemoff 14415 itemsize 160
>> item 31 key (1919805647 INODE_REF 1919785864) itemoff 14389 itemsize 26
>> item 32 key (1919805647 EXTENT_DATA 0) itemoff 14336 itemsize 53
>> (...)
>>
>> I'll post tomorrow if I get more overnight
>>
>> Marc
>> -- 
>> "A mouse is a device used to point at the xterm
