Re: btrfs restore corrupt file

2017-11-22 Thread Qu Wenruo


On 2017-11-23 13:25, Chris Murphy wrote:
> On Wed, Nov 22, 2017 at 12:18 PM, Jorge Bastos  
> wrote:
>> Hello,
>>
>> While doing btrfs checksum testing I purposely corrupted a file and
>> got the expected I/O error when trying to copy it. I also tested btrfs
>> restore to see if I could recover a known corrupt file and it did copy
>> it but there was no checksum error or warning. I used btrfs restore -v
>>
>> Is this expected behavior, or should restore warn about checksum failures?
>>
>> Kernel used was 4.13.13,  btrfs-progs v4.13.2
> 
> I think it's expected. "The checks done by restore are less strict"
> from the man page. Although it'd be nice if the -v option at least could
> flag such files as possibly being corrupt.
> 
I think people always consider "btrfs restore" as a tool to "restore" data.

A more accurate name for it would be "salvage", and it should be moved
under "btrfs rescue", to reduce the confusion.

Thanks,
Qu





Re: WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-22 Thread Jérôme Carretero
Hi,

On Wed, 22 Nov 2017 15:35:35 -0800
Liu Bo  wrote:

> On Mon, Nov 20, 2017 at 02:00:07AM -0500, Jérôme Carretero wrote:
> > [ cut here ] [633254.461294] WARNING: CPU:
> > 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848
> > __free_raid_bio+0x8e/0xa0  
> 
> 
> The vanilla 4.14.0 shows it is
> WARN_ON(!bio_list_empty(&rbio->bio_list)); but we just emptied
> rbio->bio_list two lines above, i.e. struct bio *cur =
> bio_list_get(&rbio->bio_list);
> 
> Either we have some weird race, or the line number is misleading me.
> 
> Can you please check the code which warning fs/btrfs/raid56.c:848
> points to?

Same code as yours:
WARN_ON(!bio_list_empty(&rbio->bio_list));

So yeah, at least git is not broken; it could be a very weird
compiler bug, or a less-weird race indeed...


Regards,

-- 
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs restore corrupt file

2017-11-22 Thread Chris Murphy
On Wed, Nov 22, 2017 at 12:18 PM, Jorge Bastos  wrote:
> Hello,
>
> While doing btrfs checksum testing I purposely corrupted a file and
> got the expected I/O error when trying to copy it. I also tested btrfs
> restore to see if I could recover a known corrupt file and it did copy
> it but there was no checksum error or warning. I used btrfs restore -v
>
> Is this expected behavior, or should restore warn about checksum failures?
>
> Kernel used was 4.13.13,  btrfs-progs v4.13.2

I think it's expected. "The checks done by restore are less strict"
from the man page. Although it'd be nice if the -v option at least could
flag such files as possibly being corrupt.

-- 
Chris Murphy


Re: btrfs-progs-4.13: cmds-inspect-tree-stats.h: No such file or directory

2017-11-22 Thread John L. Center


On 11/22/2017 11:15 PM, John L. Center wrote:


On 09/09/2017 01:48 AM, Uli Heller wrote:

I downloaded the btrfs-progs-v4.13.tar.xz tarball and tried to build it.
Unfortunately, I get an error about cmds-inspect-tree-stats.h. The file
seems to be missing from the tarball; it used to exist in v4.12.

...
true -g -Os -fstack-protector --param=ssp-buffer-size=4 -Wformat 
-Werror=format-security -DBTRFS_FLAT_INCLUDES=1 -I. -Ikernel-lib -Itmp 
-include .cc-defines.h -D__CHECKER__ -D__CHECK_ENDIAN__ -Wbitwise 
-Wuninitialized -Wshadow -Wundef -U_FORTIFY_SOURCE btrfs-calc-size.c

 [CC] btrfs-calc-size.o
gcc -g -Os -fstack-protector --param=ssp-buffer-size=4 -Wformat 
-Werror=format-security -DBTRFS_FLAT_INCLUDES=1 -I. -Ikernel-lib -Itmp 
-c btrfs-calc-size.c -o btrfs-calc-size.o  \


btrfs-calc-size.c:22:37: fatal error: cmds-inspect-tree-stats.h: No 
such file or directory

  #include "cmds-inspect-tree-stats.h"
  ^

$ LANG=C ls -l btrfs-progs-v4.12/cmds-inspect-tree-stats.h 
btrfs-progs-v4.13/cmds-inspect-tree-stats.h
ls: cannot access btrfs-progs-v4.13/cmds-inspect-tree-stats.h: No such 
file or directory
-rw-rw-r-- 1 uli uli 905 Jul 28 15:29 
btrfs-progs-v4.12/cmds-inspect-tree-stats.h



The issue exists in the new btrfs-progs-v4.14.tar.gz, also.

 -John


I forgot to mention, it was in btrfs-show-super.c.

-John


Re: btrfs-progs-4.13: cmds-inspect-tree-stats.h: No such file or directory

2017-11-22 Thread John L. Center


On 09/09/2017 01:48 AM, Uli Heller wrote:

I downloaded the btrfs-progs-v4.13.tar.xz tarball and tried to build it.
Unfortunately, I get an error about cmds-inspect-tree-stats.h. The file
seems to be missing from the tarball; it used to exist in v4.12.

...
true -g -Os -fstack-protector --param=ssp-buffer-size=4 -Wformat 
-Werror=format-security -DBTRFS_FLAT_INCLUDES=1 -I. -Ikernel-lib -Itmp 
-include .cc-defines.h -D__CHECKER__ -D__CHECK_ENDIAN__ -Wbitwise 
-Wuninitialized -Wshadow -Wundef -U_FORTIFY_SOURCE btrfs-calc-size.c

     [CC] btrfs-calc-size.o
gcc -g -Os -fstack-protector --param=ssp-buffer-size=4 -Wformat 
-Werror=format-security -DBTRFS_FLAT_INCLUDES=1 -I. -Ikernel-lib -Itmp 
-c btrfs-calc-size.c -o btrfs-calc-size.o  \


btrfs-calc-size.c:22:37: fatal error: cmds-inspect-tree-stats.h: No such 
file or directory

  #include "cmds-inspect-tree-stats.h"
  ^

$ LANG=C ls -l btrfs-progs-v4.12/cmds-inspect-tree-stats.h 
btrfs-progs-v4.13/cmds-inspect-tree-stats.h
ls: cannot access btrfs-progs-v4.13/cmds-inspect-tree-stats.h: No such 
file or directory
-rw-rw-r-- 1 uli uli 905 Jul 28 15:29 
btrfs-progs-v4.12/cmds-inspect-tree-stats.h



The issue exists in the new btrfs-progs-v4.14.tar.gz, also.

-John



Re: misc/021-image-multi-devices fails on latest btrfs-devel/misc-next

2017-11-22 Thread Lakshmipathi.G
Quick update: It continues to fail:
https://asciinema.org/a/8HAL3uZ4MG7eb5gXnlgq3jm7q

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org


On Sat, Nov 18, 2017 at 12:59 PM, Lakshmipathi.G
 wrote:
> Hi.
>
> A few days back (Nov 14th) this test script (misc/021-image-multi-devices)
> used to pass, and it seems like it's failing now, maybe due to recent
> commits? Logs and screen-casts available here:
> https://lakshmipathi.github.io/btrfsqa/
>
> 
> Cheers,
> Lakshmipathi.G
> http://www.giis.co.in http://www.webminal.org


Re: [PATCH v7 0/4] Add the ability to do BPF directed error injection

2017-11-22 Thread Alexei Starovoitov
On Wed, Nov 22, 2017 at 04:23:29PM -0500, Josef Bacik wrote:
> This is hopefully the final version; I've addressed the comment by Ingo and
> added his Acks.
> 
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.

Thanks Josef!
All patches look great to me.
We'll probably take them all into bpf-next.git to start testing together
with other bpf changes and when net-next reopens will send them to Dave.
Then optionally can send pull-req for the first patch only to tip if Ingo
thinks that there can be conflicts with the work happening in parallel
on kprobe/x86 bits.
This way hopefully there will be no conflicts during the next merge window.
Makes sense?

> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
>   feature.  This way only functions that opt-in will be allowed to be
>   overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that the 
> bpf
>   sample actually works.
> 
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map array so we
>   don't tail call into something we didn't check.  This allows us to make the
>   normal path still fast without a bunch of percpu operations.
> 
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniel's suggestion.
> 
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
>   ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read 
> this
>   value in the kprobe path, and thus only write to it if we're overriding or
>   clearing the override.
> 
> v1->v2:
> - moved things around to make sure that bpf_override_return could really only 
> be
>   used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
>   it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of 
> bpf_kprobe_state.
> - updated the test as per Alexei's review.
> 
> - Original message -
> 
> A lot of our error paths are not well tested because we have no good way of
> injecting errors generically.  Some subsystems (block, memory) have ways to
> inject errors, but they are random so it's hard to get reproducible results.
> 
> With BPF we can add determinism to our error injection.  We can use kprobes 
> and
> other things to verify we are injecting errors at the exact case we are trying
> to test.  This patch gives us the tool to actually do the error injection part.
> It is very simple, we just set the return value of the pt_regs we're given to
> whatever we provide, and then override the PC with a dummy function that 
> simply
> returns.
> 
> Right now this only works on x86, but it would be simple enough to expand to
> other architectures.  Thanks,
> 
> Josef


Re: uncorrectable errors in Raid 10

2017-11-22 Thread Qu Wenruo


On 2017-11-23 00:38, Steffen Sindzinski wrote:
> Hello,
> 
> I did btrfs check --readonly on both disk without finding any error. To
> reconfirm I did a scrub again which still has found 2 uncorrectable

Still metadata corruption?

> errors. I had to boot into Arch Linux 4.13.12-1-ARCH
> btrfs-progs v4.13 to run btrfs check.

Well, btrfs-progs v4.13 from Arch has been out of date for a while.
Even ignoring the latest v4.14 release, there are no v4.13.x releases
for Arch.

> 
> 
> Checking filesystem on /dev/sde2
> UUID: 4fafd0d4-7dd9-4dcc-9a33-5f1ad9555358
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 1598669742080 bytes used, no error found
> total csum bytes: 1540740308
> total tree bytes: 19391266816
> total fs tree bytes: 9844703232
> total extent tree bytes: 6808731648
> btree space waste bytes: 3471512438
> file data blocks allocated: 2124703174656
>  referenced 1454353633280
> sudo btrfs check --readonly /dev/sde2  2708,20s user 594,58s system 15%
> cpu 5:45:40,36 total
> ***
> sudo btrfs check --readonly /dev/sdd2
> Checking filesystem on /dev/sdd2
> UUID: 4fafd0d4-7dd9-4dcc-9a33-5f1ad9555358
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 1598669742080 bytes used, no error found
> total csum bytes: 1540740308
> total tree bytes: 19391266816
> total fs tree bytes: 9844703232
> total extent tree bytes: 6808731648
> btree space waste bytes: 3471512438
> file data blocks allocated: 2124703174656
>  referenced 1454353633280
> sudo btrfs check --readonly /dev/sdd2  2655,23s user 626,48s system 15%
> cpu 5:56:16,85 total

Just as Chris mentioned, if the scrub is reporting metadata corruption,
please try "btrfs check --mode=lowmem" to see if lowmem mode can detect
such corruption.

> 
> 
> I mentioned that I cannot access 2 directories anymore. Permission is
> denied; not even as root do I have permission, nor can I change
> ownership. This happens in Ubuntu 17.10, kernel 4.13.0-17-generic. It
> looks like this:
> 
> ls -la The*
> The-Wire:
> ls: cannot access 'The-Wire/.': Permission denied
> ls: cannot access 'The-Wire/..': Permission denied
> <...>
> total 0
> d? ? ? ? ?    ? .
> d? ? ? ? ?    ? ..
> -? ? ? ? ?    ? The Wire S1E10.m4v
> -? ? ? ? ?    ? The Wire S1E11.m4v

Normally this means DIR_INDEX/DIR_ITEM missing or points to incorrect inode.

> <...>
> 
> Oddly, in Arch Linux, 4.13.12-1-ARCH, btrfs-progs v4.13, I can access
> this same directory, and permission is set correctly.
>
> The debug trees for both drives you may find attached to this mail. They
> were done in Ubuntu.

Not really helpful.
I don't know how you took the dump, but it is only a leaf of the
subvolume tree, full of EXTENT_DATA items without any useful info.

Thanks,
Qu

> 
> For reference here the scrub report on Arch linux.
> 
> 
> Steffen
> 
> 
> 
> Am 20.11.2017 um 21:26 schrieb Chris Murphy:
>> On Mon, Nov 20, 2017 at 1:36 AM, Steffen Sindzinski
>>  wrote:
>>> Hi,
>>>
>>> I did btrfs-debug-tree for this block on both devices. The result is
>>> attached to this mail.
>>>
>>> It is really weird: same block, different drives, different sector. I
>>> have no problem with bit rot. Btrfs worked perfectly fine with both
>>> these HDDs for so long on Raid1. The drive sdf1, which I attached to
>>> form a Raid 10, was also in a different Btrfs in the same machine for
>>> years, flawlessly.
>>>
>>> I have not found any other checksum errors than the ones from this
>>> scrub.
>>>
>>> Is there no way to just safely recreate the checksum of this particular
>>> block from the disk contents?
>>
>> I'll cc Qu because I don't understand what's going on. It's the 2nd
>> case of both copies of metadata being bad in as many days, which could
>> just be coincidence.
>>
>> I also don't understand the specific error "checksum/header error"
>> which sounds to me like Btrfs knows the leaf is otherwise OK, but
>> there is some kind of problem with either the leaf csum or its header.
>> In which case I'd like to think that btrfs check --repair can fix this
>> kind of problem.
>>
>> What do you get for btrfs check without --repair?
>>
>> Curiously, your scrub complains about this "checksum/header error" but
>> btrfs-debug-tree gives no indication that leaf has any problem at all.
>>
>>
>>
>>





Re: WARNING: CPU: 3 PID: 20953 at /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0

2017-11-22 Thread Liu Bo
On Mon, Nov 20, 2017 at 02:00:07AM -0500, Jérôme Carretero wrote:
> Hi,
> 
> 
> 
> This was while doing a "userspace scrub" with "tar c":
> 
> [633250.707455] btrfs_print_data_csum_error: 14608 callbacks suppressed
> [633250.707459] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0xb8c194fb expected csum 0xb3680c88 mirror 2
> [633250.707465] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707470] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0xa5db59eb expected csum 0xb3680c88 mirror 2
> [633250.707473] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x5d244234 expected csum 0xb3680c88 mirror 2
> [633250.707475] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707478] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0xc0a71540 expected csum 0x904f75bc mirror 2
> [633250.707480] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707483] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530293248 csum 0x7f422a5d expected csum 0xb3680c88 mirror 2
> [633250.707484] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0x0abd2cac expected csum 0x904f75bc mirror 2
> [633250.707488] BTRFS warning (device dm-18): csum failed root 5 ino 1376 off 
> 3530301440 csum 0x0d046c34 expected csum 0x904f75bc mirror 2
> [633250.888501] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230948864 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.937373] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230952960 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.949808] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230957056 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.961703] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230961152 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.973827] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230965248 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.986271] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230969344 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633250.998517] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230973440 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.010537] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230977536 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.022767] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230981632 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633251.034990] BTRFS info (device dm-18): read error corrected: ino 1376 off 
> 230985728 (dev /dev/mapper/I8U2-4 sector 1373688)
> [633254.456570] [ cut here ]
> [633254.461294] WARNING: CPU: 3 PID: 20953 at 
> /usr/src/linux/fs/btrfs/raid56.c:848 __free_raid_bio+0x8e/0xa0


The vanilla 4.14.0 shows it is WARN_ON(!bio_list_empty(&rbio->bio_list));
but we just emptied rbio->bio_list two lines above, i.e.
struct bio *cur = bio_list_get(&rbio->bio_list);

Either we have some weird race, or the line number is misleading me.

Can you please check the code which warning fs/btrfs/raid56.c:848 points to?

thanks,
-liubo

> [633254.470863] Modules linked in: bfq twofish_avx_x86_64 twofish_x86_64_3way 
> xts twofish_x86_64 twofish_common serpent_avx_x86_64 serpent_generic lrw 
> gf128mul ablk_helper algif_skcipher af_alg nfnetlink_queue nfnetlink_log 
> nfnetlink cfg80211 rfkill usbmon fuse usb_storage dm_crypt dm_mod dax 
> coretemp hwmon intel_rapl snd_hda_codec_realtek x86_pkg_temp_thermal 
> snd_hda_codec_generic iTCO_wdt kvm_intel iTCO_vendor_support snd_hda_intel 
> kvm snd_hda_codec irqbypass snd_hwdep aesni_intel snd_hda_core aes_x86_64 
> snd_pcm xhci_pci snd_timer ehci_pci crypto_simd xhci_hcd cryptd ehci_hcd 
> sdhci_pci glue_helper pcspkr snd usbcore sdhci soundcore lpc_ich mmc_core 
> usb_common mfd_core bnx2 bonding autofs4 [last unloaded: i2c_dev]
> [633254.533987] CPU: 3 PID: 20953 Comm: kworker/u16:18 Tainted: GW
>4.14.0-Vantage-dirty #14
> [633254.543298] Hardware name: LENOVO 056851U/LENOVO, BIOS A0KT56AUS 
> 02/01/2016
> [633254.550365] Workqueue: btrfs-endio btrfs_endio_helper
> [633254.08] task: 880859523b00 task.stack: c90006164000
> [633254.561528] RIP: 0010:__free_raid_bio+0x8e/0xa0
> [633254.566143] RSP: 0018:c90006167bc8 EFLAGS: 00010282
> [633254.571457] RAX: 88052540d010 RBX: 8801ffd02800 RCX: 
> 0001
> [633254.578683] RDX: 88052540d010 RSI: 0246 RDI: 
> 88052540d000
> [633254.585912] RBP: 88052540d000 R08:  R09: 
> 0001
> [633254.593131] R10: 8805ad3a2e60 R11: 

[PATCH v7 0/4] Add the ability to do BPF directed error injection

2017-11-22 Thread Josef Bacik
This is hopefully the final version; I've addressed the comment by Ingo and
added his Acks.

v6->v7:
- moved the opt-in macro to bpf.h out of kprobes.h.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniel's suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subsystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproducible results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actually do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef


[PATCH v7 5/5] btrfs: allow us to inject errors at io_ctl_init

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

This was instrumental in reproducing a space cache bug.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cdc9f4015ec3..daa98dc1f844 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -332,6 +333,7 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct 
inode *inode,
 
return 0;
 }
+BPF_ALLOW_ERROR_INJECTION(io_ctl_init);
 
 static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 {
-- 
2.7.5



[PATCH v7 4/5] samples/bpf: add a test for bpf_override_return

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov 
Acked-by: Ingo Molnar 
Signed-off-by: Josef Bacik 
---
 samples/bpf/Makefile  |  4 
 samples/bpf/test_override_return.sh   | 15 +++
 samples/bpf/tracex7_kern.c| 16 
 samples/bpf/tracex7_user.c| 28 
 tools/include/uapi/linux/bpf.h|  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh 
b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+   FILE *f;
+   char filename[256];
+   char command[256];
+   int ret;
+
+   snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+
+   snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+   f = popen(command, "r");
+   ret = pclose(f);
+
+   return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
FN(xdp_adjust_meta),\
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
-   FN(getsockopt),
+   FN(getsockopt), \
+   FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL 

[PATCH v7 3/5] bpf: add a bpf_override_function helper

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with its kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_function
helper.  This will modify the probed caller's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov 
Acked-by: Ingo Molnar 
Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h|  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h   |  3 ++-
 include/linux/trace_events.h |  1 +
 include/uapi/linux/bpf.h |  7 -
 kernel/bpf/core.c|  3 +++
 kernel/bpf/verifier.c|  2 ++
 kernel/events/core.c |  7 +
 kernel/trace/Kconfig | 11 
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c  | 55 +++-
 kernel/trace/trace_probe.h   | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct 
pt_regs *regs)
return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
				locked:1,	/* Program image locked? */
				gpl_compatible:1, /* Is filter GPL compatible? */
				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1;	/* Do we need dst entry? */
+				dst_needed:1,	/* Do we need dst entry? */
+				kprobe_override:1; /* Do we override a kprobe? */
	kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 

[PATCH v7 2/5] btrfs: make open_ctree error injectable

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..69d17a640b94 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-- 
2.7.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v7 1/5] add infrastructure for tagging functions as error injectable

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Using BPF we can override kprobed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
per function.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.
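For illustration, the section trick this series relies on can be reproduced in userspace C: GNU ld automatically provides `__start_<section>`/`__stop_<section>` symbols for any section whose name is a valid C identifier, which is the same mechanism the vmlinux.lds.h change uses. Everything below (the `ALLOW_ERROR_INJECTION` macro, `open_ctree_stub`, `within_injection_list`) is an invented stand-in for this sketch, not the kernel code, and the kernel keeps a sorted list rather than scanning linearly:

```c
#include <assert.h>

/* A stand-in for a function we want to allow error injection on. */
static int open_ctree_stub(void) { return 0; }

/* Userspace analogue of BPF_ALLOW_ERROR_INJECTION(): record the function's
 * address in a dedicated section; GNU ld supplies the start/stop symbols. */
#define ALLOW_ERROR_INJECTION(fname)					\
	static void *__attribute__((used, section("error_inject_list")))\
	_eil_addr_##fname = (void *)fname;

ALLOW_ERROR_INJECTION(open_ctree_stub)

extern void *__start_error_inject_list[];
extern void *__stop_error_inject_list[];

/* Model of within_kprobe_error_injection_list(): is this address opted in? */
static int within_injection_list(void *addr)
{
	void **p;

	for (p = __start_error_inject_list; p < __stop_error_inject_list; p++)
		if (*p == addr)
			return 1;
	return 0;
}
```

Only functions whose addresses land in the list are eligible, which is what makes the override feature opt-in.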

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 arch/x86/include/asm/asm.h|   6 ++
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/bpf.h   |  11 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 7 files changed, 201 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b0dc91f4bedc..340f4cc43255 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -85,6 +85,12 @@
_ASM_PTR (entry);   \
.popsection
 
+# define _ASM_KPROBE_ERROR_INJECT(entry)   \
+   .pushsection "_kprobe_error_inject_list","aw" ; \
+   _ASM_ALIGN ;\
+   _ASM_PTR (entry);   \
+   .popsection
+
 .macro ALIGN_DESTINATION
/* check for bad alignment of destination */
movl %edi,%ecx
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST()	. = ALIGN(8);					\
+			VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;	\
+			KEEP(*(_kprobe_error_inject_list))			\
+			VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
#ifdef CONFIG_EVENT_TRACING
#define FTRACE_EVENTS()	. = ALIGN(8);					\
			VMLINUX_SYMBOL(__start_ftrace_events) = .;	\
@@ -560,6 +569,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 520aeebe0d93..552a666a338b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -530,4 +530,15 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list")))   \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+   unsigned int num_kprobe_ei_funcs;
+   unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..bdd7dd724f6f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash)
return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of 

[PATCH v2 04/11] lib: add a __fprop_add_percpu_max

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

This helper allows us to add an arbitrary number of events to the fprop
counters, instead of only incrementing them by one.
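A single-threaded toy model of the new helper's contract (the real code is percpu and period-aware; `fprop_model`, `model_add_max` and the scenario helpers below are invented for this sketch):

```c
#include <assert.h>

#define FPROP_FRAC_BASE 1024UL

/* Toy state: this type's event count, out of a running total of all events. */
struct fprop_model { unsigned long events; };

/* Like __fprop_add_percpu_max(): add nr events, but only while this type's
 * fraction is still below max_frac/FPROP_FRAC_BASE. */
static void model_add_max(struct fprop_model *pl, unsigned long total,
			  unsigned long nr, unsigned long max_frac)
{
	if (max_frac < FPROP_FRAC_BASE &&
	    pl->events * FPROP_FRAC_BASE >= max_frac * total)
		return;
	pl->events += nr;
}

/* The old inc variant becomes a trivial wrapper, exactly as in the patch. */
static void model_inc_max(struct fprop_model *pl, unsigned long total,
			  unsigned long max_frac)
{
	model_add_max(pl, total, 1, max_frac);
}

/* No cap in effect: both add and inc land normally. */
static unsigned long uncapped_case(void)
{
	struct fprop_model pl = { 0 };

	model_add_max(&pl, 100, 5, FPROP_FRAC_BASE);
	model_inc_max(&pl, 100, FPROP_FRAC_BASE);
	return pl.events;
}

/* Already at 60% with a 50% cap (512/1024): the add is skipped. */
static unsigned long capped_case(void)
{
	struct fprop_model pl = { 60 };

	model_add_max(&pl, 100, 5, 512);
	return pl.events;
}
```

The key property the patch preserves is that `inc` is simply `add` with `nr == 1`.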

Signed-off-by: Josef Bacik 
---
 include/linux/flex_proportions.h | 11 +--
 lib/flex_proportions.c   |  9 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h
index 0d348e011a6e..9f88684bf0a0 100644
--- a/include/linux/flex_proportions.h
+++ b/include/linux/flex_proportions.h
@@ -83,8 +83,8 @@ struct fprop_local_percpu {
 int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp);
 void fprop_local_destroy_percpu(struct fprop_local_percpu *pl);
 void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl);
-void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
-			    int max_frac);
+void __fprop_add_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl,
+			    unsigned long nr, int max_frac);
 void fprop_fraction_percpu(struct fprop_global *p,
struct fprop_local_percpu *pl, unsigned long *numerator,
unsigned long *denominator);
@@ -99,4 +99,11 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl)
local_irq_restore(flags);
 }
 
+static inline
+void __fprop_inc_percpu_max(struct fprop_global *p,
+   struct fprop_local_percpu *pl, int max_frac)
+{
+   __fprop_add_percpu_max(p, pl, 1, max_frac);
+}
+
 #endif
diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index b0343ae71f5e..fd95791a2c93 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -255,8 +255,9 @@ void fprop_fraction_percpu(struct fprop_global *p,
  * Like __fprop_inc_percpu() except that event is counted only if the given
  * type has fraction smaller than @max_frac/FPROP_FRAC_BASE
  */
-void __fprop_inc_percpu_max(struct fprop_global *p,
-   struct fprop_local_percpu *pl, int max_frac)
+void __fprop_add_percpu_max(struct fprop_global *p,
+   struct fprop_local_percpu *pl, unsigned long nr,
+   int max_frac)
 {
if (unlikely(max_frac < FPROP_FRAC_BASE)) {
unsigned long numerator, denominator;
@@ -267,6 +268,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
return;
} else
fprop_reflect_period_percpu(p, pl);
-	percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
-	percpu_counter_add(&p->events, 1);
+	percpu_counter_add_batch(&pl->events, nr, PROP_BATCH);
+	percpu_counter_add(&p->events, nr);
 }
-- 
2.7.5



[PATCH v2 06/11] writeback: add counters for metadata usage

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Btrfs has no bounds except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages throttling that comes with using pagecache.
However, as we'd like to support different blocksizes, it would be nice to not
have to rely on pagecache, but still get the balance_dirty_pages throttling
without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
zone and bdi_writeback counters to keep track of how many bytes we have in
flight for METADATA.  We need to count in bytes as blocksizes could be
fractions of the page size.  We simply convert the bytes to a number of pages
where the throttling code needs it.

Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
memory used for metadata on the system.  This is also needed so that things
like dirty throttling know this is dirtyable and easily reclaimed memory.

Signed-off-by: Josef Bacik 
---
 drivers/base/node.c  |   8 +++
 fs/fs-writeback.c|   2 +
 fs/proc/meminfo.c|   8 +++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/mm.h   |   9 +++
 include/linux/mmzone.h   |   3 +
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c |   4 ++
 mm/page-writeback.c  | 142 +++
 mm/page_alloc.c  |  20 --
 mm/util.c|   1 +
 mm/vmscan.c  |  19 +-
 mm/vmstat.c  |   3 +
 13 files changed, 212 insertions(+), 22 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 3855902f2c5b..a39cecc8957a 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -99,7 +101,10 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
n += sprintf(buf + n,
   "Node %d Dirty:  %8lu kB\n"
+  "Node %d MetadataDirty:  %8lu kB\n"
   "Node %d Writeback:  %8lu kB\n"
+  "Node %d MetaWriteback:  %8lu kB\n"
+  "Node %d Metadata:   %8lu kB\n"
   "Node %d FilePages:  %8lu kB\n"
   "Node %d Mapped: %8lu kB\n"
   "Node %d AnonPages:  %8lu kB\n"
@@ -119,8 +124,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		   nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
   nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		   nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
   nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
+		   nid, BtoK(node_page_state(pgdat, NR_METADATA_BYTES)),
   nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
   nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
   nid, K(i.sharedram),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430a2e41..987448ed7698 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1814,6 +1814,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1822,6 +1823,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
return global_node_page_state(NR_FILE_DIRTY) +
global_node_page_state(NR_UNSTABLE_NFS) +
+   BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cdd979724c74..fa1fd24a4d99 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -42,6 +42,8 @@ static void show_val_kb(struct seq_file *m, const char *s, unsigned long num)
seq_write(m, " kB\n", 4);
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
+
 static int meminfo_proc_show(struct seq_file *m, void *v)
 {
struct sysinfo i;
@@ -71,6 +73,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "Buffers:", i.bufferram);
show_val_kb(m, "Cached: ", cached);
show_val_kb(m, "SwapCached: ", total_swapcache_pages());
+   show_val_kb(m, "Metadata:   ",

[PATCH v2 09/11] Btrfs: kill the btree_inode

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

In order to more efficiently support sub-page blocksizes we need to stop
allocating pages from pagecache for our metadata.  Instead switch to using the
account_metadata* counters to make sure we keep the system aware of how much
dirty metadata we have, and use the ->free_cached_objects super operation to
handle freeing up extent buffers.  This greatly simplifies how we deal with
extent buffers, since we no longer have to tie the page cache reclamation
machinery to the extent buffer machinery.  This will also allow us to simply
kmalloc() our data for sub-page blocksizes.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/btrfs_inode.h |   1 -
 fs/btrfs/ctree.c   |  18 +-
 fs/btrfs/ctree.h   |  17 +-
 fs/btrfs/dir-item.c|   2 +-
 fs/btrfs/disk-io.c | 386 --
 fs/btrfs/extent-tree.c |  14 +-
 fs/btrfs/extent_io.c   | 906 ++---
 fs/btrfs/extent_io.h   |  51 +-
 fs/btrfs/inode.c   |   6 +-
 fs/btrfs/print-tree.c  |  13 +-
 fs/btrfs/reada.c   |   2 +-
 fs/btrfs/root-tree.c   |   2 +-
 fs/btrfs/super.c   |  31 +-
 fs/btrfs/tests/btrfs-tests.c   |  36 +-
 fs/btrfs/tests/extent-buffer-tests.c   |   3 +-
 fs/btrfs/tests/extent-io-tests.c   |   4 +-
 fs/btrfs/tests/free-space-tree-tests.c |   3 +-
 fs/btrfs/tests/inode-tests.c   |   4 +-
 fs/btrfs/tests/qgroup-tests.c  |   3 +-
 fs/btrfs/transaction.c |  13 +-
 20 files changed, 744 insertions(+), 771 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f9c6887a8b6c..24582650622d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -241,7 +241,6 @@ static inline u64 btrfs_ino(const struct btrfs_inode *inode)
u64 ino = inode->location.objectid;
 
/*
-* !ino: btree_inode
 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 */
if (!ino || inode->location.type == BTRFS_ROOT_ITEM_KEY)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 531e0a8645b0..3c6610b5d0d3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1361,7 +1361,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,

if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
BUG_ON(tm->slot != 0);
-   eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
+   eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info,
+					eb->start, eb->len);
if (!eb_rewin) {
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
@@ -1444,7 +1445,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
} else if (old_root) {
btrfs_tree_read_unlock(eb_root);
free_extent_buffer(eb_root);
-   eb = alloc_dummy_extent_buffer(fs_info, logical);
+   eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical,
+  root->fs_info->nodesize);
} else {
btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK);
eb = btrfs_clone_extent_buffer(eb_root);
@@ -1675,7 +1677,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
continue;
}
 
-   cur = find_extent_buffer(fs_info, blocknr);
+   cur = find_extent_buffer(fs_info->eb_info, blocknr);
if (cur)
uptodate = btrfs_buffer_uptodate(cur, gen, 0);
else
@@ -1748,7 +1750,7 @@ static noinline int generic_bin_search(struct extent_buffer *eb,
int err;
 
if (low > high) {
-   btrfs_err(eb->fs_info,
+   btrfs_err(eb->eb_info->fs_info,
 "%s: low (%d) > high (%d) eb %llu owner %llu level %d",
  __func__, low, high, eb->start,
  btrfs_header_owner(eb), btrfs_header_level(eb));
@@ -2260,7 +2262,7 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 
search = btrfs_node_blockptr(node, slot);
blocksize = fs_info->nodesize;
-   eb = find_extent_buffer(fs_info, search);
+   eb = find_extent_buffer(fs_info->eb_info, search);
if (eb) {
free_extent_buffer(eb);
return;
@@ -2319,7 +2321,7 @@ static noinline void reada_for_balance(struct btrfs_fs_info *fs_info,
if (slot > 0) {
block1 = btrfs_node_blockptr(parent, slot - 1);
gen = btrfs_node_ptr_generation(parent, slot - 1);
-   eb = find_extent_buffer(fs_info, block1);
+   eb = find_extent_buffer(fs_info->eb_info, 

[PATCH v2 10/11] btrfs: rework end io for extent buffer reads

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Now that the only thing keeping an eb alive is io_pages and its refcount,
we need to hold the eb ref for the entire end io call so it doesn't get
removed out from underneath us.  The hooks also make no sense for us now,
so rework this to be cleaner.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c   | 63 -
 fs/btrfs/disk-io.h   |  1 +
 fs/btrfs/extent_io.c | 66 +++-
 3 files changed, 40 insertions(+), 90 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9d69e181942..1a890f8c78c8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -755,33 +755,13 @@ static int check_node(struct btrfs_root *root, struct extent_buffer *node)
return ret;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
- u64 phy_offset, struct page *page,
- u64 start, u64 end, int mirror)
+int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror)
 {
+   struct btrfs_fs_info *fs_info = eb->eb_info->fs_info;
+   struct btrfs_root *root = fs_info->tree_root;
u64 found_start;
int found_level;
-   struct extent_buffer *eb;
-   struct btrfs_root *root;
-   struct btrfs_fs_info *fs_info;
int ret = 0;
-   int reads_done;
-
-   if (!page->private)
-   goto out;
-
-   eb = (struct extent_buffer *)page->private;
-
-   /* the pending IO might have been the only thing that kept this buffer
-* in memory.  Make sure we have a ref for all this other checks
-*/
-   extent_buffer_get(eb);
-   fs_info = eb->eb_info->fs_info;
-   root = fs_info->tree_root;
-
-   reads_done = atomic_dec_and_test(&eb->io_pages);
-   if (!reads_done)
-   goto err;
 
eb->read_mirror = mirror;
if (test_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags)) {
@@ -833,45 +813,14 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
if (!ret)
set_extent_buffer_uptodate(eb);
 err:
-   if (reads_done &&
-   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
btree_readahead_hook(eb, ret);
 
-   if (ret) {
-   /*
-* our io error hook is going to dec the io pages
-* again, we have to make sure it has something
-* to decrement.
-*
-* TODO: Kill this, we've re-arranged how this works now so we
-* don't need to do this io_pages dance.
-*/
-   atomic_inc(&eb->io_pages);
+   if (ret)
clear_extent_buffer_uptodate(eb);
-   }
-   if (reads_done) {
-   clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
-   smp_mb__after_atomic();
-   wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
-   }
-   free_extent_buffer(eb);
-out:
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb->io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(eb, -EIO);
-   return -EIO;/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio)
 {
struct btrfs_end_io_wq *end_io_wq = bio->bi_private;
@@ -4554,9 +4503,7 @@ static int btree_merge_bio_hook(struct page *page, unsigned long offset,
 static const struct extent_io_ops btree_extent_io_ops = {
/* mandatory callbacks */
.submit_bio_hook = btree_submit_bio_hook,
-   .readpage_end_io_hook = btree_readpage_end_io_hook,
.merge_bio_hook = btree_merge_bio_hook,
-   .readpage_io_failed_hook = btree_io_failed_hook,
.set_range_writeback = btrfs_set_range_writeback,
.tree_fs_info = btree_fs_info,
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 7f7c35d6347a..e1f4fef91547 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -152,6 +152,7 @@ int btree_lock_page_hook(struct page *page, void *data,
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
+int btrfs_extent_buffer_end_read(struct extent_buffer *eb, int mirror);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 void btrfs_init_lockdep(void);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bb10dc6f4e41..e11372455fb0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "disk-io.h"
 
 static struct kmem_cache 

[PATCH v2 08/11] export radix_tree_iter_tag_set

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

We use this in btrfs for metadata writeback.

Acked-by: Matthew Wilcox 
Signed-off-by: Josef Bacik 
---
 lib/radix-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8b1feca1230a..0c1cde9fcb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1459,6 +1459,7 @@ void radix_tree_iter_tag_set(struct radix_tree_root *root,
 {
node_tag_set(root, iter->node, tag, iter_offset(iter));
 }
+EXPORT_SYMBOL(radix_tree_iter_tag_set);
 
 static void node_tag_clear(struct radix_tree_root *root,
struct radix_tree_node *node,
-- 
2.7.5



[PATCH v2 01/11] remove mapping from balance_dirty_pages*()

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

The only reason we pass in the mapping is to get the inode in order to see if
writeback cgroup support is enabled, and even then it only checks the bdi and a
super block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using the mapping.  This will allow us to still
use balance_dirty_pages for dirty metadata pages that are not backed by an
address_space.
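The call-site conversion can be sketched with stub types: everything the old mapping-based API consumed is derivable from the inode, so passing (bdi, sb) directly lets callers without an address_space participate too. All struct layouts and the return value below are simplifications invented for the sketch:

```c
#include <assert.h>

/* Minimal stand-ins for the kernel types involved. */
struct backing_dev_info { int id; };
struct super_block { struct backing_dev_info *s_bdi; };
struct inode { struct super_block *i_sb; };
struct address_space { struct inode *host; };

static struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
	return inode->i_sb->s_bdi;
}

/* New-style entry point: no address_space required. */
static int balance_dirty_model(struct backing_dev_info *bdi,
			       struct super_block *sb)
{
	return bdi == sb->s_bdi;	/* stands in for the throttling work */
}

/* An old mapping-based caller converts mechanically, as in the diff. */
static int old_style_caller(struct address_space *mapping)
{
	struct inode *inode = mapping->host;

	return balance_dirty_model(inode_to_bdi(inode), inode->i_sb);
}

static int conversion_demo(void)
{
	struct backing_dev_info bdi = { 1 };
	struct super_block sb = { &bdi };
	struct inode inode = { &sb };
	struct address_space mapping = { &inode };

	return old_style_caller(&mapping);
}
```

The design point is that the (bdi, sb) pair is strictly less information than the mapping, so no caller loses anything in the conversion.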

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
---
 drivers/mtd/devices/block2mtd.c | 12 
 fs/btrfs/disk-io.c  |  3 ++-
 fs/btrfs/file.c |  3 ++-
 fs/btrfs/ioctl.c|  3 ++-
 fs/btrfs/relocation.c   |  3 ++-
 fs/buffer.c |  3 ++-
 fs/iomap.c  |  6 --
 fs/ntfs/attrib.c| 11 ---
 fs/ntfs/file.c  |  4 ++--
 include/linux/backing-dev.h | 29 +++--
 include/linux/writeback.h   |  4 +++-
 mm/filemap.c|  4 +++-
 mm/memory.c |  5 -
 mm/page-writeback.c | 15 +++
 14 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f111a7d..7892d0b9fcb0 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
struct page *page;
int index = to >> PAGE_SHIFT;   // page index
int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
memset(page_address(page), 0xff, PAGE_SIZE);
set_page_dirty(page);
unlock_page(page);
-			balance_dirty_pages_ratelimited(mapping);
+			balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+							inode->i_sb);
break;
}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
loff_t to, size_t len, size_t *retlen)
 {
struct page *page;
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
int index = to >> PAGE_SHIFT;   // page index
int offset = to & ~PAGE_MASK;   // page offset
int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
memcpy(page_address(page) + offset, buf, cpylen);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
}
put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 689b9913ccb5..8b6df7688d52 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4150,7 +4150,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
	ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
				     BTRFS_DIRTY_METADATA_THRESH);
	if (ret > 0) {
-		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
+		balance_dirty_pages_ratelimited(fs_info->sb->s_bdi,
+						fs_info->sb);
}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ab1c38f2dd8c..4bc6cd6509be 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1779,7 +1779,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
cond_resched();
 
-   balance_dirty_pages_ratelimited(inode->i_mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1)
btrfs_btree_balance_dirty(fs_info);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6a07d4e12fd2..ec92fb5e2b51 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1368,7 +1368,8 @@ int 

[PATCH v2 03/11] lib: make the fprop batch size a multiple of PAGE_SIZE

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

We are converting the writeback counters to use bytes instead of pages,
so we need to make the batch size for the percpu modifications align
properly with the new units.  Since we used pages before, just multiply
by PAGE_SIZE to get the equivalent bytes for the batch size.
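The scaling is mechanical: the new batch is exactly PAGE_SIZE times the old one, so a batch that used to absorb N page-sized events now absorbs the same N events expressed in bytes. A sketch (PAGE_SIZE assumed 4096; `ilog2_ul` is a stand-in for the kernel's `ilog2()`):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* ilog2() stand-in: floor(log2(n)) for n > 0. */
static unsigned long ilog2_ul(unsigned long n)
{
	unsigned long r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Old batch, counted in pages. */
static unsigned long prop_batch_old(unsigned long nr_cpu_ids)
{
	return 8 * (1 + ilog2_ul(nr_cpu_ids));
}

/* New batch, counted in bytes: the same expression times PAGE_SIZE. */
static unsigned long prop_batch_new(unsigned long nr_cpu_ids)
{
	return 8 * PAGE_SIZE * (1 + ilog2_ul(nr_cpu_ids));
}
```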

Signed-off-by: Josef Bacik 
---
 lib/flex_proportions.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
index 2cc1f94e03a1..b0343ae71f5e 100644
--- a/lib/flex_proportions.c
+++ b/lib/flex_proportions.c
@@ -166,7 +166,7 @@ void fprop_fraction_single(struct fprop_global *p,
 /*
  *  PERCPU 
  */
-#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
+#define PROP_BATCH (8*PAGE_SIZE*(1+ilog2(nr_cpu_ids)))
 
 int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
 {
-- 
2.7.5



[PATCH v2 00/11] Metadata specific accounting and dirty writeout

2017-11-22 Thread Josef Bacik
These patches support metadata-specific accounting and dirty handling
in a generic way.  For dirty metadata, ext4 and xfs are currently limited by
their journal size, which allows them to handle dirty metadata flushing in a
relatively easy way.  Btrfs does not have this limiting factor; we can have as
much dirty metadata on the system as we have memory, so we have a dummy inode
that all of our metadata pages are allocated from so we can call
balance_dirty_pages() on it and make sure we don't overwhelm the system with
dirty metadata pages.

The problem with this is it severely limits our ability to do things like
support sub-pagesize blocksizes.  Btrfs also supports metadata blocksizes > page
size, which makes keeping track of our metadata and it's pages particularly
tricky.  We have the inode mapping with our pages, and we have another radix
tree for our actual metadata buffers.  This double accounting leads to some fun
shenanigans around reclaim and evicting pages we know we are done using.

To solve this we would like to switch to a scheme like xfs has, where we simply
have our metadata structures tied into the slab shrinking code, and we just use
alloc_page() for our pages, or kmalloc() when we add sub-pagesize blocksizes.
In order to do this we need infrastructure in place to make sure we still don't
overwhelm the system with dirty metadata pages.

Enter these patches.  Because metadata is tracked on a non-pagesize amount we
need to convert a bunch of our existing counters to bytes.  From there I've
added various counters for metadata, to keep track of overall metadata bytes,
how many are dirty and how many are under writeback.  I've added a super
operation to handle the dirty writeback, which is going to be handled mostly
inside the fs since we will need a little more smarts around what we writeback.

The last three patches are just there to show how we use the infrastructure in
the first 8 patches.  The actual kill-btree_inode patch is pretty big;
unfortunately, ripping out all of the pagecache-based handling and replacing it
with the new infrastructure has to be done whole-hog and can't be broken up
any more than it already has been without making it un-bisectable.

Thanks,

Josef


[PATCH v2 02/11] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what the units are, as long as they are consistent between
the two of them, so convert them to count bytes written/dirtied, and allow the
metadata accounting stuff to change the counters as well.
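The claim that only consistency matters can be checked directly: the bandwidth and throttling math only ever compares the two counters to each other, and any ratio of them is unchanged when both are scaled from pages to bytes. A sketch (PAGE_SIZE assumed 4096; `ratio_x1000` is invented for the illustration):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Fixed-point ratio written/dirtied, scaled by 1000.  Any comparison the
 * writeback code makes between the two counters has this shape. */
static unsigned long ratio_x1000(unsigned long written, unsigned long dirtied)
{
	return dirtied ? (written * 1000) / dirtied : 0;
}
```

Scaling both inputs by PAGE_SIZE leaves the ratio untouched, which is why the unit change is safe.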

Signed-off-by: Josef Bacik 
Acked-by: Tejun Heo 
---
 fs/fuse/file.c   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h  |  2 +-
 mm/backing-dev.c |  9 +
 mm/page-writeback.c  | 20 ++--
 5 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cb7dff5c45d7..67e7c4fac28d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1471,7 +1471,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
	for (i = 0; i < req->num_pages; i++) {
		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		wb_writeout_inc(&bdi->wb);
+		wb_writeout_add(&bdi->wb, PAGE_SIZE);
	}
	wake_up(&fi->page_waitq);
}
@@ -1776,7 +1776,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,

	dec_wb_stat(&bdi->wb, WB_WRITEBACK);
	dec_node_page_state(page, NR_WRITEBACK_TEMP);
-	wb_writeout_inc(&bdi->wb);
+	wb_writeout_add(&bdi->wb, PAGE_SIZE);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..ded45ac2cec7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,8 +36,8 @@ typedef int (congested_fn)(void *, int);
 enum wb_stat_item {
WB_RECLAIMABLE,
WB_WRITEBACK,
-   WB_DIRTIED,
-   WB_WRITTEN,
+   WB_DIRTIED_BYTES,
+   WB_WRITTEN_BYTES,
NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 14e266d12620..39b8dc486ea7 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -89,7 +89,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
	return percpu_counter_sum_positive(&wb->stat[item]);
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_add(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e19606bb41a0..62a332a91b38 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -68,14 +68,15 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
wb_thresh = wb_calc_thresh(wb, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
seq_printf(m,
   "BdiWriteback:   %10lu kB\n"
   "BdiReclaimable: %10lu kB\n"
   "BdiDirtyThresh: %10lu kB\n"
   "DirtyThresh:%10lu kB\n"
   "BackgroundThresh:   %10lu kB\n"
-  "BdiDirtied: %10lu kB\n"
-  "BdiWritten: %10lu kB\n"
+  "BdiDirtiedBytes:%10lu kB\n"
+  "BdiWrittenBytes:%10lu kB\n"
   "BdiWriteBandwidth:  %10lu kBps\n"
   "b_dirty:%10lu\n"
   "b_io:   %10lu\n"
@@ -88,8 +89,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
   K(wb_thresh),
   K(dirty_thresh),
   K(background_thresh),
-  (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-  (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+  (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
+  (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
   (unsigned long) K(wb->write_bandwidth),
   nr_dirty,
   nr_io,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1a47d4296750..e4563645749a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -597,11 +597,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
  * Increment @wb's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
  */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_add(struct bdi_writeback *wb, long bytes)
 {
struct wb_domain *cgdom;
 
-   inc_wb_stat(wb, WB_WRITTEN);
+   __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
wb_domain_writeout_inc(_wb_domain, >completions,
   wb->bdi->max_prop_frac);
 
@@ -611,15 +611,15 @@ static inline void __wb_writeout_inc(struct bdi_writeback 
*wb)
  

[PATCH v2 05/11] writeback: convert the flexible prop stuff to bytes

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

The flexible proportions were all page based, but now that we are doing
metadata writeout that can be smaller or larger than page size we need
to account for this in bytes instead of number of pages.

Signed-off-by: Josef Bacik 
---
 mm/page-writeback.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e4563645749a..2a1994194cc1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -574,11 +574,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
return cur_time;
 }
 
-static void wb_domain_writeout_inc(struct wb_domain *dom,
+static void wb_domain_writeout_add(struct wb_domain *dom,
   struct fprop_local_percpu *completions,
-  unsigned int max_prop_frac)
+  long bytes, unsigned int max_prop_frac)
 {
-   __fprop_inc_percpu_max(>completions, completions,
+   __fprop_add_percpu_max(>completions, completions, bytes,
   max_prop_frac);
/* First event after period switching was turned off? */
if (unlikely(!dom->period_time)) {
@@ -602,12 +602,12 @@ static inline void __wb_writeout_add(struct bdi_writeback 
*wb, long bytes)
struct wb_domain *cgdom;
 
__add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
-   wb_domain_writeout_inc(_wb_domain, >completions,
+   wb_domain_writeout_add(_wb_domain, >completions, bytes,
   wb->bdi->max_prop_frac);
 
cgdom = mem_cgroup_wb_domain(wb);
if (cgdom)
-   wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
+   wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), bytes,
   wb->bdi->max_prop_frac);
 }
 
-- 
2.7.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 07/11] writeback: introduce super_operations->write_metadata

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
Reviewed-by: Tejun Heo 
---
 fs/fs-writeback.c| 72 
 fs/super.c   |  6 
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h   |  4 +++
 mm/backing-dev.c |  2 ++
 5 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 987448ed7698..fba703dff678 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback 
*wb,
return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+   struct writeback_control wbc = {
+   .sync_mode  = work->sync_mode,
+   .tagged_writepages  = work->tagged_writepages,
+   .for_kupdate= work->for_kupdate,
+   .for_background = work->for_background,
+   .for_sync   = work->for_sync,
+   .range_cyclic   = work->range_cyclic,
+   .range_start= 0,
+   .range_end  = LLONG_MAX,
+   };
+   long write_chunk;
+
+   write_chunk = writeback_chunk_size(wb, work);
+   wbc.nr_to_write = write_chunk;
+   sb->s_op->write_metadata(sb, );
+   work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+   return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1505,6 +1530,7 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   bool done = false;
 
while (!list_empty(>b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1621,12 +1647,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 * background threshold and other termination conditions.
 */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+   if (!done && sb->s_op->write_metadata) {
+   spin_unlock(>list_lock);
+   wrote += writeback_sb_metadata(sb, wb, work);
+   spin_lock(>list_lock);
+   }
return wrote;
 }
 
@@ -1635,6 +1667,7 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
+   bool done = false;
 
while (!list_empty(>b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1654,12 +1687,39 @@ static long __writeback_inodes_wb(struct bdi_writeback 
*wb,
 
/* refer to the same tests at the end of writeback_sb_inodes */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+
+   if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+   LIST_HEAD(list);
+
+   spin_unlock(>list_lock);
+   spin_lock(>bdi->sb_list_lock);
+   list_splice_init(>bdi->dirty_sb_list, );
+   while (!list_empty()) {
+   struct super_block *sb;
+
+   sb = list_first_entry(, struct super_block,
+ s_bdi_dirty_list);
+   list_move_tail(>s_bdi_dirty_list,
+  >bdi->dirty_sb_list);
+   if 

[PATCH v2 11/11] btrfs: add NR_METADATA_BYTES accounting

2017-11-22 Thread Josef Bacik
From: Josef Bacik 

Now that we have these counters, account for the private pages we
allocate in NR_METADATA_BYTES.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent_io.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e11372455fb0..7536352f424d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4802,6 +4802,8 @@ static void btrfs_release_extent_buffer_page(struct 
extent_buffer *eb)
ClearPagePrivate(page);
set_page_private(page, 0);
 
+   mod_node_page_state(page_pgdat(page), NR_METADATA_BYTES,
+   -(long)PAGE_SIZE);
/* Once for the page private. */
put_page(page);
 
@@ -5081,6 +5083,8 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
goto free_eb;
}
attach_extent_buffer_page(eb, p);
+   mod_node_page_state(page_pgdat(p), NR_METADATA_BYTES,
+   PAGE_SIZE);
eb->pages[i] = p;
}
 again:
-- 
2.7.5



notification about corrupt files from "btrfs scrub" in cron

2017-11-22 Thread ST
Hello,

I have the following cron job to scrub the entire root filesystem (ca.
7.2TB total, 2.3TB of it used) once a week:
/bin/btrfs scrub start -r / > /dev/null

Such scrubbing takes ca. 2 hours. How should I get notified that a
corrupt file was discovered? Does this command return some error code
back to cron so it can send an email as usual? Will cron wait 2 hours to
get that code?

I tried that command once without "> /dev/null" but got no email
notification about the results (even though the check was OK) - why?

Thank you.
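
One common pattern is sketched below: run the scrub in the foreground with -B so the cron job lasts until the scrub finishes (cron simply waits, then mails any output the job printed). This is an untested sketch; the wrapper's `scrub_has_errors` helper and the "no errors found" wording are assumptions that may differ across btrfs-progs versions, so check your `btrfs scrub status` output first.

```shell
#!/bin/sh
# Hypothetical cron wrapper (a sketch, not tested against every
# btrfs-progs version).  -B keeps scrub in the foreground, so the cron
# job only ends when the scrub does; cron then mails whatever the job
# printed.  The "no errors found" pattern matches current btrfs-progs
# output but may differ on other versions.

scrub_has_errors() {
	# read `btrfs scrub status` output on stdin; succeed (exit 0)
	# unless it reports a clean run
	! grep -q 'no errors found'
}

run_scrub() {	# usage: run_scrub <mountpoint>
	btrfs scrub start -B -r "$1" >/dev/null 2>&1
	rc=$?
	if [ "$rc" -ne 0 ] || btrfs scrub status "$1" | scrub_has_errors; then
		echo "btrfs scrub on $1 reported problems (exit code $rc):"
		btrfs scrub status "$1"
	fi
}

# run_scrub /	# uncomment this line in the actual cron job
```

Since whether the exit code of a blocking scrub reflects found errors varies by version, the wrapper checks both the exit code and the status output.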



Re: [PATCH] btrfs: switch uuid tree semaphore to mutex

2017-11-22 Thread David Sterba
On Mon, Nov 06, 2017 at 07:24:02PM +0100, David Sterba wrote:
> The uuid_tree_rescan_sem is used as a mutex (initialized with value 1
> and with at most one active user), no reason to obscure that as a
> semaphore.
> 
> Signed-off-by: David Sterba 

[  136.287143] [ cut here ]
[  136.291962] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  136.292033] WARNING: CPU: 0 PID: 1762 at kernel/locking/lockdep.c:3766 
lock_release+0x253/0x3a0
[  136.305245] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 
dns_resolver nfs lockd grace sunrpc fscache af_packet br_netfilter bridge stp 
llc iscsi_ibft iscsi_boot_sysfs btrfs xor zstd_decompress zstd_c
ompress xxhash zlib_deflate i2c_algo_bit raid6_pq drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops dm_mod ttm dax drm tg3 kvm_amd ptp pps_core 
tpm_infineon libphy kvm mptctl shpchp tpm_tis tpm_tis_c
ore i2c_piix4 tpm acpi_cpufreq irqbypass k10temp pcspkr button ext4 mbcache 
jbd2 sr_mod cdrom ohci_pci ehci_pci ohci_hcd ehci_hcd mptsas scsi_transport_sas 
mptscsih serio_raw ata_generic mptbase sata_svw usbcor
e pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  136.366011] CPU: 0 PID: 1762 Comm: btrfs-uuid Not tainted 
4.14.0-1.ge195904-vanilla+ #91
[  136.374300] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[  136.380954] task: 895de4e45140 task.stack: abc901ea8000
[  136.387005] RIP: 0010:lock_release+0x253/0x3a0
[  136.391574] RSP: 0018:abc901eabc80 EFLAGS: 00010082
[  136.396932] RAX: 001f RBX: 895de4e45140 RCX: 895de4e45140
[  136.404189] RDX: 895de4e45140 RSI: 0001 RDI: 8b0edca0
[  136.411473] RBP: 895dd4723470 R08:  R09: 
[  136.418725] R10:  R11: 0001 R12: 0282
[  136.425987] R13: c103b852 R14:  R15: 
[  136.433259] FS:  () GS:895defc0() 
knlGS:
[  136.441543] CS:  0010 DS:  ES:  CR0: 80050033
[  136.447408] CR2: 7f0f708bf000 CR3: 00022290f000 CR4: 06f0
[  136.454682] Call Trace:
[  136.457320]  ? btrfs_search_forward+0x168/0x320 [btrfs]
[  136.462682]  __mutex_unlock_slowpath+0x3b/0x260
[  136.467368]  btrfs_uuid_scan_kthread+0x242/0x350 [btrfs]
[  136.472849]  kthread+0x13e/0x170
[  136.476241]  ? get_chunk_map+0xc0/0xc0 [btrfs]
[  136.480821]  ? kthread_stop+0x300/0x300
[  136.484786]  ret_from_fork+0x24/0x30
[  136.507678] ---[ end trace cd10910755287819 ]---

This apparently relies on the semantics of a semaphore, where up and down do not
pair the same way as is required for a mutex. I'm going to drop this
patch and eventually replace it with a clearer waiting logic.


[josef-btrfs:kill-btree-inode 33/33] fs/btrfs/extent_io.c:4806:9: warning: overflow in implicit constant conversion

2017-11-22 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
kill-btree-inode
head:   6dbc29989846e20ef73ae6b9abedad570706c019
commit: 6dbc29989846e20ef73ae6b9abedad570706c019 [33/33] btrfs: add 
NR_METADATA_BYTES accounting
config: x86_64-randconfig-x000-201747 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
git checkout 6dbc29989846e20ef73ae6b9abedad570706c019
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   fs/btrfs/extent_io.c: In function 'btrfs_release_extent_buffer_page':
   fs/btrfs/extent_io.c:4805:34: error: 'p' undeclared (first use in this 
function)
  mod_node_page_state(page_pgdat(p), NR_METADATA_BYTES,
 ^
   fs/btrfs/extent_io.c:4805:34: note: each undeclared identifier is reported 
only once for each function it appears in
>> fs/btrfs/extent_io.c:4806:9: warning: overflow in implicit constant 
>> conversion [-Woverflow]
-PAGE_SIZE);
^
   In file included from arch/x86/include/asm/bitops.h:15:0,
from include/linux/bitops.h:37,
from fs/btrfs/extent_io.c:1:
   fs/btrfs/extent_io.c: At top level:
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'strcpy' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:421:2: note: in expansion of macro 'if'
 if (p_size == (size_t)-1 && q_size == (size_t)-1)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'kmemdup' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:411:2: note: in expansion of macro 'if'
 if (p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'kmemdup' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:409:2: note: in expansion of macro 'if'
 if (__builtin_constant_p(size) && p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'memchr_inv' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:400:2: note: in expansion of macro 'if'
 if (p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'memchr_inv' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:398:2: note: in expansion of macro 'if'
 if (__builtin_constant_p(size) && p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'memchr' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:389:2: note: in expansion of macro 'if'
 if (p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'memchr' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:387:2: note: in expansion of macro 'if'
 if (__builtin_constant_p(size) && p_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 'memcmp' which is not static
   __f = { \
   ^
   include/linux/compiler.h:154:23: note: in expansion of macro '__trace_if'
#define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
  ^~
   include/linux/string.h:379:2: note: in expansion of macro 'if'
 if (p_size < size || q_size < size)
 ^~
   include/linux/compiler.h:162:4: warning: '__f' is static but declared in 
inline function 

btrfs restore corrupt file

2017-11-22 Thread Jorge Bastos
Hello,

While doing btrfs checksum testing I purposely corrupted a file and
got the expected I/O error when trying to copy it. I also tested btrfs
restore to see if I could recover a known corrupt file, and it did copy
it, but there was no checksum error or warning. I used btrfs restore -v.

Is this expected behavior, or should restore warn about checksum failures?

Kernel used was 4.13.13,  btrfs-progs v4.13.2

Thanks,
Jorge



Re: uncorrectable errors in Raid 10

2017-11-22 Thread Chris Murphy
On Wed, Nov 22, 2017 at 9:38 AM, Steffen Sindzinski  wrote:
> Hello,
>
> I did btrfs check --readonly on both disks without finding any error. To
> reconfirm I did a scrub again, which still found 2 uncorrectable errors.

Try --mode=lowmem option with btrfs-progs 4.3.3 or better 4.14. This
is a new implementation of btrfs check and sometimes it comes up with
different results. It's strange that there's only this error found by
scrub and not by btrfs check which should be fully checking all
metadata for sanity, and in the process it would surely hit a bad
checksum.


-- 
Chris Murphy


Re: mail notifications to root once quota limit reached

2017-11-22 Thread Chris Murphy
On Wed, Nov 22, 2017 at 8:36 AM, ST  wrote:
> Hello,
>
> is it possible to get mail notifications to root once the quota limit is
> reached? If not, should I file a feature request?

You should look at Nagios or OpenNMS, etc. That's where such
functionality belongs.
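
If a full monitoring system is overkill, a minimal cron-based sketch can approximate the notification: poll `btrfs qgroup show` and print (hence mail, via cron) any qgroup near its limit. The column positions and the `-re --raw` flags below are assumptions that vary across btrfs-progs versions, so verify against your own output before relying on it.

```shell
#!/bin/sh
# Hypothetical cron sketch -- not a replacement for real monitoring.
# Prints qgroups whose referenced bytes reach a percentage of their
# referenced limit; cron mails anything printed.  Field numbers assume
# `btrfs qgroup show -re --raw` prints: qgroupid rfer excl max_rfer
# max_excl -- check your btrfs-progs version first.

check_qgroups() {	# stdin: qgroup table; $1: threshold percent
	awk -v pct="$1" 'NR > 2 && $4 ~ /^[0-9]+$/ && $4 > 0 {
		if ($2 * 100 >= $4 * pct)
			printf "qgroup %s at %d%% of its limit\n", \
			       $1, $2 * 100 / $4
	}'
}

# btrfs qgroup show -re --raw / | check_qgroups 90	# real cron line
```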

-- 
Chris Murphy


mail notifications to root once quota limit reached

2017-11-22 Thread ST
Hello,

is it possible to get mail notifications to root once the quota limit is
reached? If not, should I file a feature request?

Thank you!



Re: [PATCH 3/7] Btrfs: retry write for non-raid56

2017-11-22 Thread Nikolay Borisov


On 22.11.2017 02:35, Liu Bo wrote:
> If the underlying protocol doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO.
> 
> In btrfs, read retry is handled in bio_readpage_error(), with the retry
> unit being page size.  For write retry, however, we're going to do it
> in a different way: as a write may consist of several writes onto
> different stripes, retry write needs to be done right after the IO on
> each stripe completes and reaches endio.
> 
> As write errors are really corner cases, performance impact is not
> considered.
> 
> This adds code to retry write on errors _sector by sector_ in order to
> find out which sector is really bad so that we can mark it somewhere.
> And since endio is running in interrupt context, the real retry work
> is scheduled in system_unbound_wq in order to get a normal context.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/volumes.c | 162 
> +
>  fs/btrfs/volumes.h |   3 +
>  2 files changed, 142 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 853f9ce..c11db0b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
>   }
>  }
>  
> -static void btrfs_end_bio(struct bio *bio)
> +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
>  {
>   struct btrfs_bio *bbio = bio->bi_private;
> - int is_orig_bio = 0;
> + unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> +
> + BUG_ON(stripe_index >= bbio->num_stripes);
> + return bbio->stripes[stripe_index].dev;
> +}
> +
> +/*
> + * return 1 if every sector retry returns successful.
> + * return 0 if one or more sector retries fails.
> + */
> +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> +{
> + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> + u64 sectors_to_write;
> + u64 offset;
> + u64 orig;
> + u64 unit;
> + u64 block_sectors;
> + int ok = 1;
> + struct bio *wbio;
> +
> + /* offset and unit are bytes aligned, not 512-bytes aligned. */
> + sectors_to_write = io_bio->iter.bi_size >> 9;
> + orig = io_bio->iter.bi_sector;
> + offset = 0;
> + block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> + unit = block_sectors;
> + ASSERT(unit == 1);
> +
> + while (1) {
> + if (!sectors_to_write)
> + break;
> + /*
> +  * LIUBO: I don't think unit > sectors_to_write could
> +  * happen, sectors_to_write should be aligned to PAGE_SIZE
> +  * which is > unit.  Just in case.
> +  */
> + if (unit > sectors_to_write) {
> + WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", 
> unit, sectors_to_write);
> + unit = sectors_to_write;
> + }
> +
> + /* write @unit bytes at @offset */
> + /* this would never fail, check btrfs_bio_clone(). */
> + wbio = btrfs_bio_clone(bio);
> + wbio->bi_opf = REQ_OP_WRITE;
> + wbio->bi_iter = io_bio->iter;
> +
> + bio_trim(wbio, offset, unit);
> + bio_copy_dev(wbio, bio);
> +
> + /* submit in sync way */
> + /*
> +  * LIUBO: There is an issue, if this bio is quite
> +  * large, say 1M or 2M, and sector size is just 512,
> +  * then this may take a while.
> +  *
> +  * May need to schedule the job to workqueue.
> +  */
> + if (submit_bio_wait(wbio) < 0) {
> + ok = 0 && ok;
> + /*
> +  * This is not correct if badblocks is enabled
> +  * as we need to record every bad sector by
> +  * trying sectors one by one.
> +  */
> + break;
> + }
> +
> + bio_put(wbio);
> + offset += unit;
> + sectors_to_write -= unit;
> + unit = block_sectors;
> + }
> + return ok;
> +}
> +
> +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> +{
> + if (bio->bi_status == BLK_STS_IOERR ||
> + bio->bi_status == BLK_STS_TARGET) {
> + if (dev->bdev) {
> + if (bio_data_dir(bio) == WRITE)
> + btrfs_dev_stat_inc(dev,
> +BTRFS_DEV_STAT_WRITE_ERRS);
> + else
> + btrfs_dev_stat_inc(dev,
> +BTRFS_DEV_STAT_READ_ERRS);
> + if (bio->bi_opf & REQ_PREFLUSH)
> + btrfs_dev_stat_inc(dev,
> +   

Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting

2017-11-22 Thread Holger Hoffstätte
On 11/21/17 23:22, Lionel Bouton wrote:
> Le 21/11/2017 à 23:04, Andy Leadbetter a écrit :
>> I have a 4 disk array on top of 120GB bcache setup, arranged as follows
> [...]
>> Upgraded today to 4.14.1 from their PPA and the
> 
> 4.14 and 4.14.1 have a nasty bug affecting bcache users. See for example
> :
> https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/

4.14.2 (just out as rc1) will have the fix.

-h


Re: [PATCH 06/10] writeback: add counters for metadata usage

2017-11-22 Thread Jan Kara
On Tue 14-11-17 16:56:52, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Btrfs has no bounds except memory on the amount of dirty memory that we have 
> in
> use for metadata.  Historically we have used a special inode so we could take
> advantage of the balance_dirty_pages throttling that comes with using 
> pagecache.
> However as we'd like to support different blocksizes it would be nice to not
> have to rely on pagecache, but still get the balance_dirty_pages throttling
> without having to do it ourselves.
> 
> So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> zone and bdi_writeback counters to keep track of how many bytes we have in
> flight for METADATA.  We need to count in bytes as blocksizes could be
> percentages of pagesize.  We simply convert the bytes to number of pages where
> it is needed for the throttling.
> 
> Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> pages used for metadata on the system.  This is also needed so things like 
> dirty
> throttling know that this is dirtyable memory as well and easily reclaimed.

NR_METADATA_BYTES never gets set in the patch set. Either remove this or
implement it properly. Also for memory reclaim properties we already have
NR_SLAB_RECLAIMABLE so you should make sure your metadata buffers are not
double accounted.

Another catch is that node and zone counters are kept in longs. So on
32-bit archs you will overflow the counters if number of metadata (or dirty
metadata) ever exceeds 2GB. That should be rare but still possible. Not
sure what the right answer to this is... Account in 512-byte units?

> @@ -1549,12 +1579,17 @@ static inline void wb_dirty_limits(struct 
> dirty_throttle_control *dtc)
>* deltas.
>*/
>   if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
> - wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
> - dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
> + wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE) +
> + (wb_stat_sum(wb, WB_METADATA_DIRTY_BYTES) >> 
> PAGE_SHIFT);
> + wb_writeback = wb_stat_sum(wb, WB_WRITEBACK) +
> + (wb_stat_sum(wb, WB_METADATA_WRITEBACK_BYTES) >> 
> PAGE_SHIFT);
>   } else {
> - wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
> - dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
> + wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE) +
> + (wb_stat(wb, WB_METADATA_DIRTY_BYTES) >> PAGE_SHIFT);
> + wb_writeback = wb_stat(wb, WB_WRITEBACK) +
> + (wb_stat(wb, WB_METADATA_WRITEBACK_BYTES) >> 
> PAGE_SHIFT);
>   }
> + dtc->wb_dirty = wb_reclaimable + wb_writeback;
>  }

Use BtoP here as well? You have it defined anyway...

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 13d711dd8776..415b003e475c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -225,7 +225,8 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data 
> *pgdat)
>  
>   nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
>node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> -  node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
> +  node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) +
> +  (node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT);
>  
>   if (get_nr_swap_pages() > 0)
>   nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +

This function never gets called in the current kernel. I'll send a patch to
remove it.

> @@ -3812,6 +3813,7 @@ static inline unsigned long 
> node_unmapped_file_pages(struct pglist_data *pgdat)
>  static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  {
>   unsigned long nr_pagecache_reclaimable;
> + unsigned long nr_metadata_reclaimable;
>   unsigned long delta = 0;
>  
>   /*
> @@ -3833,7 +3835,20 @@ static unsigned long node_pagecache_reclaimable(struct 
> pglist_data *pgdat)
>   if (unlikely(delta > nr_pagecache_reclaimable))
>   delta = nr_pagecache_reclaimable;
>  
> - return nr_pagecache_reclaimable - delta;
> + nr_metadata_reclaimable =
> + node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> + /*
> +  * We don't do writeout through the shrinkers so subtract any
> +  * dirty/writeback metadata bytes from the reclaimable count.
> +  */
> + if (nr_metadata_reclaimable) {
> + unsigned long unreclaimable =
> + node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
> + node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES);
> + unreclaimable >>= PAGE_SHIFT;
> + nr_metadata_reclaimable -= unreclaimable;
> + }
> + return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
>  }

So I've checked both places that use this function and I think they are fine
with the change. 

Re: [PATCH] btrfs: handle errors while updating refcounts in update_ref_for_cow

2017-11-22 Thread Qu Wenruo


On 2017年11月22日 02:58, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Since commit fb235dc06fa (btrfs: qgroup: Move half of the qgroup
> accounting time out of commit trans) the assumption that
> btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
> been false.  The qgroup operations call into btrfs_search_slot
> and friends and can now return the full spectrum of error codes.
> 
> Fortunately, the fix here is easy since update_ref_for_cow failing
> is already handled so we just need to bail early with the error
> code.
> 
> Fixes: fb235dc06fa (btrfs: qgroup: Move half of the qgroup accounting ...)
> Cc:  # v4.11+
> Signed-off-by: Jeff Mahoney 

Looks good.

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  fs/btrfs/ctree.c | 18 --
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 531e0a8645b0..1e74cf826532 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1032,14 +1032,17 @@ static noinline int update_ref_for_cow(struct 
> btrfs_trans_handle *trans,
>root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
>   !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
>   ret = btrfs_inc_ref(trans, root, buf, 1);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>  
>   if (root->root_key.objectid ==
>   BTRFS_TREE_RELOC_OBJECTID) {
>   ret = btrfs_dec_ref(trans, root, buf, 0);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>   ret = btrfs_inc_ref(trans, root, cow, 1);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>   }
>   new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
>   } else {
> @@ -1049,7 +1052,8 @@ static noinline int update_ref_for_cow(struct 
> btrfs_trans_handle *trans,
>   ret = btrfs_inc_ref(trans, root, cow, 1);
>   else
>   ret = btrfs_inc_ref(trans, root, cow, 0);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>   }
>   if (new_flags != 0) {
>   int level = btrfs_header_level(buf);
> @@ -1068,9 +1072,11 @@ static noinline int update_ref_for_cow(struct 
> btrfs_trans_handle *trans,
>   ret = btrfs_inc_ref(trans, root, cow, 1);
>   else
>   ret = btrfs_inc_ref(trans, root, cow, 0);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>   ret = btrfs_dec_ref(trans, root, buf, 1);
> - BUG_ON(ret); /* -ENOMEM */
> + if (ret)
> + return ret;
>   }
>   clean_tree_block(fs_info, buf);
>   *last_ref = 1;
> 





[PATCH 11/11] btrfs-progs: fsck-tests: Introduce test case with keyed data backref with shared tree blocks

2017-11-22 Thread Qu Wenruo
For snapshot tree blocks shared with the source subvolume, the keyed
backref counter only counts the exclusively owned references.

In the following case, 258 is a snapshot of 257, which inherits all the
references to this data extent.
--
item 4 key (12582912 EXTENT_ITEM 524288) itemoff 3741 itemsize 140
refs 179 gen 9 flags DATA
extent data backref root 257 objectid 258 offset 0 count 49
extent data backref root 257 objectid 257 offset 0 count 1
extent data backref root 256 objectid 258 offset 0 count 128
extent data backref root 256 objectid 257 offset 0 count 1
--

However lowmem mode used to iterate the whole inode to find all
references, without caring whether a reference is already counted by
the shared tree block.

This adds a test case to check it.

Signed-off-by: Qu Wenruo 
---
 .../keyed_data_ref_with_shared_leaf.img | Bin 0 -> 19456 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img

diff --git 
a/tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img 
b/tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img
new file mode 100644
index 
..2ce5068f01c693e577de8b9a536fdeb360029e5f
GIT binary patch
literal 19456
zcmeIZby!qi_c%IqrzoHxp-8usbaw~{2q>i>jevj>LkLJHC?E*Z-5?;%01|@IAkra7
zNq5Zbdj|Zx@AuyOzR&^YvGv}fegIR8Qb
z65U9Si9W){(1$b`^bz(FeIS>C@L;e^^r86zc0t3l7qBfZ1k1>tX#2
zK817m9UcUiDV@VlArQu=d=3Ym!z$;nJsLJwJ%^dltZZGt=QuiQ=j~%uP

[PATCH 04/11] btrfs-progs: lowmem check: Fix false backref lost warning for keyed extent data ref

2017-11-22 Thread Qu Wenruo
For a keyed extent ref, its offset is the calculated offset (file offset -
file extent offset), just like for an inlined extent data ref.

However the code was using the file offset to hash the extent data ref
offset, causing false backref lost warnings like:
--
ERROR: data extent[16913485824 7577600] backref lost
--

Fixes: b0d360b541f0 ("btrfs-progs: check: introduce function to check data 
backref in extent tree")
Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 9f9eb504ae8e..452e715bf245 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12126,7 +12126,7 @@ static int check_extent_data_item(struct btrfs_root 
*root,
dbref_key.objectid = btrfs_file_extent_disk_bytenr(eb, fi);
dbref_key.type = BTRFS_EXTENT_DATA_REF_KEY;
dbref_key.offset = hash_extent_data_ref(root->objectid,
-   fi_key.objectid, fi_key.offset);
+   fi_key.objectid, fi_key.offset - offset);
 
ret = btrfs_search_slot(NULL, root->fs_info->extent_root,
			&dbref_key, &path, 0, 0);
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/11] btrfs-progs: lowmem check: Fix inlined data extent ref lookup

2017-11-22 Thread Qu Wenruo
When lowmem fsck tries to find the backref of a specified file extent, it
searches inlined data refs first.

However, an extent data ref contains the owner root objectid, the inode
number and the calculated offset (file offset - extent offset).

The code only checks the owner root objectid, not the inode number nor
the calculated offset.

This makes lowmem mode fail to detect any backref mismatch if there is
an inlined data ref with the same owner objectid.

Fix it by also checking extent data ref's objectid and offset.

Fixes: b0d360b541f0 ("btrfs-progs: check: introduce function to check data 
backref in extent tree")
Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 03ff89a4221c..9f9eb504ae8e 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12005,11 +12005,11 @@ static int check_extent_data_item(struct btrfs_root 
*root,
u64 disk_num_bytes;
u64 extent_num_bytes;
u64 extent_flags;
+   u64 offset;
u32 item_size;
unsigned long end;
unsigned long ptr;
int type;
-   u64 ref_root;
int found_dbackref = 0;
int slot = pathp->slots[0];
int err = 0;
@@ -12027,6 +12027,7 @@ static int check_extent_data_item(struct btrfs_root 
*root,
disk_bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
disk_num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
extent_num_bytes = btrfs_file_extent_num_bytes(eb, fi);
+   offset = btrfs_file_extent_offset(eb, fi);
 
/* Check unaligned disk_num_bytes and num_bytes */
if (!IS_ALIGNED(disk_num_bytes, root->fs_info->sectorsize)) {
@@ -12081,6 +12082,11 @@ static int check_extent_data_item(struct btrfs_root 
*root,
strict = should_check_extent_strictly(root, nrefs, -1);
 
while (ptr < end) {
+   u64 ref_root;
+   u64 ref_objectid;
+   u64 ref_offset;
+   bool match = false;
+
iref = (struct btrfs_extent_inline_ref *)ptr;
type = btrfs_extent_inline_ref_type(leaf, iref);
		dref = (struct btrfs_extent_data_ref *)(&iref->offset);
@@ -12092,9 +12098,15 @@ static int check_extent_data_item(struct btrfs_root 
*root,
}
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
ref_root = btrfs_extent_data_ref_root(leaf, dref);
-   if (ref_root == root->objectid)
+   ref_objectid = btrfs_extent_data_ref_objectid(leaf, 
dref);
+   ref_offset = btrfs_extent_data_ref_offset(leaf, dref);
+
+   if (ref_objectid == fi_key.objectid &&
+   ref_offset == fi_key.offset - offset)
+   match = true;
+   if (ref_root == root->objectid && match)
found_dbackref = 1;
-   else if (!strict && owner == ref_root)
+   else if (!strict && owner == ref_root && match)
found_dbackref = 1;
} else if (type == BTRFS_SHARED_DATA_REF_KEY) {
found_dbackref = !check_tree_block_ref(root, NULL,
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/11] Lowmem mode btrfs fixes exposed by complex tree

2017-11-22 Thread Qu Wenruo
The patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/lowmem_fix

The patchset is mostly rebased onto v4.14, since there are some conflicts
with the lowmem repair enhancements from Su Yue.

However the lowmem repair enhancements from Su Yue caused one regression,
and the original lowmem code also has some problems handling tree reloc
trees.

The first 2 patches are newly introduced to address this.
The rest are not touched at all (except the conflicts).

As always, the 9th patch may not reach the mailing list due to its size.
Please use the github repo to fetch that file.

Qu Wenruo (11):
  btrfs-progs: lowmem check: Fix regression which screws up extent
allocator
  btrfs-progs: lowmem check: Fix NULL pointer access caused by large
tree reloc tree
  btrfs-progs: lowmem check: Fix inlined data extent ref lookup
  btrfs-progs: lowmem check: Fix false backref lost warning for keyed
extent data ref
  btrfs-progs: fsck-test: Introduce test case for false data extent
backref lost
  btrfs-progs: backref: Allow backref walk to handle direct parent ref
  btrfs-progs: lowmem check: Fix function call stack overflow caused by
wrong tree reloc tree detection
  btrfs-progs: lowmem check: Fix false alerts for image with shared
block ref only backref
  btrfs-progs: fsck-test: Add new image with shared block ref only
metadata backref
  btrfs-progs: lowmem check: Fix false alerts of referencer count
mismatch for snapshot
  btrfs-progs: fsck-tests: Introduce test case with keyed data backref
with shared tree blocks

 backref.c  |   3 +
 cmds-check.c   |  79 +++--
 .../020-extent-ref-cases/keyed_data_ref_only.img   | Bin 0 -> 4096 bytes
 .../keyed_data_ref_with_shared_leaf.img| Bin 0 -> 19456 bytes
 .../shared_block_ref_only.raw.xz   | Bin 0 -> 217204 bytes
 5 files changed, 61 insertions(+), 21 deletions(-)
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_only.img
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_with_shared_leaf.img
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/shared_block_ref_only.raw.xz

-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/11] btrfs-progs: lowmem check: Fix regression which screws up extent allocator

2017-11-22 Thread Qu Wenruo
[BUG]
Commit 723427d7e6b7 ("btrfs-progs: check: change the way lowmem mode traverses
metadata") introduces a regression which could make some fsck self test
cases fail.

For fsck test case 004-no-dir-item, btrfs check --mode=lowmem --repair
can cause BUG_ON() with ret = -17 (-EEXIST) when committing transaction.

The problem happens with the following backtrace:

./btrfs(+0x22045)[0x555d0dade045]
./btrfs(+0x2216f)[0x555d0dade16f]
./btrfs(+0x29df1)[0x555d0dae5df1]
./btrfs(+0x2a142)[0x555d0dae6142]
./btrfs(btrfs_alloc_free_block+0x78)[0x555d0dae6202]
./btrfs(__btrfs_cow_block+0x177)[0x555d0dad00a2]
./btrfs(btrfs_cow_block+0x116)[0x555d0dad05a8]
./btrfs(commit_tree_roots+0x91)[0x555d0db1fd4f]
./btrfs(btrfs_commit_transaction+0x18c)[0x555d0db20100]
./btrfs(btrfs_fix_super_size+0x190)[0x555d0db005a4]
./btrfs(btrfs_fix_device_and_super_size+0x177)[0x555d0db00771]
./btrfs(cmd_check+0x1757)[0x555d0db4f6ab]
./btrfs(main+0x138)[0x555d0dace5dd]
/usr/lib/libc.so.6(__libc_start_main+0xea)[0x7fa5e4613f6a]
./btrfs(_start+0x2a)[0x555d0dacddda]

The bug is triggered because the extent allocator considers the range
[29360128, 29376512) free and allocates it.
However when inserting the EXTENT_ITEM, btrfs finds there is already one
tree block there (the fs tree root), returning -EEXIST and causing the
later BUG_ON().

[CAUSE]
The cause is that in repair mode, lowmem check always pins all metadata
blocks.
However pinned metadata blocks will be unpinned when the transaction
commits, and will be marked as *FREE* space.

So later extent allocator will consider such range free and allocate
them wrongly.

[FIX]
Don't pin metadata blocks without a valid reason or preparation (like
discarding all free space caches to re-calculate free space on the next
write).

Fixes: 723427d7e6b7 ("btrfs-progs: check: change the way lowmem mode traverses 
metadata")
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index a93ac2c88a38..644ee084cb8e 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6623,8 +6623,6 @@ out:
return ret;
 }
 
-static int pin_metadata_blocks(struct btrfs_fs_info *fs_info);
-
 /*
  * Iterate all items in the tree and call check_inode_item() to check.
  *
@@ -13306,8 +13304,6 @@ out:
return err;
 }
 
-static int pin_metadata_blocks(struct btrfs_fs_info *fs_info);
-
 /*
  * Low memory usage version check_chunks_and_extents.
  */
@@ -13326,12 +13322,6 @@ static int check_chunks_and_extents_v2(struct 
btrfs_fs_info *fs_info)
root = fs_info->fs_root;
 
if (repair) {
-   /* pin every tree block to avoid extent overwrite */
-   ret = pin_metadata_blocks(fs_info);
-   if (ret) {
-   error("failed to pin metadata blocks");
-   return ret;
-   }
trans = btrfs_start_transaction(fs_info->extent_root, 1);
if (IS_ERR(trans)) {
error("failed to start transaction before check");
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/11] btrfs-progs: lowmem check: Fix NULL pointer access caused by large tree reloc tree

2017-11-22 Thread Qu Wenruo
[BUG]
v4.14 btrfs-progs can't pass new self test image with large tree reloc
trees.
It will fail with later "shared_block_ref_only.raw.xz" test image with
NULL pointer access.

[CAUSE]
For an image with a higher (level >= 2) tree reloc tree, the ulist used
by need_check() will be empty, as tree reloc trees are not accounted in
btrfs_find_all_roots().
Then accessing ulist->roots with rb_first() will return a NULL pointer.

[FIX]
In need_check(), if @roots is empty, meaning it's a tree reloc tree,
always check the block.
Although this can be slow, at least it's safe: we won't skip any
possibly wrong tree block.

Fixes: 5e2dc770471b ("btrfs-progs: check: skip shared node or leaf check for 
low_memory mode")
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 644ee084cb8e..03ff89a4221c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2149,7 +2149,12 @@ static int need_check(struct btrfs_root *root, struct 
ulist *roots)
struct rb_node *node;
struct ulist_node *u;
 
-   if (roots->nnodes == 1)
+   /*
+* @roots can be empty if it belongs to tree reloc tree
+* In that case, we should always check the leaf, as we can't use
+* the tree owner to ensure some other root will check it.
+*/
+   if (roots->nnodes == 1 || roots->nnodes == 0)
return 1;
 
	node = rb_first(&roots->root);
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/11] btrfs-progs: backref: Allow backref walk to handle direct parent ref

2017-11-22 Thread Qu Wenruo
[BUG]
Btrfs lowmem mode fails with the following ASSERT() on a certain valid
image.
--
backref.c:466: __add_missing_keys: Assertion `ref->root_id` failed, value 0
--

[REASON]
Lowmem mode uses btrfs_find_all_roots() when walking down fs trees.

However, for a tree block with only shared parent backrefs like the one
below, the backref code in btrfs-progs doesn't handle it correctly.
--
item 72 key (604653731840 METADATA_ITEM 0) itemoff 13379 itemsize 60
refs 4 gen 7198 flags TREE_BLOCK|FULL_BACKREF
tree block skinny level 0
shared block backref parent 604498477056
shared block backref parent 604498460672
shared block backref parent 604498444288
shared block backref parent 604498411520
--

Such a shared block ref is a *direct* ref, which means we don't need to
resolve its key, nor its rootid.

The objective of a backref walk is to find all direct parents until it
reaches the tree root.
So such a direct ref should be added to prefstate->pending, rather than
to prefstate->pending_missing_keys.

[FIX]
For a direct ref, add it to prefstate->pending directly to solve the
problem.

Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 backref.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/backref.c b/backref.c
index 8615f6b8677a..27309e07a1e9 100644
--- a/backref.c
+++ b/backref.c
@@ -206,6 +206,9 @@ static int __add_prelim_ref(struct pref_state *prefstate, 
u64 root_id,
if (key) {
ref->key_for_search = *key;
		head = &prefstate->pending;
+   } else if (parent) {
+   memset(&ref->key_for_search, 0, sizeof(ref->key_for_search));
+   head = &prefstate->pending;
	} else {
		memset(&ref->key_for_search, 0, sizeof(ref->key_for_search));
		head = &prefstate->pending_missing_keys;
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/11] btrfs-progs: lowmem check: Fix function call stack overflow caused by wrong tree reloc tree detection

2017-11-22 Thread Qu Wenruo
For a tree reloc tree root, its backref points to itself.
So for such a case, we should finish the lookup.

The previous end condition was to ensure it's a tree reloc tree *and*
that its root bytenr matches the bytenr passed in.

However the @root passed in can be another tree, e.g. another tree reloc
tree which shares the node/leaf.
This makes any check based on the @root passed in invalid.

The patch removes the unreliable root objectid detection, and only uses
the root->bytenr check.
As for the possibility of an invalid self-pointing backref, the extent
tree checker should have already handled it, so we don't need to bother
in the fs tree checker.

Fixes: 54c8f9152fd9 ("btrfs-progs: check: Fix lowmem mode stack overflow caused 
by fsck/023")
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 452e715bf245..7eb08b6cb962 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11700,16 +11700,12 @@ static int check_tree_block_ref(struct btrfs_root 
*root,
u32 nodesize = root->fs_info->nodesize;
u32 item_size;
u64 offset;
-   int tree_reloc_root = 0;
int found_ref = 0;
int err = 0;
int ret;
int strict = 1;
int parent = 0;
 
-   if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID &&
-   btrfs_header_bytenr(root->node) == bytenr)
-   tree_reloc_root = 1;
	btrfs_init_path(&path);
key.objectid = bytenr;
if (btrfs_fs_incompat(root->fs_info, SKINNY_METADATA))
@@ -11817,8 +11813,12 @@ static int check_tree_block_ref(struct btrfs_root 
*root,
/*
 * Backref of tree reloc root points to itself, no need
 * to check backref any more.
+*
+* This may be an error of loop backref, but extent tree
+* checker should have already handled it.
+* Here we only need to avoid infinite iteration.
 */
-   if (tree_reloc_root) {
+   if (offset == bytenr) {
found_ref = 1;
} else {
/*
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/11] btrfs-progs: lowmem check: Fix false alerts for image with shared block ref only backref

2017-11-22 Thread Qu Wenruo
[BUG]
For an image with a metadata item that has only shared block refs, like:
--
item 66 key (21573632 METADATA_ITEM 0) itemoff 3971 itemsize 24
refs 66 gen 9 flags TREE_BLOCK|FULL_BACKREF
tree block skinny level 0
item 0 key (21573632 SHARED_BLOCK_REF 21676032) itemoff 3995 itemsize 0
shared block backref
item 1 key (21573632 SHARED_BLOCK_REF 21921792) itemoff 3995 itemsize 0
shared block backref
item 2 key (21573632 SHARED_BLOCK_REF 21995520) itemoff 3995 itemsize 0
shared block backref
item 3 key (21573632 SHARED_BLOCK_REF 22077440) itemoff 3995 itemsize 0
shared block backref
...
--

Lowmem mode check will report false alerts like:
--
ERROR: extent[21573632 4096] backref lost (owner: 256, level: 0)
--

[CAUSE]
In fact, the false alerts are not even from extent tree verification, but
from a fs tree helper which is designed to make sure there is some tree
block referring to the fs tree block.

The idea is to find an inlined tree backref, then a keyed
TREE_BLOCK_REF_KEY.
However it missed SHARED_BLOCK_REF_KEY, and caused such false alerts.

[FIX]
Add a SHARED_BLOCK_REF_KEY lookup to silence the warning.

Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 7eb08b6cb962..791fab6b3e6a 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11850,6 +11850,30 @@ static int check_tree_block_ref(struct btrfs_root 
*root,
if (!ret)
found_ref = 1;
}
+   /*
+* Finally check SHARED BLOCK REF, any found will be good
+* Here we're not doing comprehensive extent backref checking,
+* only need to ensure there is some extent referring to this
+* tree block.
+*/
+   if (!found_ref) {
+   btrfs_release_path(&path);
+   key.objectid = bytenr;
+   key.type = BTRFS_SHARED_BLOCK_REF_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, extent_root, &key, &path, 0, 0);
+   if (ret < 0) {
+   err |= BACKREF_MISSING;
+   goto out;
+   }
+   ret = btrfs_previous_extent_item(extent_root, &path, bytenr);
+   if (ret) {
+   err |= BACKREF_MISSING;
+   goto out;
+   }
+   found_ref = 1;
+   }
if (!found_ref)
err |= BACKREF_MISSING;
 out:
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/11] btrfs-progs: fsck-test: Introduce test case for false data extent backref lost

2017-11-22 Thread Qu Wenruo
Introduce a new test image, which has an extent item with no inlined
extent data refs, but only keyed extent data refs.

Only in this case can we trigger the false data extent backref lost bug
in lowmem mode.

Signed-off-by: Qu Wenruo 
---
 .../020-extent-ref-cases/keyed_data_ref_only.img | Bin 0 -> 4096 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 
tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_only.img

diff --git a/tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_only.img 
b/tests/fsck-tests/020-extent-ref-cases/keyed_data_ref_only.img
new file mode 100644
index 
..668589f46d330cb5d6ff10b108e62585875548ca
GIT binary patch
literal 4096

[PATCH 10/11] btrfs-progs: lowmem check: Fix false alerts of referencer count mismatch for snapshot

2017-11-22 Thread Qu Wenruo
Btrfs lowmem check reports such false alerts:
--
ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 827, 
owner: 73782, offset: 134217728) wanted: 4, have: 26
ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 818, 
owner: 73782, offset: 134217728) wanted: 4, have: 26
ERROR: extent[366498091008, 134217728] referencer count mismatch (root: 870, 
owner: 73782, offset: 134217728) wanted: 4, have: 26
--

While in extent tree, the extent has:
--
item 81 key (366498091008 EXTENT_ITEM 134217728) itemoff 9008 itemsize 
169
refs 39 gen 224 flags DATA
extent data backref root 827 objectid 73782 offset 134217728 
count 4
extent data backref root 818 objectid 73782 offset 134217728 
count 4
extent data backref root 259 objectid 73482 offset 134217728 
count 1
extent data backref root 644 objectid 73782 offset 134217728 
count 26
extent data backref root 870 objectid 73782 offset 134217728 
count 4
--

And in root 827, there is one leaf, owned by 827, with 4 references to
that extent:
--
leaf 714964992 items 68 free space 10019 generation 641 owner 827
leaf 714964992 flags 0x1(WRITTEN) backref revision 1
..
item 64 key (73782 EXTENT_DATA 134217728) itemoff 11878 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 0 nr 6410240 ram 134217728
extent compression 0 (none)
item 65 key (73782 EXTENT_DATA 140627968) itemoff 11825 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 6410240 nr 512 ram 134217728
extent compression 0 (none)
item 66 key (73782 EXTENT_DATA 145747968) itemoff 11772 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 11530240 nr 7675904 ram 134217728
extent compression 0 (none)
item 67 key (73782 EXTENT_DATA 153423872) itemoff 11719 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 19206144 nr 6397952 ram 134217728
extent compression 0 (none)
--

And starting from the next leaf, there are 22 references to the data extent:
--
leaf 894861312 items 208 free space 59 generation 261 owner 644
leaf 894861312 flags 0x1(WRITTEN) backref revision 1
item 0 key (73782 EXTENT_DATA 159821824) itemoff 16230 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 25604096 nr 8192 ram 134217728
extent compression 0 (none)
item 1 key (73782 EXTENT_DATA 159830016) itemoff 16177 itemsize 53
generation 224 type 1 (regular)
extent data disk byte 366498091008 nr 134217728
extent data offset 25612288 nr 7675904 ram 134217728
extent compression 0 (none)
..
--

However the next leaf is owned by another subvolume, normally (part of)
the snapshot source.

Fix it by also checking the leaf's owner before increasing the reference
counter.

Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/cmds-check.c b/cmds-check.c
index 791fab6b3e6a..e746ee7b281d 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12518,11 +12518,17 @@ static int check_extent_data_backref(struct 
btrfs_fs_info *fs_info,
 * Except normal disk bytenr and disk num bytes, we still
 * need to do extra check on dbackref offset as
 * dbackref offset = file_offset - file_extent_offset
+*
+* Also, we must check the leaf owner.
+* In case of shared tree blocks (snapshots) we can inherit
+* leaves from source snapshot.
+* In that case, reference from source snapshot should not
+* count.
 */
if (btrfs_file_extent_disk_bytenr(leaf, fi) == bytenr &&
btrfs_file_extent_disk_num_bytes(leaf, fi) == len &&
(u64)(key.offset - btrfs_file_extent_offset(leaf, fi)) ==
-   offset)
+   offset && btrfs_header_owner(leaf) == root_id)
found_count++;
 
 next:
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: quotas: failure on removing a file via SFTP/SSH

2017-11-22 Thread Qu Wenruo


On 2017年11月22日 16:32, ST wrote:
> 
> 
> 
> On Wed, 2017-11-22 at 08:39 +0800, Qu Wenruo wrote:
>>
>> On 2017年11月22日 05:00, ST wrote:
>>> On Tue, 2017-11-21 at 11:33 -0700, Chris Murphy wrote:
 On Tue, Nov 21, 2017 at 8:29 AM, ST  wrote:
> I'm trying to use quotas for a simple chrooted sftp setup, limiting
> space for each user's subvolume (now for testing to 1M).
>
> I tried to hit the limit by uploading files and once it comes to the
> limit I face following problem: if I try to free space by removing a
> file via Linux sftp client (or Filezilla) - I get error:
> "Couldn't delete file: Failure"
>
> Sometimes, but not always, if I repeat it for 3-5 times it does 
> removes
> the file at the end.
> If I login as root and try to remove the file via SSH I get the error:
> "rm: cannot remove 'example.txt': Disk quota exceeded"
>
> What is the problem? And how can I solve it?

 Kernel version first.

 If it's possible, please use latest kernel, at least newer than v4.10,
 since we have a lot of qgroup reservation related fixes in newer 
 kernel.

 Then, for small quota, due to the nature of btrfs metadata CoW and
 relative large default node size (16K), it's quite easy to hit disk
 quota for metadata.
>>>
>>> Yes, but why I get the error specifically on REMOVING a file? Even if I
>>> hit disk quota - if I free up space - it should be possible, isn't it?
>>
>> It's only true for fs modifying its metadata in-place (and use journal
>> to protect it).
>>
>> For fs using metadata CoW, even freeing space needs extra space for new
>> metadata.
>>
>
> Wait, it doesn't sound like a bug, but rather like a flaw in design.
> This means - each time a user hits his quota limit he will get stuck
> without being able to free space?!!

 It's a good question if quotas can make it possible for a user to get
 wedged into a situation that will require an admin to temporarily
 raise the quota in order to make file deletion possible.
>>>
>>> Why question? It's a fact. That's what I face right now.
>>>
  This is not a
 design flaw, all COW file systems *add* data when deleting. The
 challenge is how to teach the quota system to act like a hard limit
 for data writes that clearly bust the quota, versus a soft limit that
 tolerates some extra amount above the quota for the purpose of
 eventually deleting data. That's maybe non-trivial. It's not that it's
 a design flaw. Metadata can contain inline data, so how exactly to you
 tell what kinds of writes are permitted (deleting a file) and what
 kind of writes are not (append data to a file, or create new file)?

 But for sure the user space tools should prevent setting too low a
 quota limit. If the limit cannot be reasonably expected to work, it
 should be disallowed. So maybe the user space tools need to enforce a
 minimum quota, something like 100MiB, or whatever.

>>>
>>> Would you like to open an issue with your enhancement suggestions on the
>>> bug tracker so this case doesn't get forgotten?
>>
>> That's why I ask for the kernel version.
>>
>> IIRC in newer kernel, quota doesn't limit deletion anymore, preventing
>> you from hitting such dilemma.
> 
> I'm sorry. I've mentioned it in another mail in this thread, here it is:
> 
> I am on Debian 9 (stable), so kernel version is:
> uname -r
> 4.9.0-4-amd64

This seems to be from before the workaround for file deletion, but I
could also be wrong about it.

If you could try v4.13/14, it will be possible to verify the behavior.

Thanks,
Qu

> btrfs-tools (4.7.3-1)
> 
> I hope kernel 4.13 will move from Debian stable-backports to stable in
> some not so distant future. Is this issue already resolved in 4.13?
> 
> Thank you!
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 03/10] lib: add a batch size to fprop_global

2017-11-22 Thread Jan Kara
On Wed 22-11-17 09:47:16, Jan Kara wrote:
> On Tue 14-11-17 16:56:49, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > The flexible proportion stuff has been used to track how many pages we
> > are writing out over a period of time, so counts everything in single
> > increments.  If we wanted to use another base value we need to be able
> > to adjust the batch size to fit the units we'll be using for the
> > proportions.
> > 
> > Signed-off-by: Josef Bacik 
> 
> Frankly, I had to look into the code to understand what the patch is about.
> Can we rephrase the changelog like:
> 
> Currently flexible proportion code is using fixed per-cpu counter batch size
> since all the counters use only increment / decrement to track number of
> pages which completed writeback. When we start tracking amount of done
> writeback in different units, we need to update per-cpu counter batch size
> accordingly. Make counter batch size configurable on a per-proportion
> domain basis to allow for this.

Actually, now that I'm looking at other patches: Since fprop code is only
used for bdi writeback tracking, I guess there's no good reason to make
this configurable on a per-proportion basis. Just drop this patch and
bump PROP_BATCH in the following patch and we are done. Am I missing
something?

Honza

> > ---
> >  include/linux/flex_proportions.h |  4 +++-
> >  lib/flex_proportions.c   | 11 +--
> >  2 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/flex_proportions.h 
> > b/include/linux/flex_proportions.h
> > index 0d348e011a6e..853f4305d1b2 100644
> > --- a/include/linux/flex_proportions.h
> > +++ b/include/linux/flex_proportions.h
> > @@ -20,7 +20,7 @@
> >   */
> >  #define FPROP_FRAC_SHIFT 10
> >  #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT)
> > -
> > +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids)))
> >  /*
> >   *  Global proportion definitions 
> >   */
> > @@ -31,6 +31,8 @@ struct fprop_global {
> > unsigned int period;
> > /* Synchronization with period transitions */
> > seqcount_t sequence;
> > +   /* batch size */
> > +   s32 batch_size;
> >  };
> >  
> >  int fprop_global_init(struct fprop_global *p, gfp_t gfp);
> > diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> > index 2cc1f94e03a1..5552523b663a 100644
> > --- a/lib/flex_proportions.c
> > +++ b/lib/flex_proportions.c
> > @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp)
> > if (err)
> > return err;
> > seqcount_init(>sequence);
> > +   p->batch_size = FPROP_BATCH_SIZE;
> > return 0;
> >  }
> >  
> > @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p,
> >  /*
> >   *  PERCPU 
> >   */
> > -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
> > -
> >  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
> >  {
> > int err;
> > @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct 
> > fprop_global *p,
> > if (period - pl->period < BITS_PER_LONG) {
> > s64 val = percpu_counter_read(&pl->events);
> >  
> > -   if (val < (nr_cpu_ids * PROP_BATCH))
> > +   if (val < (nr_cpu_ids * p->batch_size))
> > val = percpu_counter_sum(&pl->events);
> >  
> > percpu_counter_add_batch(&pl->events,
> > -   -val + (val >> (period-pl->period)), PROP_BATCH);
> > +   -val + (val >> (period-pl->period)), p->batch_size);
> > } else
> > percpu_counter_set(&pl->events, 0);
> > pl->period = period;
> > @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct 
> > fprop_global *p,
> >  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> > *pl)
> >  {
> > fprop_reflect_period_percpu(p, pl);
> > -   percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> > +   percpu_counter_add_batch(&pl->events, 1, p->batch_size);
> > percpu_counter_add(&p->events, 1);
> >  }
> >  
> > @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
> > return;
> > } else
> > fprop_reflect_period_percpu(p, pl);
> > -   percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> > +   percpu_counter_add_batch(&pl->events, 1, p->batch_size);
> > percpu_counter_add(&p->events, 1);
> >  }
> > -- 
> > 2.7.5
> > 
> -- 
> Jan Kara 
> SUSE Labs, CR
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/10] lib: add a batch size to fprop_global

2017-11-22 Thread Jan Kara
On Tue 14-11-17 16:56:49, Josef Bacik wrote:
> From: Josef Bacik 
> 
> The flexible proportion stuff has been used to track how many pages we
> are writing out over a period of time, so it counts everything in single
> increments.  If we want to use another base value we need to be able
> to adjust the batch size to fit the units we'll be using for the
> proportions.
> 
> Signed-off-by: Josef Bacik 

Frankly, I had to look into the code to understand what the patch is about.
Can we rephrase the changelog like:

Currently the flexible proportion code uses a fixed per-cpu counter batch
size, since all the counters only increment / decrement to track the number
of pages that completed writeback. When we start tracking the amount of
completed writeback in different units, we need to update the per-cpu
counter batch size accordingly. Make the counter batch size configurable on
a per-proportion-domain basis to allow for this.

Honza
> ---
>  include/linux/flex_proportions.h |  4 +++-
>  lib/flex_proportions.c   | 11 +++++------
>  2 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/flex_proportions.h 
> b/include/linux/flex_proportions.h
> index 0d348e011a6e..853f4305d1b2 100644
> --- a/include/linux/flex_proportions.h
> +++ b/include/linux/flex_proportions.h
> @@ -20,7 +20,7 @@
>   */
>  #define FPROP_FRAC_SHIFT 10
>  #define FPROP_FRAC_BASE (1UL << FPROP_FRAC_SHIFT)
> -
> +#define FPROP_BATCH_SIZE (8*(1+ilog2(nr_cpu_ids)))
>  /*
>   *  Global proportion definitions 
>   */
> @@ -31,6 +31,8 @@ struct fprop_global {
>   unsigned int period;
>   /* Synchronization with period transitions */
>   seqcount_t sequence;
> + /* batch size */
> + s32 batch_size;
>  };
>  
>  int fprop_global_init(struct fprop_global *p, gfp_t gfp);
> diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c
> index 2cc1f94e03a1..5552523b663a 100644
> --- a/lib/flex_proportions.c
> +++ b/lib/flex_proportions.c
> @@ -44,6 +44,7 @@ int fprop_global_init(struct fprop_global *p, gfp_t gfp)
>   if (err)
>   return err;
>   seqcount_init(&p->sequence);
> + p->batch_size = FPROP_BATCH_SIZE;
>   return 0;
>  }
>  
> @@ -166,8 +167,6 @@ void fprop_fraction_single(struct fprop_global *p,
>  /*
>   *  PERCPU 
>   */
> -#define PROP_BATCH (8*(1+ilog2(nr_cpu_ids)))
> -
>  int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp)
>  {
>   int err;
> @@ -204,11 +203,11 @@ static void fprop_reflect_period_percpu(struct 
> fprop_global *p,
>   if (period - pl->period < BITS_PER_LONG) {
>   s64 val = percpu_counter_read(&pl->events);
>  
> - if (val < (nr_cpu_ids * PROP_BATCH))
> + if (val < (nr_cpu_ids * p->batch_size))
>   val = percpu_counter_sum(&pl->events);
>  
>   percpu_counter_add_batch(&pl->events,
> - -val + (val >> (period-pl->period)), PROP_BATCH);
> + -val + (val >> (period-pl->period)), p->batch_size);
>   } else
>   percpu_counter_set(&pl->events, 0);
>   pl->period = period;
> @@ -219,7 +218,7 @@ static void fprop_reflect_period_percpu(struct 
> fprop_global *p,
>  void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu 
> *pl)
>  {
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> + percpu_counter_add_batch(&pl->events, 1, p->batch_size);
>   percpu_counter_add(&p->events, 1);
>  }
>  
> @@ -267,6 +266,6 @@ void __fprop_inc_percpu_max(struct fprop_global *p,
>   return;
>   } else
>   fprop_reflect_period_percpu(p, pl);
> - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH);
> + percpu_counter_add_batch(&pl->events, 1, p->batch_size);
>   percpu_counter_add(&p->events, 1);
>  }
> -- 
> 2.7.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/6] btrfs-progs: check: change traversal way of lowmem mode

2017-11-22 Thread Su Yue



On 11/22/2017 04:15 PM, Qu Wenruo wrote:

Well, David changed the title, so it took me some time to locate the patch.

Although it's too late to point out problems in this patch after it has
been merged, I still think it's worth mentioning some of them,
especially considering the problems it's causing.


Thanks, it indeed has many problems.
Since it was merged, I'll try to fix the bugs it introduces.


On 2017-08-23 10:33, Lu Fengqi wrote:

From: Su Yue 

This patch is a preparation for extent-tree repair in lowmem mode.
In the lowmem mode, tree blocks of the various trees are checked
recursively.
But during repair, adding or deleting item(s) may modify upper nodes,
which would make the repair complicated and dangerous.

Before this patch:
One problem of lowmem check is that it only checks the lowest node's
backref in check_tree_block_ref.
This ensures checked tree blocks are legal and avoids traversing
all trees, for the sake of speed.
However, it has one shortcoming: it cannot detect a backref mistake
when an extent whose owner == offset lacks other backref(s).

In check, correctness is more important than speed.
If errors cannot be detected, repair is impossible.

Change of the patch:
check_chunks_and_extents now has to check *ALL* trees so lowmem check
will behave like the original mode.
Changing the traversal to be the same as for the fs tree, which calls
walk_down_tree_v2() and walk_up_tree_v2(), makes further
repair easier.

Signed-off-by: Su Yue 
---
  cmds-check.c | 695 +--
  1 file changed, 443 insertions(+), 252 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 829f7c5..7c9036c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1878,10 +1878,15 @@ struct node_refs {
u64 bytenr[BTRFS_MAX_LEVEL];
u64 refs[BTRFS_MAX_LEVEL];
int need_check[BTRFS_MAX_LEVEL];
+   /* field for check all trees*/
+   int checked[BTRFS_MAX_LEVEL];
+   /* the correspond extent should mark as full backref or not */
+   int full_backref[BTRFS_MAX_LEVEL];
  };
  
  static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,

-struct node_refs *nrefs, u64 level);
+struct extent_buffer *eb, struct node_refs *nrefs,
+u64 level, int check_all);
  static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
unsigned int ext_ref);
  
@@ -1943,7 +1948,7 @@ again:
  
  		ret = update_nodes_refs(root,

path->nodes[i]->start,
-   nrefs, i);
+   path->nodes[i], nrefs, i, 0);
if (ret)
goto out;
  
@@ -2062,25 +2067,42 @@ static int need_check(struct btrfs_root *root, struct ulist *roots)

return 1;
  }
  
+static int calc_extent_flag_v2(struct btrfs_root *root,

+  struct extent_buffer *eb,
+  u64 *flags_ret);
  /*
   * for a tree node or leaf, we record its reference count, so later if we 
still
   * process this node or leaf, don't need to compute its reference count again.
+ *
+ * @bytenr  if @bytenr == (u64)-1, only update nrefs->full_backref[level]
   */
  static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
-struct node_refs *nrefs, u64 level)
+struct extent_buffer *eb, struct node_refs *nrefs,
+u64 level, int check_all)
  {
-   int check, ret;
-   u64 refs;
struct ulist *roots;
+   u64 refs = 0;
+   u64 flags = 0;
+   int root_level = btrfs_header_level(root->node);
+   int check;
+   int ret;
  
-	if (nrefs->bytenr[level] != bytenr) {

+   if (nrefs->bytenr[level] == bytenr)
+   return 0;
+
+   if (bytenr != (u64)-1) {
+   /* the return value of this function seems a mistake */
ret = btrfs_lookup_extent_info(NULL, root, bytenr,
-  level, 1, &refs, NULL);
-   if (ret < 0)
+  level, 1, &refs, &flags);
+   /* temporary fix */
+   if (ret < 0 && !check_all)
return ret;
  
  		nrefs->bytenr[level] = bytenr;

nrefs->refs[level] = refs;
+   nrefs->full_backref[level] = 0;
+   nrefs->checked[level] = 0;
+
if (refs > 1) {
ret = btrfs_find_all_roots(NULL, root->fs_info, bytenr,
   0, &roots);
@@ -2091,13 +2113,56 @@ static int update_nodes_refs(struct btrfs_root *root, 
u64 bytenr,
ulist_free(roots);
nrefs->need_check[level] = check;
} else {
-

Re: quotas: failure on removing a file via SFTP/SSH

2017-11-22 Thread ST



On Wed, 2017-11-22 at 08:39 +0800, Qu Wenruo wrote:
> 
> On 2017-11-22 05:00, ST wrote:
> > On Tue, 2017-11-21 at 11:33 -0700, Chris Murphy wrote:
> >> On Tue, Nov 21, 2017 at 8:29 AM, ST  wrote:
> >>> I'm trying to use quotas for a simple chrooted sftp setup, limiting
> >>> space for each user's subvolume (now for testing to 1M).
> >>>
> >>> I tried to hit the limit by uploading files and once it comes to the
> >>> limit I face following problem: if I try to free space by removing a
> >>> file via Linux sftp client (or Filezilla) - I get error:
> >>> "Couldn't delete file: Failure"
> >>>
> >>> Sometimes, but not always, if I repeat it for 3-5 times it does remove
> >>> the file at the end.
> >>> If I login as root and try to remove the file via SSH I get the error:
> >>> "rm: cannot remove 'example.txt': Disk quota exceeded"
> >>>
> >>> What is the problem? And how can I solve it?
> >>
> >> Kernel version first.
> >>
> >> If it's possible, please use latest kernel, at least newer than v4.10,
> >> since we have a lot of qgroup reservation related fixes in newer 
> >> kernel.
> >>
> >> Then, for small quota, due to the nature of btrfs metadata CoW and
> >> relative large default node size (16K), it's quite easy to hit disk
> >> quota for metadata.
> >
> > Yes, but why do I get the error specifically on REMOVING a file? Even if I
> > hit disk quota - if I free up space - it should be possible, isn't it?
> 
>  It's only true for fs modifying its metadata in-place (and use journal
>  to protect it).
> 
>  For fs using metadata CoW, even freeing space needs extra space for new
>  metadata.
> 
> >>>
> >>> Wait, it doesn't sound like a bug, but rather like a flaw in design.
> >>> This means - each time a user hits his quota limit he will get stuck
> >>> without being able to free space?!!
> >>
> >> It's a good question if quotas can make it possible for a user to get
> >> wedged into a situation that will require an admin to temporarily
> >> raise the quota in order to make file deletion possible.
> > 
> > Why question? It's a fact. That's what I face right now.
> > 
> >>  This is not a
> >> design flaw, all COW file systems *add* data when deleting. The
> >> challenge is how to teach the quota system to act like a hard limit
> >> for data writes that clearly bust the quota, versus a soft limit that
> >> tolerates some extra amount above the quota for the purpose of
> >> eventually deleting data. That's maybe non-trivial. It's not that it's
> >> a design flaw. Metadata can contain inline data, so how exactly do you
> >> tell what kinds of writes are permitted (deleting a file) and what
> >> kind of writes are not (append data to a file, or create new file)?
> >>
> >> But for sure the user space tools should prevent setting too low a
> >> quota limit. If the limit cannot be reasonably expected to work, it
> >> should be disallowed. So maybe the user space tools need to enforce a
> >> minimum quota, something like 100MiB, or whatever.
> >>
> > 
> > Would you like to open an issue with your enhancement suggestions on the
> > bug tracker so this case doesn't get forgotten?
> 
> That's why I ask for the kernel version.
> 
> IIRC in newer kernels, quota doesn't limit deletion anymore, preventing
> you from hitting such a dilemma.

I'm sorry. I've mentioned it in another mail in this thread, here it is:

I am on Debian 9 (stable), so kernel version is:
uname -r
4.9.0-4-amd64
btrfs-tools (4.7.3-1)

I hope kernel 4.13 will move from Debian stable-backports to stable in
the not-so-distant future. Is this issue already resolved in 4.13?

Thank you!



Re: [PATCH 2/6] btrfs-progs: check: change traversal way of lowmem mode

2017-11-22 Thread Qu Wenruo
Well, David changed the title, so it took me some time to locate the patch.

Although it's too late to point out problems in this patch after it has
been merged, I still think it's worth mentioning some of them,
especially considering the problems it's causing.

On 2017-08-23 10:33, Lu Fengqi wrote:
> From: Su Yue 
> 
> This patch is a preparation for extent-tree repair in lowmem mode.
> In the lowmem mode, tree blocks of the various trees are checked
> recursively.
> But during repair, adding or deleting item(s) may modify upper nodes,
> which would make the repair complicated and dangerous.
> 
> Before this patch:
> One problem of lowmem check is that it only checks the lowest node's
> backref in check_tree_block_ref.
> This ensures checked tree blocks are legal and avoids traversing
> all trees, for the sake of speed.
> However, it has one shortcoming: it cannot detect a backref mistake
> when an extent whose owner == offset lacks other backref(s).
> 
> In check, correctness is more important than speed.
> If errors cannot be detected, repair is impossible.
> 
> Change of the patch:
> check_chunks_and_extents now has to check *ALL* trees so lowmem check
> will behave like the original mode.
> Changing the traversal to be the same as for the fs tree, which calls
> walk_down_tree_v2() and walk_up_tree_v2(), makes further
> repair easier.
> 
> Signed-off-by: Su Yue 
> ---
>  cmds-check.c | 695 
> +--
>  1 file changed, 443 insertions(+), 252 deletions(-)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 829f7c5..7c9036c 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -1878,10 +1878,15 @@ struct node_refs {
>   u64 bytenr[BTRFS_MAX_LEVEL];
>   u64 refs[BTRFS_MAX_LEVEL];
>   int need_check[BTRFS_MAX_LEVEL];
> + /* field for check all trees*/
> + int checked[BTRFS_MAX_LEVEL];
> + /* the correspond extent should mark as full backref or not */
> + int full_backref[BTRFS_MAX_LEVEL];
>  };
>  
>  static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
> -  struct node_refs *nrefs, u64 level);
> +  struct extent_buffer *eb, struct node_refs *nrefs,
> +  u64 level, int check_all);
>  static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
>   unsigned int ext_ref);
>  
> @@ -1943,7 +1948,7 @@ again:
>  
>   ret = update_nodes_refs(root,
>   path->nodes[i]->start,
> - nrefs, i);
> + path->nodes[i], nrefs, i, 0);
>   if (ret)
>   goto out;
>  
> @@ -2062,25 +2067,42 @@ static int need_check(struct btrfs_root *root, struct 
> ulist *roots)
>   return 1;
>  }
>  
> +static int calc_extent_flag_v2(struct btrfs_root *root,
> +struct extent_buffer *eb,
> +u64 *flags_ret);
>  /*
>   * for a tree node or leaf, we record its reference count, so later if we 
> still
>   * process this node or leaf, don't need to compute its reference count 
> again.
> + *
> + * @bytenr  if @bytenr == (u64)-1, only update nrefs->full_backref[level]
>   */
>  static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
> -  struct node_refs *nrefs, u64 level)
> +  struct extent_buffer *eb, struct node_refs *nrefs,
> +  u64 level, int check_all)
>  {
> - int check, ret;
> - u64 refs;
>   struct ulist *roots;
> + u64 refs = 0;
> + u64 flags = 0;
> + int root_level = btrfs_header_level(root->node);
> + int check;
> + int ret;
>  
> - if (nrefs->bytenr[level] != bytenr) {
> + if (nrefs->bytenr[level] == bytenr)
> + return 0;
> +
> + if (bytenr != (u64)-1) {
> + /* the return value of this function seems a mistake */
>   ret = btrfs_lookup_extent_info(NULL, root, bytenr,
> -level, 1, &refs, NULL);
> - if (ret < 0)
> +level, 1, &refs, &flags);
> + /* temporary fix */
> + if (ret < 0 && !check_all)
>   return ret;
>  
>   nrefs->bytenr[level] = bytenr;
>   nrefs->refs[level] = refs;
> + nrefs->full_backref[level] = 0;
> + nrefs->checked[level] = 0;
> +
>   if (refs > 1) {
>   ret = btrfs_find_all_roots(NULL, root->fs_info, bytenr,
>  0, &roots);
> @@ -2091,13 +2113,56 @@ static int update_nodes_refs(struct btrfs_root *root, 
> u64 bytenr,
>   ulist_free(roots);
>   nrefs->need_check[level] = check;
>   } else {
> -