Re: [PATCH 0/4] eventfs: Some more minor fixes
On Tue, Nov 21, 2023 at 06:10:03PM -0500, Steven Rostedt wrote:
> Mark Rutland reported some crashes from the latest eventfs updates.
> This fixes most of them.
>
> He still has one splat that he can trigger but I can not. Still looking
> into that.

Reviewed-by: Josef Bacik

Thanks,

Josef
Re: [RFC v3 0/2] vfs / btrfs: add support for ustat()
On 4/15/21 1:53 PM, Luis Chamberlain wrote:

On Wed, Aug 23, 2017 at 3:31 PM Jeff Mahoney wrote:

On 8/15/14 5:29 AM, Al Viro wrote:

On Thu, Aug 14, 2014 at 07:58:56PM -0700, Luis R. Rodriguez wrote:

Christoph had noted that this seemed associated to the problem that btrfs uses different assignments for st_dev than s_dev, but much as I'd like to see that changed, based on discussions so far it's unclear if this is going to be possible unless strong commitment is reached.

Resurrecting a dead thread since we've been carrying this patch anyway since then.

Explain, please. Whose commitment, and commitment to what, exactly? Having different ->st_dev values for different files on the same fs is a bloody bad idea; why does btrfs do that at all? If nothing else, it breaks the usual "are those two files on the same fs?" tests...

It's because btrfs snapshots would otherwise have inode number collisions. Changing the inode numbers for snapshots would negate a big benefit of btrfs snapshots: the quick creation and lightweight on-disk representation due to metadata sharing.

The thing is that ustat() used to work. Your commit 0ee5dc676a5f8 (btrfs: kill magical embedded struct superblock) introduced a regression: since it replaced the superblock with a simple dev_t, it rendered the device no longer discoverable by user_get_super. We need a list_head to attach for searching.

There's an argument that this is hacky. It's valid. The only other feedback I've heard is to use a real superblock for subvolumes to do this instead. That doesn't work either, due to things like freeze/thaw and inode writeback. Ultimately, what we need is a single file system with multiple namespaces. Years ago we just needed different inode namespaces, but as people have started adopting btrfs for containers, we need more than that. I've heard requests for per-subvolume security contexts. I'd imagine user namespaces are on someone's wish list.
A working df can be done with ->d_automount, but the way btrfs handles having a "canonical" subvolume location has always been a way to avoid directory loops. I'd like to just automount subvolumes everywhere they're referenced.

One solution, for which I have no code yet, is to have something like a superblock-light that we can hang things like a security context, a user namespace, and an anonymous dev off of. Most file systems would have just one. Btrfs would have one per subvolume. That's a big project with a bunch of discussion.

4 years have gone by and this patch is still being carried around for btrfs. Other than resolving this ustat() issue for btrfs, are there new reasons to see this effort done properly? Are there other filesystems that would benefit? I'd like to get an idea of the stakeholders here before considering taking this on or not.

Not really sure why this needs to be addressed; we have statfs(), and what we have has worked forever now. There are a lot of larger things that need to be addressed in general to support the volume approach inside file systems, and that is going to require a lot of work inside of VFS. If you feel like tackling that work and then wiring up btrfs, by all means have at it, but I'm not seeing an urgent need to address this. Thanks, Josef
Re: the qemu-nbd process automatically exit with the commit 43347d56c 'livepatch: send a fake signal to all blocking tasks'
On 4/14/21 11:21 AM, xiaojun.zhao...@gmail.com wrote:

On Wed, 14 Apr 2021 13:27:43 +0200 (CEST) Miroslav Benes wrote:

Hi,

On Wed, 14 Apr 2021, xiaojun.zhao...@gmail.com wrote:

I found the qemu-nbd process (started with qemu-nbd -t -c /dev/nbd0 nbd.qcow2) will automatically exit when I patch functions of the nbd driver with livepatch. The relevant nbd source:

static int nbd_start_device_ioctl(struct nbd_device *nbd,
				  struct block_device *bdev)
{
	struct nbd_config *config = nbd->config;
	int ret;

	ret = nbd_start_device(nbd);
	if (ret)
		return ret;

	if (max_part)
		bdev->bd_invalidated = 1;
	mutex_unlock(&nbd->config_lock);
	ret = wait_event_interruptible(config->recv_wq,
				       atomic_read(&config->recv_threads) == 0);
	if (ret)
		sock_shutdown(nbd);
	flush_workqueue(nbd->recv_workq);
	mutex_lock(&nbd->config_lock);
	nbd_bdev_reset(bdev);
	/* user requested, ignore socket errors */
	if (test_bit(NBD_RT_DISCONNECT_REQUESTED, &config->runtime_flags))
		ret = 0;
	if (test_bit(NBD_RT_TIMEDOUT, &config->runtime_flags))
		ret = -ETIMEDOUT;
	return ret;
}

So my understanding is that nbd spawns a number (config->recv_threads) of workqueue jobs and then waits for them to finish. It waits interruptibly, so any signal would make wait_event_interruptible() return -ERESTARTSYS. The livepatch fake signal is no exception there. The error is then propagated back to userspace, unless the user requested a disconnection or there is a timeout set. How does userspace then react to it? Is _interruptible there because userspace sends a signal when NBD_RT_DISCONNECT_REQUESTED is set? How does userspace handle ordinary signals? This all sounds a bit strange, but I may easily be missing something.

When the nbd driver waits for atomic_read(&config->recv_threads) == 0, klp will send a fake signal to it and then the qemu-nbd process exits. The sysfs signal knob that controlled this action was removed in commit 10b3d52790e 'livepatch: Remove signal sysfs attribute'. Are there other ways to control this action? How?

No, there is no way currently.
We send a fake signal automatically.

Regards,
Miroslav

I get IO errors on the nbd device when I use a livepatch on the nbd code, and I guess a livepatch on other kernel code may cause the same IO error. For now I've decided to work around the problem by adding a livepatch for klp itself to disable the automatic fake signal.

Would wait_event_killable() fix this problem? I'm not sure any client implementations depend on being able to send other signals to the client process, so it should be safe from that standpoint. Not sure if the livepatch thing would still get an error at that point tho. Thanks, Josef
Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"
On 3/18/21 1:42 AM, Kai-Heng Feng wrote:

On Thu, Mar 18, 2021 at 1:25 AM Josef Bacik wrote:

[snipped]

"shutdown now" works fine with and without your patch. Thanks,

Rafael, please revert the patch while we are working on it.

Josef, can you please test the following patch:

That made it work fine. Thanks,

Josef
Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"
On 3/17/21 12:14 PM, Kai-Heng Feng wrote:

On Wed, Mar 17, 2021 at 11:19 PM Josef Bacik wrote:

On 3/16/21 10:50 PM, Kai-Heng Feng wrote:

Hi,

On Wed, Mar 17, 2021 at 10:17 AM Josef Bacik wrote:

This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095. This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.

But this patch also helps many HP laptops, so maybe we should figure out what's going on on the Poweredge r440. Does it also panic on shutdown?

Sure, I'll test whatever to get it fixed, but I just wasted 3 days bisecting and lost a weekend of performance testing on btrfs because of this regression, so until you figure out how it broke it needs to be reverted so people don't have to figure out why reboot suddenly isn't working.

That's unfortunate to hear. However, I've been spending tons of time on bisecting kernels. To me it's just a normal part of kernel development, so I won't call it "wasted". Feel free to revert the patch though.

Running "halt" has the same effect with and without your patch: it gets to "system halted" and just sits there without powering off. Not entirely sure why that is, but there's no panic.

What about shutdown? pm_power_off_prepare() is used by shutdown but it's not used by halt.

"shutdown now" works fine with and without your patch. Thanks, Josef
Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"
On 3/16/21 10:50 PM, Kai-Heng Feng wrote:

Hi,

On Wed, Mar 17, 2021 at 10:17 AM Josef Bacik wrote:

This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095. This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.

But this patch also helps many HP laptops, so maybe we should figure out what's going on on the Poweredge r440. Does it also panic on shutdown?

Sure, I'll test whatever to get it fixed, but I just wasted 3 days bisecting and lost a weekend of performance testing on btrfs because of this regression, so until you figure out how it broke it needs to be reverted so people don't have to figure out why reboot suddenly isn't working.

Running "halt" has the same effect with and without your patch: it gets to "system halted" and just sits there without powering off. Not entirely sure why that is, but there's no panic.

The panic itself is lost, but I see there's an NMI and I have the RIP:

(gdb) list *('mwait_idle_with_hints.constprop.0'+0x4b)
0x816dabdb is in mwait_idle_with_hints (./arch/x86/include/asm/current.h:15).

10
11	DECLARE_PER_CPU(struct task_struct *, current_task);
12
13	static __always_inline struct task_struct *get_current(void)
14	{
15		return this_cpu_read_stable(current_task);
16	}
17
18	#define current get_current()
19

	jmp     0x936dac02
	nopl    (%rax)
	jmp     0x936dabac
	nopl    (%rax)
	mfence
	mov     %gs:0x17bc0,%rax
	clflush (%rax)
	mfence
	xor     %edx,%edx
	mov     %rdx,%rcx
	mov     %gs:0x17bc0,%rax
	monitor %rax,%rcx,%rdx
	mov     (%rax),%rax
	test    $0x8,%al
	jne     0x936dabdb
	jmpq    0x936dabd0
	verw    0x9f9fec(%rip)        # 0x940d4bbc
	mov     $0x1,%ecx
	mov     %rdi,%rax
	mwait   %rax,%rcx
	mov     %gs:0x17bc0,%rax
	lock andb $0xdf,0x2(%rax)
	lock addl $0x0,-0x4(%rsp)
	mov     (%rax),%rax
	test    $0x8,%al
	je      0x936dac01
	andl    $0x7fff,%gs:0x6c93cf7f(%rip)        # 0x17b80
	retq
	mov     %gs:0x17bc0,%rax
	lock orb $0x20,0x2(%rax)
	mov     (%rax),%rax
	test    $0x8,%al
	jne     0x936dabdb
	jmpq    0x936dab95
	nopl    0x0(%rax)

0x4b is after the mwait, which means we're panicking in current_clr_polling(), where we do clear_thread_flag(TIF_POLLING_NRFLAG). Thanks, Josef
[PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

This patch causes a panic when rebooting my Dell Poweredge r440. I do not have the full panic log as it's lost at that stage of the reboot and I do not have a serial console. Reverting this patch makes my system able to reboot again.

Signed-off-by: Josef Bacik
---
- apologies, I mistyped the lkml list email.

 kernel/reboot.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/reboot.c b/kernel/reboot.c
index eb1b15850761..a6ad5eb2fa73 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -244,8 +244,6 @@ void migrate_to_reboot_cpu(void)
 void kernel_restart(char *cmd)
 {
 	kernel_restart_prepare(cmd);
-	if (pm_power_off_prepare)
-		pm_power_off_prepare();
 	migrate_to_reboot_cpu();
 	syscore_shutdown();
 	if (!cmd)
-- 
2.26.2
Re: [PATCH v4 02/28] mm: Add an unlock function for PG_private_2/PG_fscache
On 3/16/21 8:43 PM, Linus Torvalds wrote:

[ Adding btrfs people explicitly, maybe they see this on the fs-devel list, but maybe they don't react.. ]

On Tue, Mar 16, 2021 at 12:07 PM Matthew Wilcox wrote:

This isn't a problem with this patch per se, but I'm concerned about private2 and expected page refcounts.

Ugh. You are very right. It would be good to just change the rules - I get the feeling nobody actually depended on them anyway because they were _so_ esoteric.

static inline int is_page_cache_freeable(struct page *page)
{
	/*
	 * A freeable page cache page is referenced only by the caller
	 * that isolated the page, the page cache and optional buffer
	 * heads at page->private.
	 */
	int page_cache_pins = thp_nr_pages(page);
	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}

You're right, that "page_has_private()" is really really nasty. The comment is, I think, the traditional usage case, which used to be about page->buffers. Obviously these days it is now about page->private with PG_private set, pointing to buffers (attach_page_private() and detach_page_private()). But as you point out:

#define PAGE_FLAGS_PRIVATE \
	(1UL << PG_private | 1UL << PG_private_2)

So... a page with both flags cleared should have a refcount of N, and a page with one or both flags set should have a refcount of N+1. Could we just remove the PG_private_2 thing in this context entirely, and make the rule be that

 (a) PG_private means that you have some local private data in page->private, and that's all that matters for the "freeable" thing.

 (b) PG_private_2 does *not* have the same meaning, and has no bearing on freeability (and only the refcount matters)

I _think_ the btrfs behavior is to only use PagePrivate2() when it has a reference to the page, so btrfs doesn't care? I think fscache is already happy to take the page count when using PG_private_2 for locking, exactly because I didn't want to have any confusion about lifetimes.
But this "page_has_private()" math ends up meaning it's confusing anyway.

btrfs people? What are the semantics for PG_private_2? Is it just a flag, and you really don't want it to have anything to do with any page lifetime decisions? Or?

Yeah, it's just a flag; we use it to tell that the page is part of a range that has been allocated for IO. The lifetime of the flag is independent of the page lifetime, but a page with it set is generally either dirty or under writeback, so either the page goes through truncate and we clear PagePrivate2 there, or it actually goes through IO and the flag is cleared before we drop the page in our endio. We _always_ have PG_private set on the page as long as we own it, and PG_private_2 is only set in this IO-related context, so we're safe there because of the rules around PG_dirty/PG_writeback. We don't need an extra ref for it being set. Thanks, Josef
Re: KASAN: use-after-free Read in nbd_genl_connect
On 2/22/21 3:25 AM, syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:    f40ddce8 Linux 5.11
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=179e8d22d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=e53d04227c52a0df
dashboard link: https://syzkaller.appspot.com/bug?extid=429d3f82d757c211bff3
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10d190cad0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13dc8a82d0

Bisection is inconclusive: the issue happens on the oldest tested release.

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1128ae60d0
final oops:     https://syzkaller.appspot.com/x/report.txt?x=1328ae60d0
console output: https://syzkaller.appspot.com/x/log.txt?x=1528ae60d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+429d3f82d757c211b...@syzkaller.appspotmail.com

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git nbd-kasan-fix
Re: KASAN: use-after-free Read in nbd_genl_connect
On 2/22/21 4:34 AM, Hillf Danton wrote:

Mon, 22 Feb 2021 00:25:18 -0800

Hello,

syzbot found the following issue on:

HEAD commit:    f40ddce8 Linux 5.11
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=179e8d22d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=e53d04227c52a0df
dashboard link: https://syzkaller.appspot.com/bug?extid=429d3f82d757c211bff3
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10d190cad0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13dc8a82d0

Bisection is inconclusive: the issue happens on the oldest tested release.

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1128ae60d0
final oops:     https://syzkaller.appspot.com/x/report.txt?x=1328ae60d0
console output: https://syzkaller.appspot.com/x/log.txt?x=1528ae60d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+429d3f82d757c211b...@syzkaller.appspotmail.com

==
BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
BUG: KASAN: use-after-free in refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76
Read of size 4 at addr 888014ca19a0 by task syz-executor980/8540

CPU: 0 PID: 8540 Comm: syz-executor980 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
 __kasan_report mm/kasan/report.c:396 [inline]
 kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413
 check_memory_region_inline mm/kasan/generic.c:179 [inline]
 check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
 instrument_atomic_read include/linux/instrumented.h:71 [inline]
 atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
 refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76
 refcount_dec_and_mutex_lock+0x19/0x140 lib/refcount.c:115
 nbd_put drivers/block/nbd.c:248 [inline]

This put is unbalanced, given the 1978 line below.

 nbd_genl_connect+0xee7/0x1560 drivers/block/nbd.c:1980
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 sys_sendmsg+0x6e8/0x810 net/socket.c:2345
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440709
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7ffd63e9cc88 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: e3fb RCX: 00440709
RDX: RSI: 2340 RDI: 0003
RBP: R08: R09: 7ffd63e9ce28
R10: R11: 0246 R12: 7ffd63e9cc9c
R13: 431bde82d7b634db R14: 004ae018 R15: 004004a0

Allocated by task 8536:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:401 [inline]
 kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429
 kmalloc include/linux/slab.h:552 [inline]
 kzalloc include/linux/slab.h:682 [inline]
 nbd_dev_add+0x44/0x8e0 drivers/block/nbd.c:1673
 nbd_genl_connect+0x59c/0x1560 drivers/block/nbd.c:1838
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 sys_sendmsg+0x6e8/0x810 net/socket.c:2345
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
Re: [PATCH v6 0/2] fix a NULL pointer bug and simplify the code
On 2/18/21 7:26 AM, Sun Ke wrote:

fix a NULL pointer bug and simplify the code

v6: Just add if (nbd->recv_workq) to nbd_disconnect_and_put().
v5: Adjust the title and add "Suggested-by".
v4: Share exception handling code for if branches and move the put_nbd adjustment to a separate patch.
v3: Do not use unlock and add put_nbd.
v2: Use jump target unlock.

Sun Ke (2):
  nbd: Fix NULL pointer in flush_workqueue
  nbd: share nbd_put and return by goto put_nbd

 drivers/block/nbd.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

You can add

Reviewed-by: Josef Bacik

to both of these, thanks,

Josef
Re: [PATCH v5 1/2] nbd: Fix NULL pointer in flush_workqueue
On 2/5/21 1:56 AM, Sun Ke wrote:

Open /dev/nbdX first, the config_refs will be 1 and the pointers in nbd_device are still null. Disconnect /dev/nbdX, then it references a null recv_workq. The protection by config_refs in nbd_genl_disconnect is useless.

[  656.366194] BUG: kernel NULL pointer dereference, address: 0020
[  656.368943] #PF: supervisor write access in kernel mode
[  656.369844] #PF: error_code(0x0002) - not-present page
[  656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0
[  656.371693] Oops: 0002 [#1] SMP
[  656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 5.11.0-rc5-00040-g76c057c84d28 #1
[  656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  656.375904] RIP: 0010:mutex_lock+0x29/0x60
[  656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00 48 0f b1 55 d
[  656.378934] RSP: 0018:c95eb9b0 EFLAGS: 00010246
[  656.379350] RAX: RBX: RCX:
[  656.379915] RDX: 888104cf2600 RSI: aae8f452 RDI: 0020
[  656.380473] RBP: 0020 R08: R09: 88813bd6b318
[  656.381039] R10: 00c7 R11: fefefefefefefeff R12: 888102710b40
[  656.381599] R13: c95eb9e0 R14: b2930680 R15: 88810770ef00
[  656.382166] FS: 7fdf117ebb40() GS:88813bd4() knlGS:
[  656.382806] CS: 0010 DS: ES: CR0: 80050033
[  656.383261] CR2: 0020 CR3: 000100c84000 CR4: 06e0
[  656.383819] DR0: DR1: DR2:
[  656.384370] DR3: DR6: fffe0ff0 DR7: 0400
[  656.384927] Call Trace:
[  656.385111]  flush_workqueue+0x92/0x6c0
[  656.385395]  nbd_disconnect_and_put+0x81/0xd0
[  656.385716]  nbd_genl_disconnect+0x125/0x2a0
[  656.386034]  genl_family_rcv_msg_doit.isra.0+0x102/0x1b0
[  656.386422]  genl_rcv_msg+0xfc/0x2b0
[  656.386685]  ? nbd_ioctl+0x490/0x490
[  656.386954]  ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0
[  656.387354]  netlink_rcv_skb+0x62/0x180
[  656.387638]  genl_rcv+0x34/0x60
[  656.387874]  netlink_unicast+0x26d/0x590
[  656.388162]  netlink_sendmsg+0x398/0x6c0
[  656.388451]  ? netlink_rcv_skb+0x180/0x180
[  656.388750]  sys_sendmsg+0x1da/0x320
[  656.389038]  ? sys_recvmsg+0x130/0x220
[  656.389334]  ___sys_sendmsg+0x8e/0xf0
[  656.389605]  ? ___sys_recvmsg+0xa2/0xf0
[  656.389889]  ? handle_mm_fault+0x1671/0x21d0
[  656.390201]  __sys_sendmsg+0x6d/0xe0
[  656.390464]  __x64_sys_sendmsg+0x23/0x30
[  656.390751]  do_syscall_64+0x45/0x70
[  656.391017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
Suggested-by: Markus Elfring
Signed-off-by: Sun Ke
---
v4: Share exception handling code for if branches
v3: Do not use unlock and add put_nbd.
v2: Use jump target unlock.
---
 drivers/block/nbd.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e6ea5d344f87..3c9b3bf3f4c2 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -2014,17 +2014,20 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 	mutex_lock(&nbd_index_mutex);
 	nbd = idr_find(&nbd_index_idr, index);
 	if (!nbd) {
-		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
 		       index);
-		return -EINVAL;
+		goto unlock_index;
 	}
-	if (!refcount_inc_not_zero(&nbd->refs)) {
-		mutex_unlock(&nbd_index_mutex);
+	mutex_lock(&nbd->config_lock);
+	if (!refcount_inc_not_zero(&nbd->refs) || !nbd->recv_workq) {

We can't safely take the ->config_lock if we don't have a ref. Just add

	if (nbd->recv_workq)
		flush_workqueue(nbd->recv_workq);

to nbd_disconnect_and_put(). Problem solved: we can't drop it until we drop our last config ref, and we're holding a config ref here. We could probably add it in the meantime, but at this point we've disconnected all of our sockets so it doesn't matter. Thanks, Josef
Re: [PATCH] nbd: Convert to DEFINE_SHOW_ATTRIBUTE
On 2/6/21 2:10 AM, winnd...@163.com wrote:

From: Liao Pingfang

Use the DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liao Pingfang

Reviewed-by: Josef Bacik

Thanks, Josef
Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice
On 1/27/21 8:57 AM, Michal Rostecki wrote:

From: Michal Rostecki

Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function, which was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function.

This change fixes that by passing the extent mapping to the btrfs_get_io_geometry() function as an argument.

v2: When btrfs_get_chunk_map() returns an error in btrfs_submit_direct():
- Use errno_to_blk_status(PTR_ERR(em)) as the status
- Set em to NULL

Signed-off-by: Michal Rostecki

This panic'ed all of my test vms in their overnight xfstests runs; the panic is this:

[ 2449.936502] BTRFS critical (device dm-7): mapping failed logical 1113825280 bio len 40960 len 24576
[ 2449.937073] [ cut here ]
[ 2449.937329] kernel BUG at fs/btrfs/volumes.c:6450!
[ 2449.937604] invalid opcode: [#1] SMP NOPTI
[ 2449.937855] CPU: 0 PID: 259045 Comm: kworker/u5:0 Not tainted 5.11.0-rc5+ #122
[ 2449.938252] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[ 2449.938713] Workqueue: btrfs-worker-high btrfs_work_helper
[ 2449.939016] RIP: 0010:btrfs_map_bio.cold+0x5a/0x5c
[ 2449.939392] Code: 37 87 ff ff e8 ed d4 8a ff 48 83 c4 18 e9 b5 52 8b ff 49 89 c8 4c 89 fa 4c 89 f1 48 c7 c6 b0 c0 61 8b 48 89 ef e8 11 87 ff ff <0f> 0b 4c 89 e7 e8 42 09 86 ff e9 fd 59 8b ff 49 8b 7a 50 44 89 f2
[ 2449.940402] RSP: :9f24c1637d90 EFLAGS: 00010282
[ 2449.940689] RAX: 0057 RBX: 90c78ff716b8 RCX:
[ 2449.941080] RDX: 90c7fbc27ae0 RSI: 90c7fbc19110 RDI: 90c7fbc19110
[ 2449.941467] RBP: 90c7911d4000 R08: R09:
[ 2449.941853] R10: 9f24c1637b48 R11: 8b9723e8 R12:
[ 2449.942243] R13: R14: a000 R15: 4263a000
[ 2449.942632] FS: () GS:90c7fbc0() knlGS:
[ 2449.943072] CS: 0010 DS: ES: CR0: 80050033
[ 2449.943386] CR2: 5575163c3080 CR3: 00010ad6c004 CR4: 00370ef0
[ 2449.943772] Call Trace:
[ 2449.943915]  ? lock_release+0x1c3/0x290
[ 2449.944135]  run_one_async_done+0x3a/0x60
[ 2449.944360]  btrfs_work_helper+0x136/0x520
[ 2449.944588]  process_one_work+0x26e/0x570
[ 2449.944812]  worker_thread+0x55/0x3c0
[ 2449.945016]  ? process_one_work+0x570/0x570
[ 2449.945250]  kthread+0x137/0x150
[ 2449.945430]  ? __kthread_bind_mask+0x60/0x60
[ 2449.945666]  ret_from_fork+0x1f/0x30

It happens when you run btrfs/060. Please make sure to run xfstests against patches before you submit them upstream. Thanks, Josef
Re: [PATCH] btrfs: Simplify the calculation of variables
On 1/27/21 3:11 AM, Abaci Team wrote:

Fix the following coccicheck warning:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B.

Reported-by: Abaci Robot
Suggested-by: Jiapeng Zhong
Signed-off-by: Abaci Team

Reviewed-by: Josef Bacik

Thanks, Josef
Re: [RFC PATCH 00/37] block: introduce bio_init_fields()
On 1/19/21 12:05 AM, Chaitanya Kulkarni wrote:

Hi,

This is a *compile only RFC* which adds a generic helper to initialize the various fields of the bio that are repeated all over in file-systems, the block layer, and drivers. The new helper allows callers to initialize various members such as bdev, sector, private, end io callback, io priority, and write hints. The objective of this RFC is only to start a discussion; it is not completely tested at all.

It would help to know what you're trying to accomplish here. I'd echo Mike's comments about how it makes it annoying to update things in the future. In addition, there are so many fields that I'm not going to remember what each one is without having to look it up, which makes it annoying to use and to review.

If it's simply to make sure fields are initialized, then you could add debug sanity checks to submit_bio(). If it's to clean up duplication, well, I'd argue that the duplication is much clearer than positional arguments in a giant function call. If you are wanting to change a particular part of the bio to be initialized properly, like Dennis's work to make sure the bi_blkg was initialized at bi_bdev set time, then a more targeted patch series with a specific intent will be more useful and more successful. Thanks, Josef
Re: [PATCH] nbd: Respect max_part for all partition scans
On 12/17/20 3:58 AM, Josh Triplett wrote: The creation path of the NBD device respects max_part and only scans for partitions if max_part is not 0. However, some other code paths ignore max_part, and unconditionally scan for partitions. Add a check for max_part on each partition scan. Signed-off-by: Josh Triplett Reviewed-by: Josef Bacik Thanks, Josef
Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
On 12/9/20 10:38 PM, Bart Van Assche wrote:

On 12/7/20 10:55 AM, Palmer Dabbelt wrote:

All in all, I've found it a bit hard to figure out what sort of interest people have in dm-user: when I bring this up I seem to run into people who've done similar things before and are vaguely interested, but certainly nobody is chomping at the bit. I'm sending it out in this early state to try and figure out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder whether this new driver could be useful to their use cases?

Sorry, gmail+imap sucks and I can't get my email client to get at the original thread. However, here is my take.

1) The advantages of using dm-user over NBD that you listed aren't actually problems for NBD. We have NBD working in production where you can hand off the sockets for the server without ending in timeouts; it was actually the main reason we wrote our own server, so we could use the FD transfer stuff to restart the server without impacting any clients that had the device in use.

2) The extra copy is a big deal; in fact we already have too many copies in our existing NBD setup and are actively looking for ways to avoid those.

Don't take this as me saying I don't think dm-user is a good idea, but I think at the very least it should start with the very best we have to offer, starting with as few copies as possible. If you are using it currently in production then cool, there's clearly a use case for it. Personally, as I get older and grouchier I want fewer things in the kernel, so if this enables us to eventually do everything NBD-related in userspace with no performance drop then I'd be down. I don't think you need to make that your primary goal, but at least polishing this up so it could potentially be abused in the future would make it more compelling for merging. Thanks, Josef
[LSFMMBPF 2021] A status update
Hello,

We on the program committee hope everybody has been able to stay safe and healthy during this challenging time, and look forward to being able to see all of you in person again when it is safe.

The current plans for LSFMMBPF 2021 are to schedule an in-person conference in H2 (after June) of 2021. The tentative plan is to use the same hotel that we had planned to use for 2020, as we still have contracts with them; however, clearly that is not set in stone. The Linux Foundation has done a wonderful job of working with us to formulate a plan and figure out the logistics that will work the best for everybody, and I really can't thank them enough for their help.

Once we have a finalized date we will redo the CFP emails, probably coming out in the March time frame. If you have any questions or concerns, please feel free to respond to this email, or email me or any of the other PC members privately, and we will do our best to answer your questions. Rest assured the general timing of the conference is going to take into account the wide variety of schedules that we are dealing with, and we will do our best to come up with something that works for as many people as possible.

We hope that you and your families continue to stay safe and healthy.

Thank you on behalf of the program committee:

Josef Bacik (Filesystems)
Amir Goldstein (Filesystems)
Martin K. Petersen (Storage)
Omar Sandoval (Storage)
Michal Hocko (MM)
Dan Williams (MM)
Alexei Starovoitov (BPF)
Daniel Borkmann (BPF)
Re: [GIT PULL][PATCH v5 0/9] Update to zstd-1.4.6
On 11/3/20 1:05 AM, Nick Terrell wrote: From: Nick Terrell Please pull from g...@github.com:terrelln/linux.git tags/v5-zstd-1.4.6 to get these changes. Alternatively the patchset is included. Where did we come down on the code formatting question? Personally I'm of the mind that as long as the consumers themselves adhere to the proper coding style I'm fine not maintaining the code style as long as we get the benefit of easily syncing in code from the upstream project. Thanks, Josef
Re: [PATCH v5 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 11/3/20 1:05 AM, Nick Terrell wrote: From: Nick Terrell Move away from the compatibility wrapper to the zstd-1.4.6 API. This code is functionally equivalent. Signed-off-by: Nick Terrell --- fs/btrfs/zstd.c | 48 1 file changed, 28 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index a7367ff573d4..6b466e090cd7 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -16,7 +16,7 @@ #include #include #include -#include +#include #include "misc.h" #include "compression.h" #include "ctree.h" @@ -159,8 +159,8 @@ static void zstd_calc_ws_mem_sizes(void) zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT); size_t level_size = max_t(size_t, - ZSTD_CStreamWorkspaceBound(params.cParams), - ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)); + ZSTD_estimateCStreamSize_usingCParams(params.cParams), + ZSTD_estimateDStreamSize(ZSTD_BTRFS_MAX_INPUT)); max_size = max_t(size_t, max_size, level_size); zstd_ws_mem_sizes[level - 1] = max_size; @@ -389,13 +389,23 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, *total_in = 0; /* Initialize the stream */ - stream = ZSTD_initCStream(params, len, workspace->mem, - workspace->size); + stream = ZSTD_initStaticCStream(workspace->mem, workspace->size); if (!stream) { - pr_warn("BTRFS: ZSTD_initCStream failed\n"); + pr_warn("BTRFS: ZSTD_initStaticCStream failed\n"); ret = -EIO; goto out; } + { + size_t ret2; + + ret2 = ZSTD_initCStream_advanced(stream, NULL, 0, params, len); + if (ZSTD_isError(ret2)) { + pr_warn("BTRFS: ZSTD_initCStream_advanced returned %s\n", + ZSTD_getErrorName(ret2)); + ret = -EIO; + goto out; + } + } Please don't do this, you can just add size_t ret2 at the top and not put this in a block. Other than that the code looks fine, once you fix that you can add Reviewed-by: Josef Bacik Thanks, Josef
Re: [btrfs] 96bed17ad9: fio.write_iops -59.7% regression
On 11/4/20 1:16 AM, kernel test robot wrote: Greeting, FYI, we noticed a -59.7% regression of fio.write_iops due to commit: commit: 96bed17ad9d425ff6958a2e6f87179453a3d76f2 ("btrfs: simplify the logic in need_preemptive_flushing") https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master in testcase: fio-basic on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory with following parameters: disk: 1SSD fs: btrfs runtime: 300s nr_task: 8 rw: write bs: 4k ioengine: sync test_size: 256g cpufreq_governor: performance ucode: 0x4002f01 test-description: Fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user. test-url: https://github.com/axboe/fio I generally ignore these reports, but since it's FIO I figured at least the test itself was valid. However once again I'm unable to reproduce the results linus master: task_0: (groupid=0, jobs=8): err= 0: pid=38586: Wed Nov 4 08:13:36 2020 write: IOPS=168k, BW=655MiB/s (687MB/s)(192GiB/31msec); 0 zone resets clat (usec): min=26, max=786, avg=47.15, stdev= 7.21 lat (usec): min=26, max=786, avg=47.21, stdev= 7.21 clat percentiles (nsec): | 1.00th=[31872], 5.00th=[35584], 10.00th=[37632], 20.00th=[40704], | 30.00th=[43264], 40.00th=[45312], 50.00th=[47360], 60.00th=[48896], | 70.00th=[50944], 80.00th=[52992], 90.00th=[56064], 95.00th=[59136], | 99.00th=[65280], 99.50th=[68096], 99.90th=[74240], 99.95th=[77312], | 99.99th=[88576] bw ( KiB/s): min=63752, max=112864, per=12.50%, avg=83810.53, stdev=3403.48, samples=4792 iops: min=15938, max=28216, avg=20952.61, stdev=850.87, samples=4792 lat (usec) : 50=65.73%, 100=34.27%, 250=0.01%, 500=0.01%, 750=0.01% lat (usec) : 1000=0.01% cpu : usr=2.22%, sys=97.77%, ctx=5054, majf=0, minf=63 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0% issued rwts: total=0,50298940,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: bw=655MiB/s (687MB/s), 655MiB/s-655MiB/s (687MB/s-687MB/s), io=192GiB (206GB), run=31-31msec kdave/for-next-20201104 task_0: (groupid=0, jobs=8): err= 0: pid=6652: Wed Nov 4 08:41:52 2020 write: IOPS=180k, BW=705MiB/s (739MB/s)(207GiB/31msec); 0 zone resets clat (usec): min=17, max=10603, avg=43.91, stdev= 9.62 lat (usec): min=17, max=10603, avg=43.98, stdev= 9.62 clat percentiles (nsec): | 1.00th=[25984], 5.00th=[31104], 10.00th=[33536], 20.00th=[37120], | 30.00th=[39168], 40.00th=[41216], 50.00th=[43264], 60.00th=[45824], | 70.00th=[47872], 80.00th=[50944], 90.00th=[54528], 95.00th=[57600], | 99.00th=[64768], 99.50th=[68096], 99.90th=[74240], 99.95th=[78336], | 99.99th=[90624] bw ( KiB/s): min=66760, max=123160, per=12.50%, avg=90221.11, stdev=9052.52, samples=4792 iops: min=16690, max=30790, avg=22555.24, stdev=2263.14, samples=4792 lat (usec) : 20=0.01%, 50=77.24%, 100=22.75%, 250=0.01%, 500=0.01% lat (usec) : 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 20=0.01% cpu : usr=1.67%, sys=98.31%, ctx=4806, majf=0, minf=68 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,54134917,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: bw=705MiB/s (739MB/s), 705MiB/s-705MiB/s (739MB/s-739MB/s), io=207GiB (222GB), run=31-31msec So instead of -60% iops regression, I'm seeing a 7% iops improvement. The only difference is that my machine doesn't have 192 threads, it has 80. Thanks, Josef
Re: [PATCH] fs: fix NULL dereference due to data race in prepend_path()
On 10/14/20 4:45 PM, Andrii Nakryiko wrote: Fix data race in prepend_path() with re-reading mnt->mnt_ns twice without holding the lock. is_mounted() does check for NULL, but is_anon_ns(mnt->mnt_ns) might re-read the pointer again which could be NULL already, if in between reads one of kern_unmount()/kern_unmount_array()/umount_tree() sets mnt->mnt_ns to NULL. This is seen in production with the following stack trace: [22942.418012] BUG: kernel NULL pointer dereference, address: 0048 ... [22942.976884] RIP: 0010:prepend_path.isra.4+0x1ce/0x2e0 [22943.037706] Code: 89 c6 e9 0d ff ff ff 49 8b 85 c0 00 00 00 48 85 c0 0f 84 9d 00 00 00 48 3d 00 f0 ff ff 0f 87 91 00 00 00 49 8b 86 e0 00 00 00 <48> 83 78 48 00 0f 94 c0 0f b6 c0 83 c0 01 e9 3b ff ff ff 39 0d 29 [22943.264141] RSP: 0018:c90020d6fd98 EFLAGS: 00010283 [22943.327058] RAX: RBX: RCX: 0007e5ee [22943.413041] RDX: 889fb56ac0c0 RSI: ffd05dc8147e RDI: 88b1f845ab7b [22943.499015] RBP: 889fbf8100c0 R08: c90020d6fe30 R09: c90020d6fe2c [22943.584992] R10: c90020d6fe2c R11: ea00095836c0 R12: c90020d6fe30 [22943.670968] R13: 88b7d336bea0 R14: 88b7d336be80 R15: 88aeb78db980 [22943.756944] FS: 7f228447e980() GS:889fc00c() knlGS: [22943.854448] CS: 0010 DS: ES: CR0: 80050033 [22943.923653] CR2: 0048 CR3: 001ed235e001 CR4: 007606e0 [22944.009630] DR0: DR1: DR2: [22944.095604] DR3: DR6: fffe0ff0 DR7: 0400 [22944.181581] PKRU: 5554 [22944.214100] Call Trace: [22944.243485] d_path+0xe6/0x150 [22944.280202] proc_pid_readlink+0x8f/0x100 [22944.328449] vfs_readlink+0xf8/0x110 [22944.371456] ? 
touch_atime+0x33/0xd0 [22944.414466] do_readlinkat+0xfd/0x120 [22944.458522] __x64_sys_readlinkat+0x1a/0x20 [22944.508868] do_syscall_64+0x42/0x110 [22944.552928] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: f2683bd8d5bd ("[PATCH] fix d_absolute_path() interplay with fsmount()") Signed-off-by: Andrii Nakryiko --- fs/d_path.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/d_path.c b/fs/d_path.c index 0f1fc1743302..a69e2cd36e6e 100644 --- a/fs/d_path.c +++ b/fs/d_path.c @@ -102,6 +102,8 @@ static int prepend_path(const struct path *path, if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { struct mount *parent = READ_ONCE(mnt->mnt_parent); + struct mnt_namespace *mnt_ns; + /* Escaped? */ if (dentry != vfsmnt->mnt_root) { bptr = *buffer; @@ -116,7 +118,9 @@ static int prepend_path(const struct path *path, vfsmnt = &mnt->mnt; continue; } - if (is_mounted(vfsmnt) && !is_anon_ns(mnt->mnt_ns)) + mnt_ns = READ_ONCE(mnt->mnt_ns); + /* open-coded is_mounted() to use local mnt_ns */ + if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns)) error = 1; // absolute root else I had to go look at this code carefully to make sure that mnt == real_mount(vfsmnt), which it does. I was also afraid that if we could have mnt->mnt_ns change in between checks that we were just trading a possible NULL deref with a UAF, but we're under RCU here so we're good there as well. You can add Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] btrfs: ref-verify: Fix memleak in add_extent_data_ref
On 8/27/20 3:43 AM, Dinghao Liu wrote: When lookup_root_entry() fails, ref should be freed just like when insert_ref_entry() fails. Signed-off-by: Dinghao Liu Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 6/7] btrfs: Promote to unsigned long long before shifting
On 10/4/20 2:04 PM, Matthew Wilcox (Oracle) wrote: On 32-bit systems, this shift will overflow for files larger than 4GB. Cc: sta...@vger.kernel.org Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6") Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 5/7] btrfs: Promote to unsigned long long before shifting
On 10/4/20 2:04 PM, Matthew Wilcox (Oracle) wrote: On 32-bit systems, this shift will overflow for files larger than 4GB. Cc: sta...@vger.kernel.org Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release") Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] btrfs: fix memdup.cocci warnings
On 9/22/20 6:21 AM, Julia Lawall wrote: From: kernel test robot fs/btrfs/send.c:3854:8-15: WARNING opportunity for kmemdup Use kmemdup rather than duplicating its implementation Generated by: scripts/coccinelle/api/memdup.cocci Fixes: 28314eb24e6c ("btrfs: send, recompute reference path after orphanization of a directory") Signed-off-by: kernel test robot Signed-off-by: Julia Lawall Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] btrfs: Fix potential null pointer deref
On 9/21/20 3:12 PM, Alex Dewar wrote: In btrfs_destroy_inode(), the variable root may be NULL, but the check for this takes place after its value has already been dereferenced to access its fs_info member. Move the dereference operation to later in the function. Fixes: a6dbd429d8dd ("Btrfs: fix panic when trying to destroy a newly allocated") Addresses-Coverity: CID 1497103: Null pointer dereferences (REVERSE_INULL) Signed-off-by: Alex Dewar Reviewed-by: Josef Bacik Thanks, Josef
Re: remove revalidate_disk()
On 9/1/20 11:57 AM, Christoph Hellwig wrote: Hi Jens, this series removes the revalidate_disk() function, which has been a really odd duck in the last years. The prime reason why most people use it is because it propagates a size change from the gendisk to the block_device structure. But it also calls into the rather ill defined ->revalidate_disk method which is rather useless for the callers. So this adds a new helper to just propagate the size, and cleans up all kinds of mess around this area. Follow on patches will eventually kill off ->revalidate_disk entirely, but there are a lot more patches needed for that. I applied and built everything on Jens's for-next, patch #2 was fuzzy but it applied. I checked through everything, the only thing that was strange to me is not calling revalidate_disk_size() in nvdimm, but since it's during attach you point out it doesn't matter. You can add Reviewed-by: Josef Bacik To the series, thanks, Josef
Re: [PATCH] btrfs: check return value of filemap_fdatawrite_range()
On 8/21/20 8:41 AM, Alex Dewar wrote: In btrfs_dio_iomap_begin(), filemap_fdatawrite_range() is called without checking the return value. Add a check to catch errors. Fixes: c0aaf9b7a114f ("btrfs: switch to iomap_dio_rw() for dio") Addresses-Coverity: ("Unused value") Signed-off-by: Alex Dewar --- fs/btrfs/inode.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7b57aaa1f9acc..38fde20b4a81b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7347,9 +7347,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, * outstanding dirty pages are on disk. */ if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, -&BTRFS_I(inode)->runtime_flags)) +&BTRFS_I(inode)->runtime_flags)) { ret = filemap_fdatawrite_range(inode->i_mapping, start, start + length - 1); + if (ret) + return ret; + } dio_data = kzalloc(sizeof(*dio_data), GFP_NOFS); if (!dio_data) Had to check to make sure there's no cleanup that's needed, there isn't, you can add Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH][v2] proc: use vmalloc for our kernel buffer
On 8/13/20 5:10 PM, David Laight wrote: From: Josef Bacik Sent: 13 August 2020 18:19 ... We wouldn't even need the extra +1 part, since we're only copying in how much the user wants anyway, we could just go ahead and convert this to left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data); and be fine, right? Or am I misunderstanding what you're looking for? Thanks, Doesn't that need to be scnprintf()? IIRC snprintf() returns the number of bytes that would have been written were the buffer infinite size? (I suspect this is an 'accidental' return value from the original SYSV? userspace implementation that just dumped characters that wouldn't fit in the buffer somewhere.) Yeah, if you look at the patches I just sent you'll notice I used scnprintf() everywhere. Thanks, Josef
[PATCH 5/6] parport: rework procfs handlers to take advantage of the new buffer
The buffer coming from higher up the stack has an extra byte to handle the NULL terminator in the string. Instead of using a temporary buffer to sprintf into and then copying into the buffer, just scnprintf directly into the buffer and update lenp as appropriate. Signed-off-by: Josef Bacik --- drivers/parport/procfs.c | 108 +-- 1 file changed, 36 insertions(+), 72 deletions(-) diff --git a/drivers/parport/procfs.c b/drivers/parport/procfs.c index d740eba3c099..453d035ad5f6 100644 --- a/drivers/parport/procfs.c +++ b/drivers/parport/procfs.c @@ -37,9 +37,8 @@ static int do_active_device(struct ctl_table *table, int write, void *result, size_t *lenp, loff_t *ppos) { struct parport *port = (struct parport *)table->extra1; - char buffer[256]; struct pardevice *dev; - int len = 0; + size_t ret = 0; if (write) /* can't happen anyway */ return -EACCES; @@ -48,24 +47,19 @@ static int do_active_device(struct ctl_table *table, int write, *lenp = 0; return 0; } - + for (dev = port->devices; dev ; dev = dev->next) { if(dev == port->cad) { - len += sprintf(buffer, "%s\n", dev->name); + ret += scnprintf(result + ret, *lenp - ret, "%s\n", +dev->name); } } - if(!len) { - len += sprintf(buffer, "%s\n", "none"); - } - - if (len > *lenp) - len = *lenp; - else - *lenp = len; + if (!ret) + ret = scnprintf(result, *lenp, "%s\n", "none"); - *ppos += len; - memcpy(result, buffer, len); + *lenp = ret; + *ppos += ret; return 0; } @@ -75,8 +69,7 @@ static int do_autoprobe(struct ctl_table *table, int write, { struct parport_device_info *info = table->extra2; const char *str; - char buffer[256]; - int len = 0; + size_t ret = 0; if (write) /* permissions stop this */ return -EACCES; @@ -85,30 +78,24 @@ static int do_autoprobe(struct ctl_table *table, int write, *lenp = 0; return 0; } - + if ((str = info->class_name) != NULL) - len += sprintf (buffer + len, "CLASS:%s;\n", str); + ret += scnprintf(result + ret, *lenp - ret, "CLASS:%s;\n", str); if ((str = info->model) != NULL) - len += sprintf 
(buffer + len, "MODEL:%s;\n", str); + ret += scnprintf(result + ret, *lenp - ret, "MODEL:%s;\n", str); if ((str = info->mfr) != NULL) - len += sprintf (buffer + len, "MANUFACTURER:%s;\n", str); + ret += scnprintf(result + ret, *lenp - ret, "MANUFACTURER:%s;\n", str); if ((str = info->description) != NULL) - len += sprintf (buffer + len, "DESCRIPTION:%s;\n", str); + ret += scnprintf(result + ret, *lenp - ret, "DESCRIPTION:%s;\n", str); if ((str = info->cmdset) != NULL) - len += sprintf (buffer + len, "COMMAND SET:%s;\n", str); - - if (len > *lenp) - len = *lenp; - else - *lenp = len; + ret += scnprintf(result + ret, *lenp - ret, "COMMAND SET:%s;\n", str); - *ppos += len; - - memcpy(result, buffer, len); + *lenp = ret; + *ppos += ret; return 0; } #endif /* IEEE1284.3 support. */ @@ -117,8 +104,7 @@ static int do_hardware_base_addr(struct ctl_table *table, int write, void *result, size_t *lenp, loff_t *ppos) { struct parport *port = (struct parport *)table->extra1; - char buffer[20]; - int len = 0; + size_t ret; if (*ppos) { *lenp = 0; @@ -128,15 +114,10 @@ static int do_hardware_base_addr(struct ctl_table *table, int write, if (write) /* permissions prevent this anyway */ return -EACCES; - len += sprintf (buffer, "%lu\t%lu\n", port->base, port->base_hi); - - if (len > *lenp) - len = *lenp; - else - *lenp = len; - - *ppos += len; - memcpy(result, buffer, len); + ret = scnprintf(result, *lenp, "%lu\t%lu\n", port->base, + port->base_hi); + *lenp = ret; + *ppos += ret; return 0; } @@ -144,8 +125,7 @@ static int do_hardware_irq(struct ctl_table *table, int write, void *result, size_t *lenp, loff_t *ppos) { struct parport *port = (struct parport *)table->
[PATCH 6/6] sunrpc: rework proc handlers to take advantage of the new buffer
Now that we're allocating an extra slot for the NULL terminated string, use scnprintf() and write directly into the buffer. Signed-off-by: Josef Bacik --- net/sunrpc/sysctl.c| 10 ++ net/sunrpc/xprtrdma/svc_rdma.c | 16 ++-- 2 files changed, 4 insertions(+), 22 deletions(-) diff --git a/net/sunrpc/sysctl.c b/net/sunrpc/sysctl.c index 999eee1ed61c..31ed530d9846 100644 --- a/net/sunrpc/sysctl.c +++ b/net/sunrpc/sysctl.c @@ -117,14 +117,8 @@ proc_dodebug(struct ctl_table *table, int write, void *buffer, size_t *lenp, if (strcmp(table->procname, "rpc_debug") == 0) rpc_show_tasks(_net); } else { - len = sprintf(tmpbuf, "0x%04x", *(unsigned int *) table->data); - if (len > left) - len = left; - memcpy(buffer, tmpbuf, len); - if ((left -= len) > 0) { - *((char *)buffer + len) = '\n'; - left--; - } + len = scnprintf(buffer, *lenp, "0x%04x\n", *(unsigned int *) table->data); + left -= len; } done: diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c index 526da5d4710b..9b3a113598af 100644 --- a/net/sunrpc/xprtrdma/svc_rdma.c +++ b/net/sunrpc/xprtrdma/svc_rdma.c @@ -90,20 +90,8 @@ static int read_reset_stat(struct ctl_table *table, int write, if (write) atomic_set(stat, 0); else { - char str_buf[32]; - int len = snprintf(str_buf, 32, "%d\n", atomic_read(stat)); - if (len >= 32) - return -EFAULT; - len = strlen(str_buf); - if (*ppos > len) { - *lenp = 0; - return 0; - } - len -= *ppos; - if (len > *lenp) - len = *lenp; - if (len) - memcpy(buffer, str_buf, len); + size_t len = scnprintf(buffer, *lenp, "%d\n", + atomic_read(stat)); *lenp = len; *ppos += len; } -- 2.24.1
[PATCH 4/6] sysctl: make proc_put_long() use scnprintf
Now that we're passing down a kernel buffer with enough space to account for an extra NULL terminator, go ahead and use scnprintf() to print out a long in proc_put_long(). count here includes NULL terminator slot in the buffer, so we will get the correct behavior we're looking for. Signed-off-by: Josef Bacik --- kernel/sysctl.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 287862f91717..d8cc8737f58f 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -484,6 +484,7 @@ static int proc_get_long(char **buf, size_t *size, return 0; } +#undef TMPBUFLEN /** * proc_put_long - converts an integer to a decimal ASCII formatted string @@ -498,18 +499,12 @@ static int proc_get_long(char **buf, size_t *size, */ static void proc_put_long(void **buf, size_t *size, unsigned long val, bool neg) { - int len; - char tmp[TMPBUFLEN], *p = tmp; + size_t ret; - sprintf(p, "%s%lu", neg ? "-" : "", val); - len = strlen(tmp); - if (len > *size) - len = *size; - memcpy(*buf, tmp, len); - *size -= len; - *buf += len; + ret = scnprintf(*buf, *size, "%s%lu", neg ? "-" : "", val); + *size -= ret; + *buf += ret; } -#undef TMPBUFLEN static void proc_put_char(void **buf, size_t *size, char c) { -- 2.24.1
[PATCH 0/6] Some buffer management fixes for proc
This initially started with [PATCH 1/6] proc: use vmalloc for our kernel buffer Which came about because we were getting page alloc failures when cat tried to do a read with a 64kib buffer, triggering an order 4 allocation. We need to switch to kvmalloc for this buffer to avoid these high order allocations. Then Christoph suggested renaming vmemdup_user to kvmemdup_user, so then we have this [PATCH 2/6] tree-wide: rename vmemdup_user to kvmemdup_user And then finally Viro noticed that if we allocate an extra byte for the NULL terminator then we can use scnprintf() in a few places, and thus the next set of patches [PATCH 3/6] proc: allocate count + 1 for our read buffer [PATCH 4/6] sysctl: make proc_put_long() use scnprintf [PATCH 5/6] parport: rework procfs handlers to take advantage of the [PATCH 6/6] sunrpc: rework proc handlers to take advantage of the new There's one case that I didn't convert, _proc_do_string, and that's because it's one of the few places that takes into account ppos, and so we'll skip forward in the string we're provided with from the caller. In this case it makes sense to just leave it the way it is. I'm pretty sure I caught all the other people who directly mess with the buffer, but there's around 800 ->proc_handler's, and my eyes started to glaze over after a while. Josef
[PATCH 1/6] proc: use vmalloc for our kernel buffer
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 27 +++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..8e19bad83b45 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = kvmemdup_user_nul(ubuf, count); if (IS_ERR(kbuf)) { error = PTR_ERR(kbuf); goto out; } } else { - kbuf = kzalloc(count, GFP_KERNEL); + kbuf = kvzalloc(count, GFP_KERNEL); if (!kbuf) goto out; } @@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, error = count; out_free_buf: - kfree(kbuf); + kvfree(kbuf); out: sysctl_head_finish(head); diff --git a/include/linux/string.h b/include/linux/string.h index 9b7a0632e87a..21bb6d3d88c4 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -12,6 +12,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); extern void *vmemdup_user(const void __user *, size_t); +extern void *kvmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); /* diff --git a/mm/util.c b/mm/util.c index 5ef378a2a038..cf454d57d3e2 100644 --- a/mm/util.c +++ b/mm/util.c @@ -208,6 +208,33 @@ void *vmemdup_user(const void 
__user *src, size_t len) } EXPORT_SYMBOL(vmemdup_user); +/** + * kvmemdup_user_nul - duplicate memory region from user space and NUL-terminate + * + * @src: source address in user space + * @len: number of bytes to copy + * + * Return: an ERR_PTR() on failure. Result may be not + * physically contiguous. Use kvfree() to free. + */ +void *kvmemdup_user_nul(const void __user *src, size_t len) +{ + char *p; + + p = kvmalloc(len + 1, GFP_USER); + if (!p) + return ERR_PTR(-ENOMEM); + + if (copy_from_user(p, src, len)) { + kvfree(p); + return ERR_PTR(-EFAULT); + } + p[len] = '\0'; + + return p; +} +EXPORT_SYMBOL(kvmemdup_user_nul); + /** * strndup_user - duplicate an existing string from user space * @s: The string to duplicate -- 2.24.1
[PATCH 3/6] proc: allocate count + 1 for our read buffer
Al suggested that if we allocate enough space to add in the '\0' character at the end of our strings, we could just use scnprintf() in our ->proc_handler functions without having to be fancy about keeping track of space. There are a lot of these handlers, so the follow ups will be separate, but start with allocating the extra byte to handle the null termination of strings. Signed-off-by: Josef Bacik --- fs/proc/proc_sysctl.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 8e19bad83b45..446e7a949025 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -548,6 +548,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, struct ctl_table *table = PROC_I(inode)->sysctl_entry; void *kbuf; ssize_t error; + size_t orig_count = count; if (IS_ERR(head)) return PTR_ERR(head); @@ -577,9 +578,23 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; } } else { - kbuf = kvzalloc(count, GFP_KERNEL); + /* +* To make our lives easier in ->proc_handler, we allocate an +* extra byte to allow us to use scnprintf() for handling the +* buffer output. This works properly because scnprintf() will +* only return the number of bytes that it was able to write +* out, _NOT_ including the NULL byte. This means the handlers +* will only ever return a maximum of count as what they've +* copied. +* +* HOWEVER, we do not assume that ->proc_handlers are without +* bugs, so further down we'll do an extra check to make sure +* that count isn't larger than the orig_count. 
+*/ + kbuf = kvzalloc(count + 1, GFP_KERNEL); if (!kbuf) goto out; + count += 1; } error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count, &ppos); @@ -593,6 +608,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out_free_buf; if (!write) { + /* +* This shouldn't happen, but those are the last words before +* somebody adds a security vulnerability, so just make sure +* that count isn't larger than orig_count. +*/ + if (count > orig_count) + count = orig_count; error = -EFAULT; if (copy_to_user(ubuf, kbuf, count)) goto out_free_buf; -- 2.24.1
[PATCH 2/6] tree-wide: rename vmemdup_user to kvmemdup_user
This helper uses kvmalloc, not vmalloc, so rename it to kvmemdup_user to make it clear we're using kvmalloc() and will need to use kvfree(). Signed-off-by: Josef Bacik --- arch/x86/kvm/cpuid.c | 6 +++--- drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 +- drivers/tty/vt/consolemap.c| 2 +- include/linux/string.h | 2 +- mm/util.c | 6 +++--- sound/core/control.c | 4 ++-- virt/kvm/kvm_main.c| 4 ++-- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 3fd6eec202d7..22834ea499ee 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -200,9 +200,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, if (cpuid->nent > KVM_MAX_CPUID_ENTRIES) goto out; if (cpuid->nent) { - cpuid_entries = vmemdup_user(entries, -array_size(sizeof(struct kvm_cpuid_entry), - cpuid->nent)); + cpuid_entries = kvmemdup_user(entries, + array_size(sizeof(struct kvm_cpuid_entry), +cpuid->nent)); if (IS_ERR(cpuid_entries)) { r = PTR_ERR(cpuid_entries); goto out; diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c index 7a2430e34e00..c2f973aa3680 100644 --- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c +++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c @@ -147,7 +147,7 @@ static int virtio_gpu_execbuffer_ioctl(struct drm_device *dev, void *data, bo_handles = NULL; } - buf = vmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size); + buf = kvmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size); if (IS_ERR(buf)) { ret = PTR_ERR(buf); goto out_unused_fd; diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c index 5947b54d92be..2cffa8b3c74b 100644 --- a/drivers/tty/vt/consolemap.c +++ b/drivers/tty/vt/consolemap.c @@ -542,7 +542,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list) if (!ct) return 0; - unilist = vmemdup_user(list, array_size(sizeof(struct unipair), ct)); + unilist = kvmemdup_user(list, array_size(sizeof(struct unipair), ct)); if 
(IS_ERR(unilist)) return PTR_ERR(unilist); diff --git a/include/linux/string.h b/include/linux/string.h index 21bb6d3d88c4..a6f7218124a0 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -11,7 +11,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); -extern void *vmemdup_user(const void __user *, size_t); +extern void *kvmemdup_user(const void __user *, size_t); extern void *kvmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); diff --git a/mm/util.c b/mm/util.c index cf454d57d3e2..f434634b6ba3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -183,7 +183,7 @@ void *memdup_user(const void __user *src, size_t len) EXPORT_SYMBOL(memdup_user); /** - * vmemdup_user - duplicate memory region from user space + * kvmemdup_user - duplicate memory region from user space * * @src: source address in user space * @len: number of bytes to copy @@ -191,7 +191,7 @@ EXPORT_SYMBOL(memdup_user); * Return: an ERR_PTR() on failure. Result may be not * physically contiguous. Use kvfree() to free. 
*/ -void *vmemdup_user(const void __user *src, size_t len) +void *kvmemdup_user(const void __user *src, size_t len) { void *p; @@ -206,7 +206,7 @@ void *vmemdup_user(const void __user *src, size_t len) return p; } -EXPORT_SYMBOL(vmemdup_user); +EXPORT_SYMBOL(kvmemdup_user); /** * kvmemdup_user_nul - duplicate memory region from user space and NUL-terminate diff --git a/sound/core/control.c b/sound/core/control.c index aa0c0cf182af..b712f4d261de 100644 --- a/sound/core/control.c +++ b/sound/core/control.c @@ -1297,7 +1297,7 @@ static int replace_user_tlv(struct snd_kcontrol *kctl, unsigned int __user *buf, if (size > 1024 * 128) /* sane value */ return -EINVAL; - container = vmemdup_user(buf, size); + container = kvmemdup_user(buf, size); if (IS_ERR(container)) return PTR_ERR(container); @@ -1365,7 +1365,7 @@ static int snd_ctl_elem_init_enum_names(struct user_element *ue) if (ue->info.value.enumerated.names_length > 64 * 1024) return -EINVAL; - names = vmemdup_user((const void __user *)user_ptrval, + names = kvmemdup_user((const void __user *)user_ptrval, ue
Re: [PATCH][v2] proc: use vmalloc for our kernel buffer
On 8/13/20 1:31 PM, Al Viro wrote: On Thu, Aug 13, 2020 at 01:19:18PM -0400, Josef Bacik wrote: in sunrpc proc_dodebug() turns into left -= snprintf(buffer, left, "0x%04x\n", left + 1, that is. *(unsigned int *) table->data); and that's not the only example. We wouldn't even need the extra +1 part, since we're only copying in how much the user wants anyway, we could just go ahead and convert this to left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data); and be fine, right? Or am I misunderstanding what you're looking for? Thanks, snprintf() always produces a NUL-terminated string. And if you are passing 7 as len, you want 0xf0ad\n to be copied to user. For that you need 8 passed to snprintf, and 8-byte buffer given to it. Right, gotcha. I'll rig that up and see how it looks. I'd recommend looking through what I do with a fine tooth comb, I'm obviously not batting 1000 today. Thanks, Josef
Re: [PATCH][v2] proc: use vmalloc for our kernel buffer
On 8/13/20 12:20 PM, Al Viro wrote: On Thu, Aug 13, 2020 at 05:41:17PM +0200, Christoph Hellwig wrote: On Thu, Aug 13, 2020 at 11:40:00AM -0400, Josef Bacik wrote: On 8/13/20 11:37 AM, Christoph Hellwig wrote: On Thu, Aug 13, 2020 at 11:33:56AM -0400, Josef Bacik wrote: Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- v1->v2: - Make vmemdup_user_nul actually do the right thing...sorry about that. fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 27 +++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..207ac6e6e028 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = vmemdup_user_nul(ubuf, count); Given that this can also do a kmalloc and thus needs to be paired with kvfree shouldn't it be kvmemdup_user_nul? There's an existing vmemdup_user that does kvmalloc, so I followed the existing naming convention. Do you want me to change them both? Thanks, I personally would, and given that it only has a few users it might even be feasible. FWIW, how about following or combining that with "allocate count + 1 bytes on the read side"? Allows some nice cleanups - e.g. 
len = sprintf(tmpbuf, "0x%04x", *(unsigned int *) table->data); if (len > left) len = left; memcpy(buffer, tmpbuf, len); if ((left -= len) > 0) { *((char *)buffer + len) = '\n'; left--; } in sunrpc proc_dodebug() turns into left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data); and that's not the only example. We wouldn't even need the extra +1 part, since we're only copying in how much the user wants anyway, we could just go ahead and convert this to left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data); and be fine, right? Or am I misunderstanding what you're looking for? Thanks, Josef
Re: [PATCH] proc: use vmalloc for our kernel buffer
On 8/13/20 12:19 PM, David Laight wrote: From: Josef Bacik Sent: 13 August 2020 15:53 Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. What happens if I run 'dd bs=16M ...' ? David

	/* don't even try if the size is too large */
	error = -ENOMEM;
	if (count >= KMALLOC_MAX_SIZE)
		goto out;

is above this code. Thanks, Josef
[PATCH 2/2] tree-wide: rename vmemdup_user to kvmemdup_user
This helper uses kvmalloc, not vmalloc, so rename it to kvmemdup_user to make it clear we're using kvmalloc() and will need to use kvfree(). Signed-off-by: Josef Bacik --- arch/x86/kvm/cpuid.c | 6 +++--- drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 +- drivers/tty/vt/consolemap.c| 2 +- include/linux/string.h | 2 +- mm/util.c | 6 +++--- sound/core/control.c | 4 ++-- virt/kvm/kvm_main.c| 4 ++-- 7 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 3fd6eec202d7..22834ea499ee 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -200,9 +200,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, if (cpuid->nent > KVM_MAX_CPUID_ENTRIES) goto out; if (cpuid->nent) { - cpuid_entries = vmemdup_user(entries, -array_size(sizeof(struct kvm_cpuid_entry), - cpuid->nent)); + cpuid_entries = kvmemdup_user(entries, + array_size(sizeof(struct kvm_cpuid_entry), +cpuid->nent)); if (IS_ERR(cpuid_entries)) { r = PTR_ERR(cpuid_entries); goto out; diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c index 7a2430e34e00..c2f973aa3680 100644 --- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c +++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c @@ -147,7 +147,7 @@ static int virtio_gpu_execbuffer_ioctl(struct drm_device *dev, void *data, bo_handles = NULL; } - buf = vmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size); + buf = kvmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size); if (IS_ERR(buf)) { ret = PTR_ERR(buf); goto out_unused_fd; diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c index 5947b54d92be..2cffa8b3c74b 100644 --- a/drivers/tty/vt/consolemap.c +++ b/drivers/tty/vt/consolemap.c @@ -542,7 +542,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list) if (!ct) return 0; - unilist = vmemdup_user(list, array_size(sizeof(struct unipair), ct)); + unilist = kvmemdup_user(list, array_size(sizeof(struct unipair), ct)); if 
(IS_ERR(unilist)) return PTR_ERR(unilist); diff --git a/include/linux/string.h b/include/linux/string.h index 21bb6d3d88c4..a6f7218124a0 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -11,7 +11,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); -extern void *vmemdup_user(const void __user *, size_t); +extern void *kvmemdup_user(const void __user *, size_t); extern void *kvmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); diff --git a/mm/util.c b/mm/util.c index cf454d57d3e2..f434634b6ba3 100644 --- a/mm/util.c +++ b/mm/util.c @@ -183,7 +183,7 @@ void *memdup_user(const void __user *src, size_t len) EXPORT_SYMBOL(memdup_user); /** - * vmemdup_user - duplicate memory region from user space + * kvmemdup_user - duplicate memory region from user space * * @src: source address in user space * @len: number of bytes to copy @@ -191,7 +191,7 @@ EXPORT_SYMBOL(memdup_user); * Return: an ERR_PTR() on failure. Result may be not * physically contiguous. Use kvfree() to free. 
*/ -void *vmemdup_user(const void __user *src, size_t len) +void *kvmemdup_user(const void __user *src, size_t len) { void *p; @@ -206,7 +206,7 @@ void *vmemdup_user(const void __user *src, size_t len) return p; } -EXPORT_SYMBOL(vmemdup_user); +EXPORT_SYMBOL(kvmemdup_user); /** * kvmemdup_user_nul - duplicate memory region from user space and NUL-terminate diff --git a/sound/core/control.c b/sound/core/control.c index aa0c0cf182af..b712f4d261de 100644 --- a/sound/core/control.c +++ b/sound/core/control.c @@ -1297,7 +1297,7 @@ static int replace_user_tlv(struct snd_kcontrol *kctl, unsigned int __user *buf, if (size > 1024 * 128) /* sane value */ return -EINVAL; - container = vmemdup_user(buf, size); + container = kvmemdup_user(buf, size); if (IS_ERR(container)) return PTR_ERR(container); @@ -1365,7 +1365,7 @@ static int snd_ctl_elem_init_enum_names(struct user_element *ue) if (ue->info.value.enumerated.names_length > 64 * 1024) return -EINVAL; - names = vmemdup_user((const void __user *)user_ptrval, + names = kvmemdup_user((const void __user *)user_ptrval, ue
[PATCH 1/2][v3] proc: use vmalloc for our kernel buffer
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- v2->v3: - Rename vmemdup_user_nul to kvmemdup_user_nul. v1->v2: - Make vmemdup_user_nul actually do the right thing...sorry about that. fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 27 +++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..8e19bad83b45 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = kvmemdup_user_nul(ubuf, count); if (IS_ERR(kbuf)) { error = PTR_ERR(kbuf); goto out; } } else { - kbuf = kzalloc(count, GFP_KERNEL); + kbuf = kvzalloc(count, GFP_KERNEL); if (!kbuf) goto out; } @@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, error = count; out_free_buf: - kfree(kbuf); + kvfree(kbuf); out: sysctl_head_finish(head); diff --git a/include/linux/string.h b/include/linux/string.h index 9b7a0632e87a..21bb6d3d88c4 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -12,6 +12,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); extern void *vmemdup_user(const void __user *, size_t); +extern void *kvmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); /* diff --git a/mm/util.c 
b/mm/util.c index 5ef378a2a038..cf454d57d3e2 100644 --- a/mm/util.c +++ b/mm/util.c @@ -208,6 +208,33 @@ void *vmemdup_user(const void __user *src, size_t len) } EXPORT_SYMBOL(vmemdup_user); +/** + * kvmemdup_user_nul - duplicate memory region from user space and NUL-terminate + * + * @src: source address in user space + * @len: number of bytes to copy + * + * Return: an ERR_PTR() on failure. Result may be not + * physically contiguous. Use kvfree() to free. + */ +void *kvmemdup_user_nul(const void __user *src, size_t len) +{ + char *p; + + p = kvmalloc(len + 1, GFP_USER); + if (!p) + return ERR_PTR(-ENOMEM); + + if (copy_from_user(p, src, len)) { + kvfree(p); + return ERR_PTR(-EFAULT); + } + p[len] = '\0'; + + return p; +} +EXPORT_SYMBOL(kvmemdup_user_nul); + /** * strndup_user - duplicate an existing string from user space * @s: The string to duplicate -- 2.24.1
Re: [PATCH][v2] proc: use vmalloc for our kernel buffer
On 8/13/20 11:37 AM, Christoph Hellwig wrote: On Thu, Aug 13, 2020 at 11:33:56AM -0400, Josef Bacik wrote: Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- v1->v2: - Make vmemdup_user_nul actually do the right thing...sorry about that. fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 27 +++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..207ac6e6e028 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = vmemdup_user_nul(ubuf, count); Given that this can also do a kmalloc and thus needs to be paired with kvfree shouldn't it be kvmemdup_user_nul? There's an existing vmemdup_user that does kvmalloc, so I followed the existing naming convention. Do you want me to change them both? Thanks, Josef
[PATCH][v2] proc: use vmalloc for our kernel buffer
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- v1->v2: - Make vmemdup_user_nul actually do the right thing...sorry about that. fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 27 +++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..207ac6e6e028 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = vmemdup_user_nul(ubuf, count); if (IS_ERR(kbuf)) { error = PTR_ERR(kbuf); goto out; } } else { - kbuf = kzalloc(count, GFP_KERNEL); + kbuf = kvzalloc(count, GFP_KERNEL); if (!kbuf) goto out; } @@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, error = count; out_free_buf: - kfree(kbuf); + kvfree(kbuf); out: sysctl_head_finish(head); diff --git a/include/linux/string.h b/include/linux/string.h index 9b7a0632e87a..aee3689fb865 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -12,6 +12,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); extern void *vmemdup_user(const void __user *, size_t); +extern void *vmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); /* diff --git a/mm/util.c b/mm/util.c index 5ef378a2a038..9d0ad7aafc27 100644 --- 
a/mm/util.c +++ b/mm/util.c @@ -208,6 +208,33 @@ void *vmemdup_user(const void __user *src, size_t len) } EXPORT_SYMBOL(vmemdup_user); +/** + * vmemdup_user_nul - duplicate memory region from user space and NUL-terminate + * + * @src: source address in user space + * @len: number of bytes to copy + * + * Return: an ERR_PTR() on failure. Result may be not + * physically contiguous. Use kvfree() to free. + */ +void *vmemdup_user_nul(const void __user *src, size_t len) +{ + char *p; + + p = kvmalloc(len + 1, GFP_USER); + if (!p) + return ERR_PTR(-ENOMEM); + + if (copy_from_user(p, src, len)) { + kvfree(p); + return ERR_PTR(-EFAULT); + } + p[len] = '\0'; + + return p; +} +EXPORT_SYMBOL(vmemdup_user_nul); + /** * strndup_user - duplicate an existing string from user space * @s: The string to duplicate -- 2.24.1
Re: [PATCH] proc: use vmalloc for our kernel buffer
On 8/13/20 10:59 AM, Matthew Wilcox wrote: On Thu, Aug 13, 2020 at 10:53:05AM -0400, Josef Bacik wrote: +/** + * vmemdup_user - duplicate memory region from user space and NUL-terminate vmemdup_user_nul() +void *vmemdup_user_nul(const void __user *src, size_t len) +{ + void *p; + + p = kvmalloc(len, GFP_USER); len+1, shirley? + if (!p) + return ERR_PTR(-ENOMEM); + + if (copy_from_user(p, src, len)) { + kvfree(p); + return ERR_PTR(-EFAULT); + } I think you forgot p[len] = '\0'; Sweet lord I need more sleep, my bad. Thanks, Josef
[PATCH] proc: use vmalloc for our kernel buffer
Since sysctl: pass kernel pointers to ->proc_handler we have been pre-allocating a buffer to copy the data from the proc handlers into, and then copying that to userspace. The problem is this just blind kmalloc()'s the buffer size passed in from the read, which in the case of our 'cat' binary was 64kib. Order-4 allocations are not awesome, and since we can potentially allocate up to our maximum order, use vmalloc for these buffers. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: Josef Bacik --- fs/proc/proc_sysctl.c | 6 +++--- include/linux/string.h | 1 + mm/util.c | 26 ++ 3 files changed, 30 insertions(+), 3 deletions(-) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 6c1166ccdaea..207ac6e6e028 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, goto out; if (write) { - kbuf = memdup_user_nul(ubuf, count); + kbuf = vmemdup_user_nul(ubuf, count); if (IS_ERR(kbuf)) { error = PTR_ERR(kbuf); goto out; } } else { - kbuf = kzalloc(count, GFP_KERNEL); + kbuf = kvzalloc(count, GFP_KERNEL); if (!kbuf) goto out; } @@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf, error = count; out_free_buf: - kfree(kbuf); + kvfree(kbuf); out: sysctl_head_finish(head); diff --git a/include/linux/string.h b/include/linux/string.h index 9b7a0632e87a..aee3689fb865 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -12,6 +12,7 @@ extern char *strndup_user(const char __user *, long); extern void *memdup_user(const void __user *, size_t); extern void *vmemdup_user(const void __user *, size_t); +extern void *vmemdup_user_nul(const void __user *, size_t); extern void *memdup_user_nul(const void __user *, size_t); /* diff --git a/mm/util.c b/mm/util.c index 5ef378a2a038..4de3b4b0f358 100644 --- a/mm/util.c +++ b/mm/util.c @@ -208,6 +208,32 @@ void *vmemdup_user(const void __user 
*src, size_t len) } EXPORT_SYMBOL(vmemdup_user); +/** + * vmemdup_user - duplicate memory region from user space and NUL-terminate + * + * @src: source address in user space + * @len: number of bytes to copy + * + * Return: an ERR_PTR() on failure. Result may be not + * physically contiguous. Use kvfree() to free. + */ +void *vmemdup_user_nul(const void __user *src, size_t len) +{ + void *p; + + p = kvmalloc(len, GFP_USER); + if (!p) + return ERR_PTR(-ENOMEM); + + if (copy_from_user(p, src, len)) { + kvfree(p); + return ERR_PTR(-EFAULT); + } + + return p; +} +EXPORT_SYMBOL(vmemdup_user_nul); + /** * strndup_user - duplicate an existing string from user space * @s: The string to duplicate -- 2.24.1
Re: Inverted mount options completely broken (iversion,relatime)
On 7/29/20 2:41 PM, Eric Sandeen wrote: On 7/29/20 11:32 AM, Josef Bacik wrote: Hello, Eric reported a problem to me where we were clearing SB_I_VERSION on remount of a btrfs file system. After digging through I discovered it's because we expect the proper flags that we want to be passed in via the mount() syscall, and because we didn't have "iversion" in our show_options entry the mount binary (from util-linux) wasn't setting MS_I_VERSION for the remount, and thus the VFS was clearing SB_I_VERSION from our s_flags. No big deal, I'll fix show_mount. Except Eric then noticed that mount -o noiversion didn't do anything, we still get iversion set. That's because btrfs just defaults to having SB_I_VERSION set. Furthermore -o noiversion doesn't get sent into mount, it's handled by the mount binary itself, and it does this by not having MS_I_VERSION set in the mount flags. This was beaten^Wdiscussed to death in an earlier thread, [PATCH] fs: i_version mntopt gets visible through /proc/mounts https://lore.kernel.org/linux-fsdevel/20200616202123.12656-1-msys.miz...@gmail.com/ tl;dr: hch doesn't think [no]iversion should be exposed as an option /at all/ so exposing it in /proc/mounts in show_mnt_opts for mount(8)'s benefit was nacked. This happens as well for -o relatime, it's the default and so if you do mount -o norelatime it won't do anything, you still get relatime behavior. I think that's a different issue. The only time this changes is if you do mount -o remount,norelatime. Hm, not on xfs:

# mount -o loop,norelatime xfsfile mnt
# grep loop /proc/mounts
/dev/loop0 /tmp/mnt xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
# mount -o remount,norelatime mnt
# grep loop /proc/mounts
/dev/loop0 /tmp/mnt xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

Oops, you're right, I'm blind. Same happens for btrfs, so using -o norelatime simply does nothing because it's considered a kernel-wide default.
Are there other oddities besides iversion and relatime? It doesn't look like it; I checked a few others of the MS_INVERT variety, and these appear to be the only ones. I really don't want to have this discussion again in the future, though, when we introduce MS_SOME_NEW_AWESOME. Thanks, Josef
Inverted mount options completely broken (iversion,relatime)
Hello, Eric reported a problem to me where we were clearing SB_I_VERSION on remount of a btrfs file system. After digging through I discovered it's because we expect the proper flags that we want to be passed in via the mount() syscall, and because we didn't have "iversion" in our show_options entry the mount binary (from util-linux) wasn't setting MS_I_VERSION for the remount, and thus the VFS was clearing SB_I_VERSION from our s_flags. No big deal, I'll fix show_mount. Except Eric then noticed that mount -o noiversion didn't do anything, we still get iversion set. That's because btrfs just defaults to having SB_I_VERSION set. Furthermore -o noiversion doesn't get sent into mount, it's handled by the mount binary itself, and it does this by not having MS_I_VERSION set in the mount flags. This happens as well for -o relatime, it's the default and so if you do mount -o norelatime it won't do anything, you still get relatime behavior. The only time this changes is if you do mount -o remount,norelatime. So my question is, what do we do here? I know Christoph has the strong opinion that we just don't expose I_VERSION at all, which frankly I'm ok with. What I'm really asking, though, is what do we do with these weird inverted flags that we all just kind of ignore on mount? The current setup is just broken if we want to allow overriding the defaults at mount time. Are we ok with it just being broken? Are we ok with things like mount -o noiversion not working because the file system has decided that I_VERSION (or relatime) is the default, and we're going to ignore what the user asks for unless we're remounting? Thanks, Josef
[PATCH] ftrace: fix ftrace_trace_task return value
I was attempting to use pid filtering with function_graph, but it wasn't allowing anything to make it through. Turns out ftrace_trace_task returns false if ftrace_ignore_pid is non-empty, which isn't correct anymore. We're now setting it to FTRACE_PID_IGNORE if we need to ignore that pid, otherwise it's set to the pid (which is weird considering the name) or to FTRACE_PID_TRACE. Fix the check to check for != FTRACE_PID_IGNORE. With this we can now use function_graph with pid filtering.

Fixes: 717e3f5ebc82 ("ftrace: Make function trace pid filtering a bit more exact")
Signed-off-by: Josef Bacik
---
 kernel/trace/ftrace.c | 3 ---
 kernel/trace/trace.h  | 7 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 1903b80db6eb..7d879fae3777 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -139,9 +139,6 @@ static inline void ftrace_ops_init(struct ftrace_ops *ops)
 #endif
 }
 
-#define FTRACE_PID_IGNORE	-1
-#define FTRACE_PID_TRACE	-2
-
 static void ftrace_pid_func(unsigned long ip, unsigned long parent_ip,
 			    struct ftrace_ops *op, struct pt_regs *regs)
 {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 13db4000af3f..1531ec565cb5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1103,6 +1103,10 @@ print_graph_function_flags(struct trace_iterator *iter, u32 flags)
 extern struct list_head ftrace_pids;
 
 #ifdef CONFIG_FUNCTION_TRACER
+
+#define FTRACE_PID_IGNORE	-1
+#define FTRACE_PID_TRACE	-2
+
 struct ftrace_func_command {
 	struct list_head	list;
 	char			*name;
@@ -1114,7 +1118,8 @@ struct ftrace_func_command {
 extern bool ftrace_filter_param __initdata;
 static inline int ftrace_trace_task(struct trace_array *tr)
 {
-	return !this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid);
+	return this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid) !=
+		FTRACE_PID_IGNORE;
 }
 extern int ftrace_is_dead(void);
 int ftrace_create_function_files(struct trace_array *tr,
-- 
2.24.1
Re: [PATCH v2] btrfs: fix mount failure caused by race with umount
On 7/10/20 1:23 PM, Boris Burkov wrote: It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail:

CPU0                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW
                                        btrfs_mount_root
                                          btrfs_scan_one_device
                                            fs_devices = device->fs_devices
                                            fs_info->fs_devices = fs_devices
                                          // fs_devices->opened makes this a no-op
                                          btrfs_open_devices(fs_devices, mode, fs_type)
                                          s = sget(fs_type, test, set, flags, fs_info);
                                            find sb in s_instances
                                            grab_super(sb);
                                              down_write(&sb->s_umount); // blocks
      sop->put_super(sb)
      // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
return 0;
                                            retry lookup
                                            don't find sb in s_instances
                                              (deleted by CPU0)
                                            s = alloc_super
                                            return s;
                                          btrfs_fill_super(s, fs_devices, data)
                                            open_ctree
                                              // fs_devices total_rw_bytes improperly set!
                                              btrfs_read_chunk_tree
                                                read_one_dev // increment total_rw_bytes again!!
                                              super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we observe that if we have already filled the device, the state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can use that to avoid filling it a second time for no reason and, critically, avoid double counting in total_rw_bytes. One gotcha is that read_one_chunk also sets this bit, which happens before read_one_dev (in read_sys_array), so we must remove that setting of the bit as well, for the state bit to truly correspond to the device struct being filled from disk. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount.

for i in $(seq 0 500)
do
	dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo &
mount /mnt/foo

does the trick for me. A final note is that this fix actually breaks the fstest btrfs/163, but having investigated it, I believe that is due to a subtle flaw in how btrfs replace works when used on a seed device. The replace target device never gets a correct dev_item with the sprout fsid written out. This causes several problems, but for the sake of btrfs/163, read_one_chunk marking the device with IN_FS_METADATA was wallpapering over it, which this patch breaks. I will be sending a subsequent fix for the seed replace issue which will also fix btrfs/163. Signed-off-by: Boris Burkov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] btrfs: fix mount failure caused by race with umount
On 7/9/20 8:44 PM, Boris Burkov wrote: It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail:

CPU0                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW
                                        btrfs_mount_root
                                          btrfs_scan_one_device
                                            fs_devices = device->fs_devices
                                            fs_info->fs_devices = fs_devices
                                          // fs_devices->opened makes this a no-op
                                          btrfs_open_devices(fs_devices, mode, fs_type)
                                          s = sget(fs_type, test, set, flags, fs_info);
                                            find sb in s_instances
                                            grab_super(sb);
                                              down_write(&sb->s_umount); // blocks
      sop->put_super(sb)
      // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
return 0;
                                            retry lookup
                                            don't find sb in s_instances
                                              (deleted by CPU0)
                                            s = alloc_super
                                            return s;
                                          btrfs_fill_super(s, fs_devices, data)
                                            open_ctree
                                              // fs_devices total_rw_bytes improperly set!
                                              btrfs_read_chunk_tree
                                                read_one_dev // increment total_rw_bytes again!!
                                              super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we observe that if we have already filled the device, the state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can use that to avoid filling it a second time for no reason and, critically, avoid double counting in total_rw_bytes. One gotcha is that read_one_chunk also sets this bit, which happens before read_one_dev (in read_sys_array), so we must remove that setting of the bit as well, for the state bit to truly correspond to the device struct being filled from disk. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount.

for i in $(seq 0 500)
do
	dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo &
mount /mnt/foo

does the trick for me. A final note is that this fix actually breaks the fstest btrfs/163, but having investigated it, I believe that is due to a subtle flaw in how btrfs replace works when used on a seed device. The replace target device never gets a correct dev_item with the sprout fsid written out. This causes several problems, but for the sake of btrfs/163, read_one_chunk marking the device with IN_FS_METADATA was wallpapering over it, which this patch breaks. I will be sending a subsequent fix for the seed replace issue which will also fix btrfs/163.

Signed-off-by: Boris Burkov
---
 fs/btrfs/volumes.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..1d9bd1bbf893 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid, false);
 		}
-		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-			&(map->stripes[i].dev->dev_state));
-
 	}
 
 	write_lock(&map_tree->lock);
@@ -6815,6 +6812,15 @@ static int
Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
On 6/17/20 2:11 PM, Filipe Manana wrote: On Wed, Jun 17, 2020 at 6:43 PM Chris Mason wrote: On 17 Jun 2020, at 13:20, Filipe Manana wrote: On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov wrote: --- fs/btrfs/extent_io.c | 45 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c59e07360083..f6758ebbb6a2 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb, clear_bit(EXTENT_BUFFER_WRITE_ERR, >bflags); num_pages = num_extent_pages(eb); atomic_set(>io_pages, num_pages); + /* +* It is possible for releasepage to clear the TREE_REF bit before we +* set io_pages. See check_buffer_tree_ref for a more detailed comment. +*/ + check_buffer_tree_ref(eb); This is a whole different case from the one described in the changelog, as this is in the write path. Why do we need this one? This was Josef’s idea, but I really like the symmetry. You set io_pages, you do the tree_ref dance. Everyone fiddling with the write back bit right now correctly clears writeback after doing the atomic_dec on io_pages, but the race is tiny and prone to getting exposed again by shifting code around. Tree ref checks around io_pages are the most reliable way to prevent this bug from coming back again later. Ok, but that still doesn't answer my question. Is there an actual race/problem this hunk solves? Before calling write_one_eb() we increment the ref on the eb and we also call lock_extent_buffer_for_io(), which clears the dirty bit and sets the writeback bit on the eb while holding its ref_locks spin_lock. Even if we get to try_release_extent_buffer, it calls extent_buffer_under_io(eb) while holding the ref_locks spin_lock, so at any time it should return true, as either the dirty or the writeback bit is set. Is this purely a safety guard that is being introduced? 
Can we at least describe in the changelog why we are adding this hunk in the write path? All it mentions is a race between reading and releasing pages, there's nothing mentioned about races with writeback. I think maybe we make that bit a separate patch, and leave the panic fix on its own. I suggested this because I have lofty ideas of killing the refs_lock, and this would at least keep us consistent in our usage of TREE_REF to save us from freeing stuff that may be currently under IO. I'm super not happy with our reference handling coupled with releasepage, but these are the kind of hoops we're going to have to jump through until we have some better infrastructure in place to handle metadata. Thanks, Josef
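For readers following along, the invariant being discussed — take a TREE_REF-style reference whenever io_pages goes nonzero, so a racing releasepage can never free a buffer that is under IO — can be sketched in plain user-space C. The struct and function names below are simplified stand-ins for the kernel's extent_buffer machinery (which is protected by refs_lock and atomics), not the real implementation:

```c
#include <assert.h>
#include <stdbool.h>

struct eb {
    int  refs;       /* reference count; guarded by refs_lock in the kernel */
    int  io_pages;   /* pages still under IO */
    bool tree_ref;   /* stand-in for the TREE_REF bit */
};

/* Analog of check_buffer_tree_ref(): ensure the tree holds a reference,
 * so a concurrent release cannot drop the last ref while IO is pending. */
static void check_tree_ref(struct eb *eb)
{
    if (!eb->tree_ref) {
        eb->tree_ref = true;
        eb->refs++;
    }
}

/* The hunk under discussion: setting io_pages is always paired with
 * re-taking the tree reference. */
static void start_io(struct eb *eb, int num_pages)
{
    eb->io_pages = num_pages;
    check_tree_ref(eb);
}

/* Analog of try_release_extent_buffer(): refuse to free a buffer that is
 * under IO or still referenced by someone other than the caller. */
static bool try_release(struct eb *eb)
{
    if (eb->io_pages > 0 || eb->refs > 1)
        return false;
    return true;
}
```

Pairing the reference with io_pages itself, rather than relying on every caller to hold one, is what keeps the race from reappearing when the surrounding code gets shuffled.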
Re: [btrfs] 3ae92b3782: xfstests.generic.269.fail
On Tue, Sep 03, 2019 at 04:06:33PM +0800, kernel test robot wrote: > FYI, we noticed the following commit (built with gcc-7): > > commit: 3ae92b3782182d282a92573abe95c96d34ca6e73 ("btrfs: change the minimum > global reserve size") > https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git > master > > in testcase: xfstests > with following parameters: > > disk: 4HDD > fs: btrfs > test: generic-group13 > > test-description: xfstests is a regression test suite for xfs and other > filesystems. > test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G > > caused below changes (please refer to attached dmesg/kmsg for entire > log/backtrace): > > It would help if you could capture generic/269.full, but this is likely a problem with fsck that I fixed a few weeks ago where we're seeing nbytes of an inode is wrong, but there's an orphan item so it doesn't matter. This patch just made it more likely for us to have a still being iput'ed inode after a transaction commit. Thanks, Josef
Re: io.latency controller apparently not working
On Fri, Aug 16, 2019 at 12:57:41PM +0200, Paolo Valente wrote: > Hi, > I happened to test the io.latency controller, to make a comparison > between this controller and BFQ. But io.latency seems not to work, > i.e., not to reduce latency compared with what happens with no I/O > control at all. Here is a summary of the results for one of the > workloads I tested, on three different devices (latencies in ms): > >
>             no I/O control   io.latency   BFQ
> NVMe SSD    1.9              1.9          0.07
> SATA SSD    39               56           0.7
> HDD         4500             4500         11
>
> I have put all details on hardware, OS, scenarios and results in the > attached pdf. For your convenience, I'm pasting the source file too. > Do you have the fio jobs you use for this? I just tested on Jens's most recent tree and io.latency appears to be doing what it's supposed to be doing. We've also started testing 5.2 in production and it's still working in production as well. The only thing I've touched recently was around wakeups and shouldn't have broken anything. I'm not sure why it's not working for you, but a fio script will help me narrow down what's going on. Thanks, Josef
Re: [PATCH v2] nbd: replace kill_bdev() with __invalidate_device() again
On Wed, Jul 31, 2019 at 08:13:10PM +0800, SunKe wrote: > From: Munehisa Kamata > > Commit abbbdf12497d ("replace kill_bdev() with __invalidate_device()") > once did this, but 29eaadc03649 ("nbd: stop using the bdev everywhere") > resurrected kill_bdev() and it has been there since then. So buffer_head > mappings still get killed on a server disconnection, and we can still > hit the BUG_ON on a filesystem on the top of the nbd device. > > EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts: (null) > block nbd0: Receive control failed (result -32) > block nbd0: shutting down sockets > print_req_error: I/O error, dev nbd0, sector 66264 flags 3000 > EXT4-fs warning (device nbd0): htree_dirblock_to_tree:979: inode #2: lblock > 0: comm ls: error -5 reading directory block > print_req_error: I/O error, dev nbd0, sector 2264 flags 3000 > EXT4-fs error (device nbd0): __ext4_get_inode_loc:4690: inode #2: block > 283: comm ls: unable to read itable block > EXT4-fs error (device nbd0) in ext4_reserve_inode_write:5894: IO failure > [ cut here ] > kernel BUG at fs/buffer.c:3057! > invalid opcode: [#1] SMP PTI > CPU: 7 PID: 40045 Comm: jbd2/nbd0-8 Not tainted 5.1.0-rc3+ #4 > Hardware name: Amazon EC2 m5.12xlarge/, BIOS 1.0 10/16/2017 > RIP: 0010:submit_bh_wbc+0x18b/0x190 > ... > Call Trace: >jbd2_write_superblock+0xf1/0x230 [jbd2] >? account_entity_enqueue+0xc5/0xf0 >jbd2_journal_update_sb_log_tail+0x94/0xe0 [jbd2] >jbd2_journal_commit_transaction+0x12f/0x1d20 [jbd2] >? __switch_to_asm+0x40/0x70 >... >? lock_timer_base+0x67/0x80 >kjournald2+0x121/0x360 [jbd2] >? remove_wait_queue+0x60/0x60 >kthread+0xf8/0x130 >? commit_timeout+0x10/0x10 [jbd2] >? kthread_bind+0x10/0x10 >ret_from_fork+0x35/0x40 > > With __invalidate_device(), I no longer hit the BUG_ON with sync or > unmount on the disconnected device. > Jeeze I swear I see this same patch go by every 6 months or so, not sure what happens to it. Anyway Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 04/11] trace-cmd: add global functions for live tracing
Yup go for it, thanks, Josef Sent from my iPhone > On Jul 9, 2019, at 9:09 AM, Steven Rostedt wrote: > > On Fri, 20 Nov 2015 15:57:26 -0500 > Josef Bacik wrote: > >> We need a few functions to disable/enable tracing as well as add events to be >> enabled on the first instance, this patch turns a couple of these local >> functions into library functions. Thanks, > > Hi Josef, > > Not sure you still use this, as it's not really a library function > anymore. But we are currently cleaning up the trace-cmd code to create > a real library, and doing it in baby steps. The > tracecmd_enable_events() function is causing some issues and it was > added by you. Are you OK if we remove it. At least temporarily until we > separate out the "enabling" part into the library? > > Thanks! > > -- Steve > > >> >> Signed-off-by: Josef Bacik >> --- >> trace-cmd.h| 5 + >> trace-record.c | 45 +++-- >> 2 files changed, 32 insertions(+), 18 deletions(-) >> >> diff --git a/trace-cmd.h b/trace-cmd.h >> index b4fa7fd..9a9ca30 100644 >> --- a/trace-cmd.h >> +++ b/trace-cmd.h >> @@ -268,6 +268,11 @@ int tracecmd_start_recording(struct tracecmd_recorder >> *recorder, unsigned long s >> void tracecmd_stop_recording(struct tracecmd_recorder *recorder); >> void tracecmd_stat_cpu(struct trace_seq *s, int cpu); >> long tracecmd_flush_recording(struct tracecmd_recorder *recorder); >> +int tracecmd_add_event(const char *event_str, int stack); >> +void tracecmd_enable_events(void); >> +void tracecmd_disable_all_tracing(int disable_tracer); >> +void tracecmd_disable_tracing(void); >> +void tracecmd_enable_tracing(void); >> >> /* --- Plugin handling --- */ >> extern struct pevent_plugin_option trace_ftrace_options[]; >> diff --git a/trace-record.c b/trace-record.c >> index 417b701..7c471ab 100644 >> --- a/trace-record.c >> +++ b/trace-record.c >> @@ -841,7 +841,6 @@ static void update_ftrace_pids(int reset) >> >> static void update_event_filters(struct buffer_instance *instance); >> static void 
update_pid_event_filters(struct buffer_instance *instance); >> -static void enable_tracing(void); >> >> /** >> * make_pid_filter - create a filter string to all pids against @field >> @@ -1106,7 +1105,7 @@ static void run_cmd(enum trace_type type, int argc, >> char **argv) >>if (!pid) { >>/* child */ >>update_task_filter(); >> -enable_tracing(); >> +tracecmd_enable_tracing(); >>enable_ptrace(); >>/* >> * If we are using stderr for stdout, switch >> @@ -1795,7 +1794,7 @@ static int read_tracing_on(struct buffer_instance >> *instance) >>return ret; >> } >> >> -static void enable_tracing(void) >> +void tracecmd_enable_tracing(void) >> { >>struct buffer_instance *instance; >> >> @@ -1808,7 +1807,7 @@ static void enable_tracing(void) >>reset_max_latency(); >> } >> >> -static void disable_tracing(void) >> +void tracecmd_disable_tracing(void) >> { >>struct buffer_instance *instance; >> >> @@ -1816,9 +1815,9 @@ static void disable_tracing(void) >>write_tracing_on(instance, 0); >> } >> >> -static void disable_all(int disable_tracer) >> +void tracecmd_disable_all_tracing(int disable_tracer) >> { >> -disable_tracing(); >> +tracecmd_disable_tracing(); >> >>if (disable_tracer) { >>disable_func_stack_trace(); >> @@ -1991,6 +1990,11 @@ static void enable_events(struct buffer_instance >> *instance) >>} >> } >> >> +void tracecmd_enable_events(void) >> +{ >> +enable_events(first_instance); >> +} >> + >> static void set_clock(struct buffer_instance *instance) >> { >>char *path; >> @@ -3074,15 +3078,15 @@ static char *get_date_to_ts(void) >>} >> >>for (i = 0; i < date2ts_tries; i++) { >> -disable_tracing(); >> +tracecmd_disable_tracing(); >>clear_trace(); >> -enable_tracing(); >> +tracecmd_enable_tracing(); >> >>gettimeofday(, NULL); >>write(tfd, STAMP, 5); >>gettimeofday(, NULL); >> >> -disable_tracing(); >> +tracecmd_disable_tracing(); >>ts = find_time_stamp(pevent); >>if (!ts) >>continue; >> @@ -3699,6 +3703,11 @@ profile_a
Re: [PATCH 10/10] sched,fair: flatten hierarchical runqueues
On Fri, Jun 28, 2019 at 04:49:13PM -0400, Rik van Riel wrote: > Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue. > > Iteration of the sched_entity hierarchy is rate limited to once per jiffy > per sched_entity, which is a smaller change than it seems, because load > average adjustments were already rate limited to once per jiffy before this > patch series. > > This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks > from throttled cgroups onto their cgroup runqueues, and slowly (using the > GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup > gets unthrottled, to prevent thundering herd issues. > > Signed-off-by: Rik van Riel > --- > include/linux/sched.h | 2 + > kernel/sched/fair.c | 452 +++--- > kernel/sched/pelt.c | 6 +- > kernel/sched/pelt.h | 2 +- > kernel/sched/sched.h | 2 +- > 5 files changed, 171 insertions(+), 293 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 84a6cc6f5c47..901c710363e7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -453,6 +453,8 @@ struct sched_entity { > #ifdef CONFIG_FAIR_GROUP_SCHED > int depth; > unsigned long enqueued_h_load; > + unsigned long enqueued_h_weight; > + struct load_weight h_load; > struct sched_entity *parent; > /* rq on which this entity is (to be) queued: */ > struct cfs_rq *cfs_rq; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6fea8849cc12..c31d3da081fb 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -450,6 +450,19 @@ find_matching_se(struct sched_entity **se, struct > sched_entity **pse) > } > } > > +/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */ > +static void enqueue_entity_cfs_rqs(struct sched_entity *se) > +{ > + SCHED_WARN_ON(!entity_is_task(se)); > + > + for_each_sched_entity(se) { > + struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se); > + > + if (list_add_leaf_cfs_rq(cfs_rq)) > + break; > + } > +} > + > #else/* 
!CONFIG_FAIR_GROUP_SCHED */ > > static inline struct task_struct *task_of(struct sched_entity *se) > @@ -672,8 +685,14 @@ int sched_proc_update_handler(struct ctl_table *table, > int write, > */ > static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se) > { > - if (unlikely(se->load.weight != NICE_0_LOAD)) > + if (task_se_in_cgroup(se)) { > + unsigned long h_weight = task_se_h_weight(se); > + if (h_weight != se->h_load.weight) > + update_load_set(&se->h_load, h_weight); > + delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load); > + } else if (unlikely(se->load.weight != NICE_0_LOAD)) { > delta = __calc_delta(delta, NICE_0_LOAD, &se->load); > + } > > return delta; > } > @@ -687,22 +706,16 @@ static inline u64 calc_delta_fair(u64 delta, struct > sched_entity *se) > static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > u64 slice = sysctl_sched_latency; > + struct load_weight *load = &cfs_rq->load; > + struct load_weight lw; > > - for_each_sched_entity(se) { > - struct load_weight *load; > - struct load_weight lw; > - > - cfs_rq = cfs_rq_of(se); > - load = &cfs_rq->load; > + if (unlikely(!se->on_rq)) { > + lw = cfs_rq->load; > > - if (unlikely(!se->on_rq)) { > - lw = cfs_rq->load; > - > - update_load_add(&lw, se->load.weight); > - load = &lw; > - } > - slice = __calc_delta(slice, se->load.weight, load); > + update_load_add(&lw, task_se_h_weight(se)); > + load = &lw; > } > + slice = __calc_delta(slice, task_se_h_weight(se), load); > > /* > * To avoid cache thrashing, run at least sysctl_sched_min_granularity. 
> @@ -2703,16 +2716,28 @@ static inline void update_scan_period(struct > task_struct *p, int new_cpu) > static void > account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) > { > - update_load_add(&cfs_rq->load, se->load.weight); > - if (!parent_entity(se)) > + struct rq *rq; > + > + if (task_se_in_cgroup(se)) { > + struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se); > + unsigned long h_weight; > + > + update_load_add(&cgroup_rq->load, se->load.weight); > + cgroup_rq->nr_running++; > + > + /* Add the hierarchical weight to the CPU rq */ > + h_weight = task_se_h_weight(se); > + se->enqueued_h_weight = h_weight; > + update_load_add(&rq_of(cfs_rq)->load, h_weight); Ok I think this is where we need to handle the newly enqueued
Re: [PATCH 09/10] sched,fair: add helper functions for flattened runqueue
On Fri, Jun 28, 2019 at 04:49:12PM -0400, Rik van Riel wrote: > Add helper functions to make the flattened runqueue patch a little smaller. > > The task_se_h_weight function is similar to task_se_h_load, but scales the > task weight by the group weight, without taking the task's duty cycle into > account. > > The task_se_in_cgroup helper is functionally identical to parent_entity, > but directly calling a function with that name obscures what the other > code is trying to use it for, and would make the code harder to understand. > > Signed-off-by: Rik van Riel > --- > kernel/sched/fair.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a751e7a9b228..6fea8849cc12 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -243,6 +243,7 @@ static u64 __calc_delta(u64 delta_exec, unsigned long > weight, struct load_weight > > const struct sched_class fair_sched_class; > static unsigned long task_se_h_load(struct sched_entity *se); > +static unsigned long task_se_h_weight(struct sched_entity *se); > > /** > * CFS operations on generic schedulable entities: > @@ -411,6 +412,12 @@ static inline struct sched_entity *parent_entity(struct > sched_entity *se) > return se->parent; > } > > +/* Is this (task) sched_entity in a non-root cgroup? */ > +static inline bool task_se_in_cgroup(struct sched_entity *se) > +{ > + return parent_entity(se); > +} > + > static void > find_matching_se(struct sched_entity **se, struct sched_entity **pse) > { > @@ -7819,6 +7826,20 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) > } > } > > +static unsigned long task_se_h_weight(struct sched_entity *se) > +{ > + struct cfs_rq *cfs_rq; > + > + if (!task_se_in_cgroup(se)) > + return se->load.weight; > + > + cfs_rq = group_cfs_rq_of_parent(se); > + update_cfs_rq_h_load(cfs_rq); > + > + /* Reduce the load.weight by the h_load of the group the task is in. 
*/ > + return (cfs_rq->h_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT; This should be scale_load_down(cfs_rq->h_load * se->load.weight); Thanks, Josef
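Josef's point is to use the existing helper rather than open-code the shift; with the kernel's SCHED_FIXEDPOINT_SHIFT of 10 the two forms are arithmetically identical. A minimal user-space sketch of the fixed-point math (function names are illustrative simplifications of the kernel helpers):

```c
#include <assert.h>

#define SCHED_FIXEDPOINT_SHIFT 10   /* kernel value: fixed-point resolution of 1024 */

/* scale_load_down(): drop a high-resolution load value back down to
 * low-resolution task-weight units, as the kernel helper does. */
static unsigned long scale_load_down(unsigned long w)
{
    return w >> SCHED_FIXEDPOINT_SHIFT;
}

/* Core of task_se_h_weight(): scale the task's weight by the group's
 * hierarchical load, then reduce back to task-weight units. */
static unsigned long task_h_weight(unsigned long group_h_load,
                                   unsigned long task_weight)
{
    return scale_load_down(group_h_load * task_weight);
}
```

Since `x >> SCHED_FIXEDPOINT_SHIFT` is exactly what `scale_load_down()` expands to, the suggestion changes readability, not behavior.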
Re: [PATCH 02/10] sched: change /proc/sched_debug fields
On Fri, Jun 28, 2019 at 04:49:05PM -0400, Rik van Riel wrote: > Remove some fields from /proc/sched_debug that are removed from > sched_entity in a subsequent patch, and add h_load, which comes in > very handy to debug CPU controller weight distribution. > > Signed-off-by: Rik van Riel Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 01/10] sched: introduce task_se_h_load helper
On Fri, Jun 28, 2019 at 04:49:04PM -0400, Rik van Riel wrote: > Sometimes the hierarchical load of a sched_entity needs to be calculated. > Rename task_h_load to task_se_h_load, and directly pass a sched_entity to > that function. > > Move the function declaration up above where it will be used later. > > No functional changes. > > Signed-off-by: Rik van Riel Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 03/10] sched,fair: redefine runnable_load_avg as the sum of task_h_load
On Mon, Jul 01, 2019 at 12:47:35PM -0400, Rik van Riel wrote: > On Mon, 2019-07-01 at 12:29 -0400, Josef Bacik wrote: > > On Fri, Jun 28, 2019 at 04:49:06PM -0400, Rik van Riel wrote: > > > The runnable_load magic is used to quickly propagate information > > > about > > > runnable tasks up the hierarchy of runqueues. The runnable_load_avg > > > is > > > mostly used for the load balancing code, which only examines the > > > value at > > > the root cfs_rq. > > > > > > Redefine the root cfs_rq runnable_load_avg to be the sum of > > > task_h_loads > > > of the runnable tasks. This works because the hierarchical > > > runnable_load of > > > a task is already equal to the task_se_h_load today. This provides > > > enough > > > information to the load balancer. > > > > > > The runnable_load_avg of the cgroup cfs_rqs does not appear to be > > > used for anything, so don't bother calculating those. > > > > > > This removes one of the things that the code currently traverses > > > the > > > cgroup hierarchy for, and getting rid of it brings us one step > > > closer > > > to a flat runqueue for the CPU controller. > > > > > > > My memory on this stuff is very hazy, but IIRC we had the > > runnable_sum and the > > runnable_avg separated out because you could have the avg lag behind > > the sum. > > So like you enqueue a bunch of new entities who's avg may have > > decayed a bunch > > and so their overall load is not felt on the CPU until they start > > running, and > > now you've overloaded that CPU. The sum was there to make sure new > > things > > coming onto the CPU added actual load to the queue instead of looking > > like there > > was no load. > > > > Is this going to be a problem now with this new code? > > That is a good question! > > On the one hand, you may well be right. > > On the other hand, while I see the old code calculating > runnable_sum, I don't really see it _using_ it to drive > scheduling decisions. 
> > It would be easy to define the CPU cfs_rq->runnable_load_sum > as being the sum of task_se_h_weight() of each runnable task > on the CPU (for example), but what would we use it for? > > What am I missing? It's super subtle, but you're right in that we don't really need the runnable_avg per se, but what you do is you kill calc_group_runnable, which used to do this load_avg = max(cfs_rq->avg.load_avg, scale_load_down(cfs_rq->load.weight)); runnable = max(cfs_rq->avg.runnable_load_avg, scale_load_down(cfs_rq->runnable_weight)); so we'd account for this weirdness of adding a previously idle process to a new CPU and overloading the CPU because we'd add a bunch of these 0 weight looking tasks that suddenly all wake up and are on the same CPU. So we used the runnable_weight to account for what was actually happening, and the max of load_avg and the weight to figure out what the potential load would be. What you've done here is change the weighting stuff to be completely based on load avg, which is problematic for the reasons above. Did you fix this later on in your patches? If so then just tell me to keep reading and I'll do that ;). Thanks, Josef
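The calc_group_runnable() guard Josef describes — taking the max of the decayed average and the instantaneous weight so that freshly woken, fully decayed tasks still register load — can be sketched as follows (a user-space approximation; the names are illustrative, not the kernel's exact code):

```c
#include <assert.h>

#define SCHED_FIXEDPOINT_SHIFT 10

static unsigned long max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/* Sketch of the old calc_group_runnable() behavior: a task whose
 * load_avg has decayed to ~0 still contributes its instantaneous
 * (scaled-down) weight, so a CPU that suddenly wakes a pile of such
 * tasks does not look empty to the load balancer. */
static unsigned long group_load(unsigned long load_avg,
                                unsigned long load_weight /* high-res */)
{
    return max_ul(load_avg, load_weight >> SCHED_FIXEDPOINT_SHIFT);
}
```

The first assertion below is the interesting case: zero average, nonzero weight, so the CPU is still charged for the load it is about to feel.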
Re: [PATCH 03/10] sched,fair: redefine runnable_load_avg as the sum of task_h_load
On Fri, Jun 28, 2019 at 04:49:06PM -0400, Rik van Riel wrote: > The runnable_load magic is used to quickly propagate information about > runnable tasks up the hierarchy of runqueues. The runnable_load_avg is > mostly used for the load balancing code, which only examines the value at > the root cfs_rq. > > Redefine the root cfs_rq runnable_load_avg to be the sum of task_h_loads > of the runnable tasks. This works because the hierarchical runnable_load of > a task is already equal to the task_se_h_load today. This provides enough > information to the load balancer. > > The runnable_load_avg of the cgroup cfs_rqs does not appear to be > used for anything, so don't bother calculating those. > > This removes one of the things that the code currently traverses the > cgroup hierarchy for, and getting rid of it brings us one step closer > to a flat runqueue for the CPU controller. > My memory on this stuff is very hazy, but IIRC we had the runnable_sum and the runnable_avg separated out because you could have the avg lag behind the sum. So like you enqueue a bunch of new entities whose avg may have decayed a bunch and so their overall load is not felt on the CPU until they start running, and now you've overloaded that CPU. The sum was there to make sure new things coming onto the CPU added actual load to the queue instead of looking like there was no load. Is this going to be a problem now with this new code? 
> +static inline void > +update_runnable_load_avg(struct sched_entity *se) > +{ > + struct cfs_rq *root_cfs_rq = &cfs_rq_of(se)->rq->cfs; > + long new_h_load, delta; > + > + SCHED_WARN_ON(!entity_is_task(se)); > + > + if (!se->on_rq) > + return; > > - sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg); > - sub_positive(&cfs_rq->avg.runnable_load_sum, > - se_runnable(se) * se->avg.runnable_load_sum); > + new_h_load = task_se_h_load(se); > + delta = new_h_load - se->enqueued_h_load; > + root_cfs_rq->avg.runnable_load_avg += delta; Should we use add_positive here? Thanks, Josef
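For context on the question: the kernel's sub_positive()/add_positive() helpers clamp at zero so an over-large delta cannot wrap an unsigned average. A user-space sketch of their semantics (simplified from the kernel macros, which also deal with compiler reordering):

```c
#include <assert.h>

/* sub_positive(): subtract, clamping at zero, so a stale delta can
 * never wrap an unsigned average below zero. */
static void sub_positive(unsigned long *ptr, unsigned long val)
{
    *ptr = (*ptr > val) ? *ptr - val : 0;
}

/* add_positive(): add a signed delta with the same underflow guard,
 * which is why it would fit the "+= delta" in the hunk above. */
static void add_positive(unsigned long *ptr, long delta)
{
    if (delta < 0 && (unsigned long)(-delta) > *ptr)
        *ptr = 0;
    else
        *ptr += (unsigned long)delta;
}
```

A bare `+=` with a negative `delta` larger than the current average would underflow to a huge value, which is exactly the failure mode these helpers exist to prevent.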
Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression
On Wed, Jun 26, 2019 at 10:39:36AM +0800, Rong Chen wrote: > On 6/25/19 10:22 PM, Josef Bacik wrote: > > On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote: > > > "Huang, Ying" writes: > > > > > > > "Huang, Ying" writes: > > > > > > > > > Hi, Josef, > > > > > > > > > > kernel test robot writes: > > > > > > > > > > > Greeting, > > > > > > > > > > > > FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to > > > > > > commit: > > > > > > > > > > > > > > > > > > commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve > > > > > > delalloc metadata differently") > > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git > > > > > > master > > > > > > > > > > > > in testcase: aim7 > > > > > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ > > > > > > 3.00GHz with 384G memory > > > > > > with following parameters: > > > > > > > > > > > > disk: 4BRD_12G > > > > > > md: RAID0 > > > > > > fs: btrfs > > > > > > test: disk_rr > > > > > > load: 1500 > > > > > > cpufreq_governor: performance > > > > > > > > > > > > test-description: AIM7 is a traditional UNIX system level benchmark > > > > > > suite which is used to test and measure the performance of multiuser > > > > > > system. > > > > > > test-url: > > > > > > https://sourceforge.net/projects/aimbench/files/aim-suite7/ > > > > > Here's another regression, do you have time to take a look at this? > > > > Ping > > > Ping again ... 
> > > > > Finally got time to look at this but I can't get the reproducer to work > > > > root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml > > Traceback (most recent call last): > > 11: from /root/lkp-tests/bin/run-local:18:in `' > > 10: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 9: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 8: from /root/lkp-tests/lib/yaml.rb:5:in `' > > 7: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 6: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 5: from /root/lkp-tests/lib/common.rb:9:in `' > > 4: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 3: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > 2: from /root/lkp-tests/lib/array_ext.rb:3:in `' > > 1: from > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' > > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': > > cannot load such file -- active_support/core_ext/enumerable (LoadError) > > Hi Josef, > > I tried the latest lkp-tests, and didn't have the problem. Could you please > update the lkp-tests repo and run "lkp install" again? > I updated it this morning, and I just updated it now, my tree is on 2c5b1a95b08dbe81bba64419c482a877a3b424ac lkp install says everything is installed except No match for argument: libipc-run-perl and it still doesn't run properly. Thanks, Josef
Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression
On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote: > "Huang, Ying" writes: > > > "Huang, Ying" writes: > > > >> Hi, Josef, > >> > >> kernel test robot writes: > >> > >>> Greeting, > >>> > >>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit: > >>> > >>> > >>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve > >>> delalloc metadata differently") > >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > >>> > >>> in testcase: aim7 > >>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz > >>> with 384G memory > >>> with following parameters: > >>> > >>> disk: 4BRD_12G > >>> md: RAID0 > >>> fs: btrfs > >>> test: disk_rr > >>> load: 1500 > >>> cpufreq_governor: performance > >>> > >>> test-description: AIM7 is a traditional UNIX system level benchmark > >>> suite which is used to test and measure the performance of multiuser > >>> system. > >>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/ > >> > >> Here's another regression, do you have time to take a look at this? > > > > Ping > > Ping again ... 
> Finally got time to look at this but I can't get the reproducer to work root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml Traceback (most recent call last): 11: from /root/lkp-tests/bin/run-local:18:in `' 10: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 9: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 8: from /root/lkp-tests/lib/yaml.rb:5:in `' 7: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 6: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 5: from /root/lkp-tests/lib/common.rb:9:in `' 4: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 3: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' 2: from /root/lkp-tests/lib/array_ext.rb:3:in `' 1: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require' /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': cannot load such file -- active_support/core_ext/enumerable (LoadError) I can't even figure out from the job file or anything how I'm supposed to run AIM7 itself. This is on a Fedora 30 box, so it's relatively recent. Thanks, Josef
Re: [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
On Tue, Jun 18, 2019 at 07:42:46AM +0000, Naohiro Aota wrote: > On 2019/06/13 23:09, Josef Bacik wrote: > > On Fri, Jun 07, 2019 at 10:10:14PM +0900, Naohiro Aota wrote: > >> If the btrfs volume has mirrored block groups, it unconditionally makes > >> un-mirrored block groups read only. When we have mirrored block groups, but > >> don't have writable block groups, this will drop all writable block groups. > >> So, check if we have at least one writable mirrored block group before > >> setting un-mirrored block groups read only. > >> > > > > I don't understand why you want this. Thanks, > > > > Josef > > > > This is necessary to handle e.g. the btrfs/124 case. > > When we mount a degraded RAID1 FS and write to it, and then > re-mount with the full set of devices, the write pointers of the corresponding > zones of the written BG differ. Patch 07 marks such a block group > as "wp_broken" and makes it read only. In this situation, we only > have read-only RAID1 BGs because of "wp_broken", and un-mirrored BGs > are also marked read only, because we have RAID1 BGs. > As a result, all the BGs are now read only, so we > cannot even start a rebalance to fix the situation. Ah ok, please add this explanation to the changelog. Thanks, Josef
Re: [PATCH 3/8] blkcg: implement REQ_CGROUP_PUNT
On Thu, Jun 13, 2019 at 05:33:45PM -0700, Tejun Heo wrote: > When a shared kthread needs to issue a bio for a cgroup, doing so > synchronously can lead to priority inversions as the kthread can be > trapped waiting for that cgroup. This patch implements > REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing > to a dedicated per-blkcg work item to avoid such priority inversions. > > This will be used to fix priority inversions in btrfs compression and > should be generally useful as we grow filesystem support for > comprehensive IO control. > > Signed-off-by: Tejun Heo > Cc: Chris Mason Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 2/8] blkcg, writeback: Implement wbc_blkcg_css()
On Thu, Jun 13, 2019 at 05:33:44PM -0700, Tejun Heo wrote: > Add a helper to determine the target blkcg from wbc. > > Signed-off-by: Tejun Heo Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 1/8] blkcg, writeback: Add wbc->no_wbc_acct
On Thu, Jun 13, 2019 at 05:33:43PM -0700, Tejun Heo wrote: > When writeback IOs are bounced through async layers, the IOs should > only be accounted against the wbc from the original bdi writeback to > avoid confusing cgroup inode ownership arbitration. Add > wbc->no_wbc_acct to allow disabling wbc accounting. This will be used > make btfs compression work well with cgroup IO control. > > Signed-off-by: Tejun Heo Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 0/2] Fix misuse of blk_rq_stats in blk-iolatency
On Fri, Jun 14, 2019 at 02:44:11PM +0300, Pavel Begunkov (Silence) wrote: > From: Pavel Begunkov > > There are implicit assumptions about struct blk_rq_stats which make > it very easy to misuse. The first patch fixes the consequences, and the > second employs the type system to prevent recurrences. > > > Pavel Begunkov (2): > blk-iolatency: Fix zero mean in previous stats > blk-stats: Introduce explicit stat staging buffers > I don't have a problem with this, but it's up to Jens I suppose. Acked-by: Josef Bacik Thanks, Josef
Re: [PATCH 2/2] nbd: add support for nbd as root device
On Wed, Jun 12, 2019 at 07:31:44PM +0300, roman.stratiie...@globallogic.com wrote: > From: Roman Stratiienko > > Adding support to nbd to use it as a root device. This code essentially > provides a minimal nbd-client implementation within the kernel. It opens > a socket and makes the negotiation with the server. Afterwards it passes > the socket to the normal nbd code to handle the connection. > > The arguments for the server are passed via the kernel command line. > The kernel command line has the format > 'nbdroot=[:]/'. > SERVER_IP is optional. If it is not available it will use the > root_server_addr transmitted through DHCP. > > Based on those arguments, the connection to the server is established > and is connected to the nbd0 device. The root device therefore is > root=/dev/nbd0. > > The patch was initially posted by Markus Pargmann > and can be found at https://lore.kernel.org/patchwork/patch/532556/ > > Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c > Signed-off-by: Roman Stratiienko > Reviewed-by: Aleksandr Bulyshchenko Just throw nbd-client in your initramfs. Every nbd server has its own handshake protocol, and embedding one particular server's handshake protocol into the kernel isn't the answer here. Thanks, Josef
Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote: > If a zoned block device is found, get its zone information (number of zones > and zone size) using the new helper function btrfs_get_dev_zonetypes(). To > avoid costly run-time zone report commands to test the device zones type > during block allocation, attach the seqzones bitmap to the device structure > to indicate if a zone is sequential or accept random writes. > > This patch also introduces the helper function btrfs_dev_is_sequential() to > test if the zone storing a block is a sequential write required zone. > > Signed-off-by: Damien Le Moal > Signed-off-by: Naohiro Aota > --- > fs/btrfs/volumes.c | 143 + > fs/btrfs/volumes.h | 33 +++ > 2 files changed, 176 insertions(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 1c2a6e4b39da..b673178718e3 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path, > return ret; > } > > +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, > + struct blk_zone **zones, > + unsigned int *nr_zones, gfp_t gfp_mask) > +{ > + struct blk_zone *z = *zones; > + int ret; > + > + if (!z) { > + z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL); > + if (!z) > + return -ENOMEM; > + } > + > + ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, > + z, nr_zones, gfp_mask); > + if (ret != 0) { > + btrfs_err(device->fs_info, "Get zone at %llu failed %d\n", > + pos, ret); > + return ret; > + } > + > + *zones = z; > + > + return 0; > +} > + > +static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device) > +{ > + kfree(device->seq_zones); > + kfree(device->empty_zones); > + device->seq_zones = NULL; > + device->empty_zones = NULL; > + device->nr_zones = 0; > + device->zone_size = 0; > + device->zone_size_shift = 0; > +} > + > +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, > +struct blk_zone *zone, gfp_t gfp_mask) > +{ > + 
unsigned int nr_zones = 1; > + int ret; > + > + ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask); > + if (ret != 0 || !nr_zones) > + return ret ? ret : -EIO; > + > + return 0; > +} > + > +int btrfs_get_dev_zonetypes(struct btrfs_device *device) > +{ > + struct block_device *bdev = device->bdev; > + sector_t nr_sectors = bdev->bd_part->nr_sects; > + sector_t sector = 0; > + struct blk_zone *zones = NULL; > + unsigned int i, n = 0, nr_zones; > + int ret; > + > + device->zone_size = 0; > + device->zone_size_shift = 0; > + device->nr_zones = 0; > + device->seq_zones = NULL; > + device->empty_zones = NULL; > + > + if (!bdev_is_zoned(bdev)) > + return 0; > + > + device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT; > + device->zone_size_shift = ilog2(device->zone_size); > + device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev)); > + if (nr_sectors & (bdev_zone_sectors(bdev) - 1)) > + device->nr_zones++; > + > + device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones), > + sizeof(*device->seq_zones), GFP_KERNEL); > + if (!device->seq_zones) > + return -ENOMEM; > + > + device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones), > + sizeof(*device->empty_zones), GFP_KERNEL); > + if (!device->empty_zones) > + return -ENOMEM; > + > +#define BTRFS_REPORT_NR_ZONES 4096 > + > + /* Get zones type */ > + while (sector < nr_sectors) { > + nr_zones = BTRFS_REPORT_NR_ZONES; > + ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, > + &zones, &nr_zones, GFP_KERNEL); > + if (ret != 0 || !nr_zones) { > + if (!ret) > + ret = -EIO; > + goto out; > + } > + > + for (i = 0; i < nr_zones; i++) { > + if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ) > + set_bit(n, device->seq_zones); > + if (zones[i].cond == BLK_ZONE_COND_EMPTY) > + set_bit(n, device->empty_zones); > + sector = zones[i].start + zones[i].len; > + n++; > + } > + } > + > + if (n != device->nr_zones) { > + btrfs_err(device->fs_info, > + "Inconsistent number of zones (%u / %u)\n", n, > + 
device->nr_zones); > + ret = -EIO; > +
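The seq_zones/empty_zones bookkeeping in the patch above boils down to one bit per zone, indexed by zone number, with the zone number derived by shifting the byte position down by `zone_size_shift`. A minimal user-space sketch of that idea (the bitmap helpers here are plain-C stand-ins for the kernel's `set_bit()`/`test_bit()`, and `dev_is_sequential()` mirrors the `btrfs_dev_is_sequential()` helper the patch describes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Plain-C stand-ins for the kernel's set_bit()/test_bit(). */
static void set_bit_ul(unsigned long *map, unsigned int n)
{
	map[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
}

static int test_bit_ul(const unsigned long *map, unsigned int n)
{
	return !!(map[n / BITS_PER_LONG] & (1UL << (n % BITS_PER_LONG)));
}

/*
 * Mirrors the shape of btrfs_dev_is_sequential(): shift a byte position
 * down to a zone number and test the per-zone bitmap.
 */
static int dev_is_sequential(const unsigned long *seq_zones, uint64_t pos,
			     unsigned int zone_size_shift)
{
	return test_bit_ul(seq_zones, (unsigned int)(pos >> zone_size_shift));
}
```

With a 256 MiB zone size (`zone_size_shift == 28`), any position inside zone 1 maps to bit 1, so the block-allocation path can classify a block with one shift and one bitmap test instead of a run-time zone report command.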
Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote: > When in HMZONED mode, make sure that device super blocks are located in > randomly writable zones of zoned block devices. That is, do not write super > blocks in sequential write required zones of host-managed zoned block > devices as update would not be possible. > > Signed-off-by: Damien Le Moal > Signed-off-by: Naohiro Aota > --- > fs/btrfs/disk-io.c | 11 +++ > fs/btrfs/disk-io.h | 1 + > fs/btrfs/extent-tree.c | 4 > fs/btrfs/scrub.c | 2 ++ > 4 files changed, 18 insertions(+) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 7c1404c76768..ddbb02906042 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct > block_device *bdev) > return latest; > } > > +int btrfs_check_super_location(struct btrfs_device *device, u64 pos) > +{ > + /* any address is good on a regular (zone_size == 0) device */ > + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */ This is not how you do multi-line comments in the kernel. Thanks, Josef
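For reference, the kernel's multi-line comment convention (Documentation/process/coding-style.rst) would fold the two stacked single-line comments into one block, roughly:

```c
/*
 * Any address is good on a regular (zone_size == 0) device.
 * On a zoned device, non-SEQUENTIAL WRITE REQUIRED zones are
 * randomly writable, so they can hold super blocks too.
 */
```

The wording here paraphrases the quoted comments; only the block layout is the point being made.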
Re: [PATCH 11/19] btrfs: introduce submit buffer
On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote: > Sequential allocation is not enough to maintain sequential delivery of > write IOs to the device. Various features (async compress, async checksum, > ...) of btrfs affect ordering of the IOs. This patch introduces submit > buffer to sort WRITE bios belonging to a block group and sort them out > sequentially in increasing block address to achieve sequential write > sequences with __btrfs_map_bio(). > > Signed-off-by: Naohiro Aota I hate everything about this. Can't we just use the plugging infrastructure for this and then make sure it re-orders the bios before submitting them? Also what's to prevent the block layer scheduler from re-arranging these io's? Thanks, Josef
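The re-ordering being discussed (collect a block group's WRITE bios and issue them in increasing block address) can be sketched in user space like this. It is purely illustrative: `struct toy_bio` is a toy stand-in for `struct bio`, and `qsort()` stands in for whatever a plugging callback would do before flushing the batch to the device:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for struct bio: just the fields the sort cares about. */
struct toy_bio {
	uint64_t sector;	/* device offset this write targets */
	const void *data;
};

static int bio_cmp(const void *a, const void *b)
{
	const struct toy_bio *x = a, *y = b;

	if (x->sector < y->sector)
		return -1;
	return x->sector > y->sector;
}

/*
 * Sort a batch of buffered writes so a sequential-write-required zone
 * sees them in strictly increasing address order at submission time.
 */
static void sort_submit_buffer(struct toy_bio *bios, size_t n)
{
	qsort(bios, n, sizeof(*bios), bio_cmp);
}
```

Note this only orders the batch at submission; as the reply points out, nothing here stops a block-layer scheduler from re-arranging the I/Os afterwards, which is why that question matters.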
Re: [PATCH 16/19] btrfs: wait existing extents before truncating
On Fri, Jun 07, 2019 at 10:10:22PM +0900, Naohiro Aota wrote: > When truncating a file, file buffers which have already been allocated but > not yet written may be truncated. Truncating these buffers could cause > breakage of a sequential write pattern in a block group if the truncated > blocks are for example followed by blocks allocated to another file. To > avoid this problem, always wait for write out of all unwritten buffers > before proceeding with the truncate execution. > > Signed-off-by: Naohiro Aota > --- > fs/btrfs/inode.c | 11 +++ > 1 file changed, 11 insertions(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 89542c19d09e..4e8c7921462f 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5137,6 +5137,17 @@ static int btrfs_setsize(struct inode *inode, struct > iattr *attr) > btrfs_end_write_no_snapshotting(root); > btrfs_end_transaction(trans); > } else { > + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > + > + if (btrfs_fs_incompat(fs_info, HMZONED)) { > + u64 sectormask = fs_info->sectorsize - 1; > + > + ret = btrfs_wait_ordered_range(inode, > +newsize & (~sectormask), > +(u64)-1); Use ALIGN(). Thanks, Josef
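The open-coded `newsize & (~sectormask)` rounds down to a sector boundary; the review is asking for the kernel's alignment macros instead of a hand-rolled mask (for a round-down, that would be `ALIGN_DOWN()`). A quick user-space check of the equivalence, with the two macros reproduced in the spirit of include/linux/align.h for a power-of-two alignment:

```c
#include <assert.h>
#include <stdint.h>

/* Round x up/down to a power-of-two boundary a, kernel-style. */
#define ALIGN(x, a)       (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))
#define ALIGN_DOWN(x, a)  ((x) & ~((uint64_t)(a) - 1))

/* Equivalent to the patch's newsize & (~sectormask). */
static uint64_t truncate_start(uint64_t newsize, uint64_t sectorsize)
{
	return ALIGN_DOWN(newsize, sectorsize);
}
```

The macro form says what the expression means (align to `sectorsize`) without the reader having to reconstruct it from a mask variable.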
Re: [PATCH 18/19] btrfs: support dev-replace in HMZONED mode
On Fri, Jun 07, 2019 at 10:10:24PM +0900, Naohiro Aota wrote: > Currently, dev-replace copy all the device extents on source device to the > target device, and it also clones new incoming write I/Os from users to the > source device into the target device. > > Cloning incoming IOs can break the sequential write rule in the target > device. When write is mapped in the middle of block group, that I/O is > directed in the middle of a zone of target device, which breaks the > sequential write rule. > > However, the cloning function cannot be simply disabled since incoming I/Os > targeting already copied device extents must be cloned so that the I/O is > executed on the target device. > > We cannot use dev_replace->cursor_{left,right} to determine whether bio > is going to not yet copied region. Since we have time gap between > finishing btrfs_scrub_dev() and rewriting the mapping tree in > btrfs_dev_replace_finishing(), we can have newly allocated device extent > which is never cloned (by handle_ops_on_dev_replace) nor copied (by the > dev-replace process). > > So the point is to copy only already existing device extents. This patch > introduce mark_block_group_to_copy() to mark existing block group as a > target of copying. Then, handle_ops_on_dev_replace() and dev-replace can > check the flag to do their job. > > This patch also handles empty region between used extents. Since > dev-replace is smart to copy only used extents on source device, we have to > fill the gap to honor the sequential write rule in the target device. 
> > Signed-off-by: Naohiro Aota > --- > fs/btrfs/ctree.h | 1 + > fs/btrfs/dev-replace.c | 96 +++ > fs/btrfs/extent-tree.c | 32 +++- > fs/btrfs/scrub.c | 169 + > fs/btrfs/volumes.c | 27 ++- > 5 files changed, 319 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index dad8ea5c3b99..a0be2b96117a 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -639,6 +639,7 @@ struct btrfs_block_group_cache { > unsigned int has_caching_ctl:1; > unsigned int removed:1; > unsigned int wp_broken:1; > + unsigned int to_copy:1; > > int disk_cache_state; > > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c > index fbe5ea2a04ed..5011b5ce0e75 100644 > --- a/fs/btrfs/dev-replace.c > +++ b/fs/btrfs/dev-replace.c > @@ -263,6 +263,13 @@ static int btrfs_init_dev_replace_tgtdev(struct > btrfs_fs_info *fs_info, > device->dev_stats_valid = 1; > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > device->fs_devices = fs_info->fs_devices; > + if (bdev_is_zoned(bdev)) { > + ret = btrfs_get_dev_zonetypes(device); > + if (ret) { > + mutex_unlock(&fs_info->fs_devices->device_list_mutex); > + goto error; > + } > + } > list_add(&device->dev_list, &fs_info->fs_devices->devices); > fs_info->fs_devices->num_devices++; > fs_info->fs_devices->open_devices++; > @@ -396,6 +403,88 @@ static char* btrfs_dev_name(struct btrfs_device *device) > return rcu_str_deref(device->name); > } > > +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info, > + struct btrfs_device *src_dev) > +{ > + struct btrfs_path *path; > + struct btrfs_key key; > + struct btrfs_key found_key; > + struct btrfs_root *root = fs_info->dev_root; > + struct btrfs_dev_extent *dev_extent = NULL; > + struct btrfs_block_group_cache *cache; > + struct extent_buffer *l; > + int slot; > + int ret; > + u64 chunk_offset, length; > + > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + path->reada = READA_FORWARD; > + path->search_commit_root = 1; > + path->skip_locking = 1; > + 
+ key.objectid = src_dev->devid; > + key.offset = 0ull; > + key.type = BTRFS_DEV_EXTENT_KEY; > + > + while (1) { > + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > + if (ret < 0) > + break; > + if (ret > 0) { > + if (path->slots[0] >= > + btrfs_header_nritems(path->nodes[0])) { > + ret = btrfs_next_leaf(root, path); > + if (ret < 0) > + break; > + if (ret > 0) { > + ret = 0; > + break; > + } > + } else { > + ret = 0; > + } > + } > + > + l = path->nodes[0]; > + slot = path->slots[0]; > + > + btrfs_item_key_to_cpu(l, &found_key, slot); > + > + if (found_key.objectid != src_dev->devid) > +
Re: [LKP] [btrfs] 302167c50b: fio.write_bw_MBps -12.4% regression
On Fri, May 24, 2019 at 03:46:17PM +0800, Huang, Ying wrote: > "Huang, Ying" writes: > > > "Huang, Ying" writes: > > > >> Hi, Josef, > >> > >> kernel test robot writes: > >> > >>> Greeting, > >>> > >>> FYI, we noticed a -12.4% regression of fio.write_bw_MBps due to commit: > >>> > >>> > >>> commit: 302167c50b32e7fccc98994a91d40ddbbab04e52 ("btrfs: don't end > >>> the transaction for delayed refs in throttle") > >>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git > >>> pending-fixes > >>> > >>> in testcase: fio-basic > >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz > >>> with 64G memory > >>> with following parameters: > >>> > >>> runtime: 300s > >>> nr_task: 8t > >>> disk: 1SSD > >>> fs: btrfs > >>> rw: randwrite > >>> bs: 4k > >>> ioengine: sync > >>> test_size: 400g > >>> cpufreq_governor: performance > >>> ucode: 0xb2e > >>> > >>> test-description: Fio is a tool that will spawn a number of threads > >>> or processes doing a particular type of I/O action as specified by > >>> the user. > >>> test-url: https://github.com/axboe/fio > >>> > >>> > >> > >> Do you have time to take a look at this regression? > > > > Ping > > Ping again. > This happens because now we rely more on on-demand flushing than the catchup flushing that happened before. This is just one case where it's slightly worse, overall this change provides better latencies, and even in this result it provided better completion latencies because we're not randomly flushing at the end of a transaction. It does appear to be costing writes in that they will spend more time flushing than before, so you get slightly lower throughput on pure small write workloads. I can't actually see the slowdown locally. This patch is here to stay, it just shows we need to continue to refine the flushing code to be less spikey/painful. Thanks, Josef
Re: [PATCH] mm: filemap: correct the comment about VM_FAULT_RETRY
On Fri, Apr 26, 2019 at 07:22:11AM +0800, Yang Shi wrote: > The commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking > operations") changed when mmap_sem is dropped during filemap page fault > and when returning VM_FAULT_RETRY. > > Correct the comment to reflect the change. > > Cc: Josef Bacik > Signed-off-by: Yang Shi > --- Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] btrfs: fix a NULL pointer dereference
On Thu, Mar 14, 2019 at 09:54:07AM +0200, Nikolay Borisov wrote: > > > On 14.03.19 г. 9:50 ч., Kangjie Lu wrote: > > btrfs_lookup_block_group may fail and return NULL. The fix goes > > to out when it fails to avoid NULL pointer dereference. > > Actually no, in this case btrfs_lookup_block_group must never fail > because if we have an allocated eb then it must have been allocated from > a bg. > Agreed, we only get to btrfs_free_tree_block() if we are actually deleting the extent buffer. We would have had to read in the extent buffer first to get here, which would have failed if there was no block group. We can't get into this situation with a specifically crafted file system to exploit this as we'd bail out well before we could get to btrfs_free_tree_block(). Adding an ASSERT() makes sure developers aren't doing anything stupid, but again we'd have to be doing something _super_ stupid to hit it. Thanks, Josef
Re: [PATCH v2 0/3] blkcg: sync() isolation
On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote: > = Problem = > > When sync() is executed from a high-priority cgroup, the process is forced to > wait the completion of the entire outstanding writeback I/O, even the I/O that > was originally generated by low-priority cgroups potentially. > > This may cause massive latencies to random processes (even those running in > the > root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic > priority inversion problem. > > This topic has been previously discussed here: > https://patchwork.kernel.org/patch/10804489/ > Sorry to move the goal posts on you again Andrea, but Tejun and I talked about this some more offline. We don't want cgroup to become the arbiter of correctness/behavior here. We just want it to be isolating things. For you that means you can drop the per-cgroup flag stuff, and only do the priority boosting for multiple sync(2) waiters. That is a real priority inversion that needs to be fixed. io.latency and io.max are capable of noticing that a low priority group is going above their configured limits and putting pressure elsewhere accordingly. Tejun said he'd rather see the sync(2) isolation be done at the namespace level. That way if you have fs namespacing you are already isolated to your namespace. If you feel like tackling that then hooray, but that's a separate dragon to slay so don't feel like you have to right now. This way we keep cgroup doing its job, controlling resources. Then we allow namespacing to do its thing, isolating resources. Thanks, Josef
Re: [PATCH v2 2/3] blkcg: introduce io.sync_isolation
On Thu, Mar 07, 2019 at 07:08:33PM +0100, Andrea Righi wrote: > Add a flag to the blkcg cgroups to make sync()'ers in a cgroup only be > allowed to write out pages that have been dirtied by the cgroup itself. > > This flag is disabled by default (meaning that we are not changing the > previous behavior by default). > > When this flag is enabled any cgroup can write out only dirty pages that > belong to the cgroup itself (except for the root cgroup that would still > be able to write out all pages globally). > > Signed-off-by: Andrea Righi Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v2 1/3] blkcg: prevent priority inversion problem during sync()
On Thu, Mar 07, 2019 at 07:08:32PM +0100, Andrea Righi wrote: > Prevent priority inversion problem when a high-priority blkcg issues a > sync() and it is forced to wait the completion of all the writeback I/O > generated by any other low-priority blkcg, causing massive latencies to > processes that shouldn't be I/O-throttled at all. > > The idea is to save a list of blkcg's that are waiting for writeback: > every time a sync() is executed the current blkcg is added to the list. > > Then, when I/O is throttled, if there's a blkcg waiting for writeback > different than the current blkcg, no throttling is applied (we can > probably refine this logic later, i.e., a better policy could be to > adjust the throttling I/O rate using the blkcg with the highest speed > from the list of waiters - priority inheritance, kinda). > > Signed-off-by: Andrea Righi > --- > block/blk-cgroup.c | 131 +++ > block/blk-throttle.c | 11 ++- > fs/fs-writeback.c | 5 ++ > fs/sync.c | 8 +- > include/linux/backing-dev-defs.h | 2 + > include/linux/blk-cgroup.h | 23 ++ > mm/backing-dev.c | 2 + > 7 files changed, 178 insertions(+), 4 deletions(-) > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index 2bed5725aa03..4305e78d1bb2 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = { > }; > EXPORT_SYMBOL_GPL(io_cgrp_subsys); > > +#ifdef CONFIG_CGROUP_WRITEBACK > +struct blkcg_wb_sleeper { > + struct backing_dev_info *bdi; > + struct blkcg *blkcg; > + refcount_t refcnt; > + struct list_head node; > +}; > + > +static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock); > +static LIST_HEAD(blkcg_wb_sleeper_list); > + > +static struct blkcg_wb_sleeper * > +blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi) > +{ > + struct blkcg_wb_sleeper *bws; > + > + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node) > + if (bws->blkcg == blkcg && bws->bdi == bdi) > + return bws; > + return NULL; > +} > + > +static void 
blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws) > +{ > + list_add(&bws->node, &blkcg_wb_sleeper_list); > +} > + > +static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws) > +{ > + list_del_init(&bws->node); > +} > + > +/** > + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device > + * @blkcg: current blkcg cgroup > + * @bdi: block device to check > + * > + * Return true if any other blkcg different than the current one is waiting > for > + * writeback on the target block device, false otherwise. > + */ > +bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info > *bdi) > +{ > + struct blkcg_wb_sleeper *bws; > + bool ret = false; > + > + spin_lock(&blkcg_wb_sleeper_lock); > + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node) > + if (bws->bdi == bdi && bws->blkcg != blkcg) { > + ret = true; > + break; > + } > + spin_unlock(&blkcg_wb_sleeper_lock); > + > + return ret; > +} No global lock please, add something to the bdi I think? Also have a fast path of if (list_empty(&blkcg_wb_sleeper_list)) return false; we don't need to be super accurate here. Thanks, Josef
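The fast path being suggested — bail out before taking any lock when no one is waiting — looks roughly like this in a user-space sketch. Toy types stand in for the blkcg/bdi structures, the spinlock is reduced to comments, and the unlocked emptiness check is deliberately racy, since as the reply says the check does not need to be accurate:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct blkcg_wb_sleeper and the global list. */
struct sleeper {
	int blkcg_id;
	int bdi_id;
	struct sleeper *next;
};

static struct sleeper *sleeper_list;	/* head of the waiter list */

static bool wb_waiters_on_bdi(int blkcg_id, int bdi_id)
{
	struct sleeper *s;
	bool ret = false;

	/*
	 * Fast path: unlocked emptiness check before taking the lock.
	 * A stale answer only makes one throttling decision go the old
	 * way, which is acceptable here.
	 */
	if (!sleeper_list)
		return false;

	/* spin_lock(&blkcg_wb_sleeper_lock); */
	for (s = sleeper_list; s; s = s->next) {
		if (s->bdi_id == bdi_id && s->blkcg_id != blkcg_id) {
			ret = true;
			break;
		}
	}
	/* spin_unlock(&blkcg_wb_sleeper_lock); */

	return ret;
}
```

In the common case (nobody doing sync()) every throttled I/O then costs one pointer load instead of a contended global spinlock acquisition.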
Re: [PATCH v2 3/3] blkcg: implement sync() isolation
On Thu, Mar 07, 2019 at 07:08:34PM +0100, Andrea Righi wrote: > Keep track of the inodes that have been dirtied by each blkcg cgroup and > make sure that a blkcg issuing a sync() can trigger the writeback + wait > of only those pages that belong to the cgroup itself. > > This behavior is applied only when io.sync_isolation is enabled in the > cgroup, otherwise the old behavior is applied: sync() triggers the > writeback of any dirty page. > > Signed-off-by: Andrea Righi > --- > block/blk-cgroup.c | 47 ++ > fs/fs-writeback.c | 52 +++--- > fs/inode.c | 1 + > include/linux/blk-cgroup.h | 22 > include/linux/fs.h | 4 +++ > mm/page-writeback.c | 1 + > 6 files changed, 124 insertions(+), 3 deletions(-) > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index 4305e78d1bb2..7d3b26ba4575 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info > *bdi) > spin_unlock(&blkcg_wb_sleeper_lock); > rcu_read_unlock(); > } > + > +/** > + * blkcg_set_mapping_dirty - set owner of a dirty mapping > + * @mapping: target address space > + * > + * Set the current blkcg as the owner of the address space @mapping (the > first > + * blkcg that dirties @mapping becomes the owner). > + */ > +void blkcg_set_mapping_dirty(struct address_space *mapping) > +{ > + struct blkcg *curr_blkcg, *blkcg; > + > + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) || > + mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) > + return; > + > + rcu_read_lock(); > + curr_blkcg = blkcg_from_current(); > + blkcg = blkcg_from_mapping(mapping); > + if (curr_blkcg != blkcg) { > + if (blkcg) > + css_put(&blkcg->css); > + css_get(&curr_blkcg->css); > + rcu_assign_pointer(mapping->i_blkcg, curr_blkcg); > + } > + rcu_read_unlock(); > +} > + > +/** > + * blkcg_set_mapping_clean - clear the owner of a dirty mapping > + * @mapping: target address space > + * > + * Unset the owner of @mapping when it becomes clean. 
> + */ > + > +void blkcg_set_mapping_clean(struct address_space *mapping) > +{ > + struct blkcg *blkcg; > + > + rcu_read_lock(); > + blkcg = rcu_dereference(mapping->i_blkcg); > + if (blkcg) { > + css_put(&blkcg->css); > + RCU_INIT_POINTER(mapping->i_blkcg, NULL); > + } > + rcu_read_unlock(); > +} > #endif > Why do we need this? We already have the inode_attach_wb(), which has the blkcg_css embedded in it for whoever dirtied the inode first. Can we not just use that? Thanks, Josef
Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()
On Mon, Feb 11, 2019 at 09:40:29PM +0100, Andrea Righi wrote: > On Mon, Feb 11, 2019 at 10:39:34AM -0500, Josef Bacik wrote: > > On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote: > > > This is an attempt to mitigate the priority inversion problem of a > > > high-priority blkcg issuing a sync() and being forced to wait the > > > completion of all the writeback I/O generated by any other low-priority > > > blkcg, causing massive latencies to processes that shouldn't be > > > I/O-throttled at all. > > > > > > The idea is to save a list of blkcg's that are waiting for writeback: > > > every time a sync() is executed the current blkcg is added to the list. > > > > > > Then, when I/O is throttled, if there's a blkcg waiting for writeback > > > different than the current blkcg, no throttling is applied (we can > > > probably refine this logic later, i.e., a better policy could be to > > > adjust the throttling I/O rate using the blkcg with the highest speed > > > from the list of waiters - priority inheritance, kinda). > > > > > > This topic has been discussed here: > > > https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/ > > > > > > But we didn't come up with any definitive solution. > > > > > > This patch is not a definitive solution either, but it's an attempt to > > > continue addressing this issue and handling the priority inversion > > > problem with sync() in a better way. > > > > > > Signed-off-by: Andrea Righi > > > > Talked with Tejun about this some and we agreed the following is probably > > the > > best way forward > > First of all thanks for the update! > > > > > 1) Track the submitter of the wb work to the writeback code. > > Are we going to track the cgroup that originated the dirty pages (or > maybe dirty inodes) or do you have any idea in particular? > The guy doing the sync(), so that way we can accomplish #3. But really this is an implementation detail, however you want to accomplish it is fine by me. 
> > 2) Sync() defaults to the root cg, and and it writes all the things as the > > root > >cg. > > OK. > > > 3) Add a flag to the cgroups that would make sync()'ers in that group only > > be > >allowed to write out things that belong to its group. > > So, IIUC, when this flag is enabled a cgroup that is doing sync() would > trigger the writeback of the pages that belong to that cgroup only and > it waits only for these pages to be sync-ed, right? In this case > writeback can still go at cgroup's speed. > > Instead when the flag is disabled, sync() would trigger writeback I/O > globally, as usual, and it goes at full speed (root cgroup's speed). > > Am I understanding correctly? > Yup that's exactly it. Thanks, Josef
Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()
On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote: > This is an attempt to mitigate the priority inversion problem of a > high-priority blkcg issuing a sync() and being forced to wait the > completion of all the writeback I/O generated by any other low-priority > blkcg, causing massive latencies to processes that shouldn't be > I/O-throttled at all. > > The idea is to save a list of blkcg's that are waiting for writeback: > every time a sync() is executed the current blkcg is added to the list. > > Then, when I/O is throttled, if there's a blkcg waiting for writeback > different than the current blkcg, no throttling is applied (we can > probably refine this logic later, i.e., a better policy could be to > adjust the throttling I/O rate using the blkcg with the highest speed > from the list of waiters - priority inheritance, kinda). > > This topic has been discussed here: > https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/ > > But we didn't come up with any definitive solution. > > This patch is not a definitive solution either, but it's an attempt to > continue addressing this issue and handling the priority inversion > problem with sync() in a better way. > > Signed-off-by: Andrea Righi Talked with Tejun about this some and we agreed the following is probably the best way forward 1) Track the submitter of the wb work to the writeback code. 2) Sync() defaults to the root cg, and and it writes all the things as the root cg. 3) Add a flag to the cgroups that would make sync()'ers in that group only be allowed to write out things that belong to its group. This way we avoid the priority inversion of having things like systemd or random logged in user doing sync() and having to wait, and we keep low prio cgroups from causing big IO storms by syncing out stuff and getting upgraded to root priority just to avoid the inversion. 
Obviously by default we want this flag to be off since its such a big change, but people/setups really worried about this behavior (Facebook for instance would likely use this flag) can go ahead and set it and be sure we're getting good isolation and still avoiding the priority inversion associated with running sync from a high priority context. Thanks, Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote: > On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote: > > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote: > > > Hi Vivek, > > > > > > sorry for the late reply. > > > > > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote: > > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote: > > > > > > > > [..] > > > > > Alright, let's skip the root cgroup for now. I think the point here is > > > > > if we want to provide sync() isolation among cgroups or not. > > > > > > > > > > According to the manpage: > > > > > > > > > >sync() causes all pending modifications to filesystem > > > > > metadata and cached file data to be > > > > >written to the underlying filesystems. > > > > > > > > > > And: > > > > >According to the standard specification (e.g., POSIX.1-2001), > > > > > sync() schedules the writes, but > > > > >may return before the actual writing is done. However > > > > > Linux waits for I/O completions, and > > > > >thus sync() or syncfs() provide the same guarantees as fsync > > > > > called on every file in the sys‐ > > > > >tem or filesystem respectively. > > > > > > > > > > Excluding the root cgroup, do you think a sync() issued inside a > > > > > specific cgroup should wait for I/O completions only for the writes > > > > > that > > > > > have been generated by that cgroup? > > > > > > > > Can we account I/O towards the cgroup which issued "sync" only if write > > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will > > > > that solve problem, assuming its doable? > > > > > > Maybe this would mitigate the problem, in part, but it doesn't solve it. > > > > > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup > > > issues "sync", the fast cgroup needs to wait a lot, because writeback is > > > happening at the speed of the slow cgroup. 
> > > > Hi Andrea, > > > > But that's true only for I/O which has already been submitted to block > > layer, right? Any new I/O yet to be submitted could still be attributed > > to faster cgroup requesting sync. > > Right. If we could bump up the new I/O yet to be submitted I think we > could effectively prevent the priority inversion problem (the ongoing > writeback I/O should be negligible). > > > > > Until and unless cgroups limits are absurdly low, it should not take very > > long for already submitted I/O to finish. If yes, then in practice, it > > might not be a big problem? > > I was actually doing my tests with a very low limit (1MB/s both for rbps > and wbps), but this shows the problem very well I think. > > Here's what I'm doing: > > [ slow cgroup (1Mbps read/write) ] > >$ cat /sys/fs/cgroup/unified/cg1/io.max >259:0 rbps=1048576 wbps=1048576 riops=max wiops=max >$ cat /proc/self/cgroup >0::/cg1 > >$ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer > --time_based --runtime=30 > > [ fast cgroup (root cgroup, no limitation) ] > ># cat /proc/self/cgroup >0::/ > ># time sync >real 9m32,618s >user 0m0,000s >sys0m0,018s > > With this simple test I can easily trigger hung task timeout warnings > and make the whole system totally sluggish (even the processes running > in the root cgroup). > > When fio ends, writeback is still taking forever to complete, as you can > see by the insane amount that sync takes to complete. > Yeah sync() needs to be treated differently, but its kind of special too. We don't want slow to run sync() and backup fast doing sync() because we make all of the io go based on the submitting cgroup. The problem here is we don't know who's more important until we get to the blk cgroup layer, and even then sometimes we can't tell (different hierarchies would make this tricky with io.weight or io.latency). We could treat it like REQ_META and just let everything go through and back charge. 
This feels like a way for the slow group to cheat though, unless we just throttle the shit out of him before returning to user space. I'll have to think about this some more. Thanks, Josef
Re: [PATCH 07/11] btrfs: move to fn pointers for get/put workspaces
On Tue, Jan 29, 2019 at 01:17:17PM -0500, Josef Bacik wrote: > On Mon, Jan 28, 2019 at 04:24:33PM -0500, Dennis Zhou wrote: > > The previous patch added generic helpers for get_workspace() and > > put_workspace(). Now, we can migrate ownership of the workspace_manager > > to be in the compression type code as the compression code itself > > doesn't care beyond being able to get a workspace. The init/cleanup > > and get/put methods are abstracted so each compression algorithm can > > decide how they want to manage their workspaces. > > > > Signed-off-by: Dennis Zhou > > We're doing this to have special handling for extra workspaces to be free'd at > some point in the future if they are unused. This is fine by me, but why not > just add a shrinker and let it be handled by memory pressure? Then we avoid > all > this abstraction and allow for zstd to have its shrinker for its extra > workspaces. You can even use the list_lru stuff to make it super simple, then > you don't have to worry about all the infrastructure. Thanks, > Nevermind, I missed that you also change the get side to lookup the workspace for the compression level instead of cycling through the idle_ws list. In that case this is fine by me. Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 08/11] btrfs: plumb level through the compression interface
On Mon, Jan 28, 2019 at 04:24:34PM -0500, Dennis Zhou wrote: > Zlib compression supports multiple levels, but doesn't require changing > in how a workspace itself is created and managed. Zstd introduces a > different memory requirement such that higher levels of compression > require more memory. This requires changes in how the alloc()/get() > methods work for zstd. This patch plumbs compression level through the > interface as a parameter in preparation for zstd compression levels. > This gives the compression types opportunity to create/manage based on > the compression level. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 07/11] btrfs: move to fn pointers for get/put workspaces
On Mon, Jan 28, 2019 at 04:24:33PM -0500, Dennis Zhou wrote: > The previous patch added generic helpers for get_workspace() and > put_workspace(). Now, we can migrate ownership of the workspace_manager > to be in the compression type code as the compression code itself > doesn't care beyond being able to get a workspace. The init/cleanup > and get/put methods are abstracted so each compression algorithm can > decide how they want to manage their workspaces. > > Signed-off-by: Dennis Zhou We're doing this to have special handling for extra workspaces to be free'd at some point in the future if they are unused. This is fine by me, but why not just add a shrinker and let it be handled by memory pressure? Then we avoid all this abstraction and allow for zstd to have its shrinker for its extra workspaces. You can even use the list_lru stuff to make it super simple, then you don't have to worry about all the infrastructure. Thanks, Josef
Re: [PATCH 06/11] btrfs: add compression interface in (get/put)_workspace()
On Mon, Jan 28, 2019 at 04:24:32PM -0500, Dennis Zhou wrote: > There are two levels of workspace management. First, alloc()/free() > which are responsible for actually creating and destroying workspaces. > Second, at a higher level, get()/put() which is the compression code > asking for a workspace from a workspace_manager. > > The compression code shouldn't really care how it gets a workspace, but > that it got a workspace. This adds get_workspace() and put_workspace() > to be the higher level interface which is responsible for indexing into > the appropriate compression type. It also introduces > btrfs_put_workspace() and btrfs_get_workspace() to be the generic > implementations of the higher-level interface. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 05/11] btrfs: add helper methods for workspace manager init and cleanup
On Mon, Jan 28, 2019 at 04:24:31PM -0500, Dennis Zhou wrote: > Workspace manager init and cleanup code is open coded inside a for loop > over the compression types. This forces each compression type to rely on > the same workspace manager implementation. This patch creates helper > methods that will be the generic implementation for btrfs workspace > management. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 04/11] btrfs: unify compression ops with workspace_manager
On Mon, Jan 28, 2019 at 04:24:30PM -0500, Dennis Zhou wrote: > Make the workspace_manager own the interface operations rather than > managing index-paired arrays for the workspace_manager and compression > operations. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 03/11] btrfs: manage heuristic workspace as index 0
On Mon, Jan 28, 2019 at 04:24:29PM -0500, Dennis Zhou wrote: > While the heuristic workspaces aren't really compression workspaces, > they use the same interface for managing them. So rather than branching, > let's just handle them once again as the index 0 compression type. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 02/11] btrfs: rename workspaces_list to workspace_manager
On Mon, Jan 28, 2019 at 04:24:28PM -0500, Dennis Zhou wrote: > This is in preparation for zstd compression levels. As each level will > require different sized workspaces, workspaces_list is no longer a > really fitting name. > > Signed-off-by: Dennis Zhou > --- Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 01/11] btrfs: add macros for compression type and level
On Mon, Jan 28, 2019 at 04:24:27PM -0500, Dennis Zhou wrote: > It is very easy to miss places that rely on a certain bitshifting for > deciphering the type_level overloading. Make macros handle this instead. > > Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Thanks, Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote: > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote: > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > > > This is a redesign of my old cgroup-io-throttle controller: > > > https://lwn.net/Articles/330531/ > > > > > > I'm resuming this old patch to point out a problem that I think is still > > > not solved completely. > > > > > > = Problem = > > > > > > The io.max controller works really well at limiting synchronous I/O > > > (READs), but a lot of I/O requests are initiated outside the context of > > > the process that is ultimately responsible for its creation (e.g., > > > WRITEs). > > > > > > Throttling at the block layer in some cases is too late and we may end > > > up slowing down processes that are not responsible for the I/O that > > > is being processed at that level. > > > > How so? The writeback threads are per-cgroup and have the cgroup stuff set > > properly. So if you dirty a bunch of pages, they are associated with your > > cgroup, and then writeback happens and it's done in the writeback thread > > associated with your cgroup and then that is throttled. Then you are > > throttled > > at balance_dirty_pages() because the writeout is taking longer. > > Right, writeback is per-cgroup and slowing down writeback affects only > that specific cgroup, but there are cases where other processes from > other cgroups may need to wait on that writeback to complete before > doing I/O (for example an fsync() to a file shared among different > cgroups). In this case we may end up blocking cgroups that shouldn't be > blocked, which looks like a priority-inversion problem. This is the > problem that I'm trying to address. Well, this case is a misconfiguration; you shouldn't be sharing files between cgroups. But even if you are, fsync() is synchronous, so we should be getting the context from the process itself, and its own rules should thus be applied. 
There's nothing we can do for outstanding IO, but that shouldn't be that much. That would need to be dealt with on a per-controller basis. > > > > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy > to > clearly tie IO to the thing generating the IO, such as readahead and such. > If > you are running into this case that may be something worth using. Course it > only works for io.latency now but there's no reason you can't add support > to it > for io.max or whatever. > > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with > memcg), something like this: if the cgroup is already congested don't > generate extra I/O due to readahead. Am I right? Yeah, but that's just how it's currently used, it can be used any which way we feel like. > > > > > > > > > = Proposed solution = > > > > > > The main idea of this controller is to split I/O measurement and I/O > > > throttling: I/O is measured at the block layer for READS, at page cache > > > (dirty pages) for WRITEs, and processes are limited while they're > > > generating I/O at the VFS level, based on the measured I/O. > > > > > > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest > > looking into that route and simply changing the existing io controller you > > are > > using to take advantage of that so it will actually throttle things. Then > > just > > sprinkle it around the areas where we indirectly generate IO. Thanks, > > Absolutely, I can probably use blk_cgroup_congested() as a method to > determine when a cgroup should be throttled (instead of doing my own > I/O measuring), but to prevent the "slow writeback slowing down other > cgroups" issue I still need to apply throttling when pages are dirtied > in page cache. Again this is just a fuckup from a configuration standpoint. 
The argument could be made that sync() is probably broken here, but I think the right solution here is to just pass the cgroup context along with the writeback information and use that if it's set instead. Thanks, Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote: > > > > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik > > ha scritto: > > > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > >> This is a redesign of my old cgroup-io-throttle controller: > >> https://lwn.net/Articles/330531/ > >> > >> I'm resuming this old patch to point out a problem that I think is still > >> not solved completely. > >> > >> = Problem = > >> > >> The io.max controller works really well at limiting synchronous I/O > >> (READs), but a lot of I/O requests are initiated outside the context of > >> the process that is ultimately responsible for its creation (e.g., > >> WRITEs). > >> > >> Throttling at the block layer in some cases is too late and we may end > >> up slowing down processes that are not responsible for the I/O that > >> is being processed at that level. > > > > How so? The writeback threads are per-cgroup and have the cgroup stuff set > > properly. So if you dirty a bunch of pages, they are associated with your > > cgroup, and then writeback happens and it's done in the writeback thread > > associated with your cgroup and then that is throttled. Then you are > > throttled > > at balance_dirty_pages() because the writeout is taking longer. > > > > IIUC, Andrea described this problem: certain processes in a certain group > dirty a > lot of pages, causing write back to start. Then some other blameless > process in the same group experiences very high latency, in spite of > the fact that it has to do little I/O. > In that case the io controller isn't doing its job and needs to be fixed (or reconfigured). io.latency guards against this; I assume io.max would keep this from happening if it were configured properly. > Does your blk_cgroup_congested() stuff solve this issue? 
> > Or simply I didn't get what Andrea meant at all :) > I _think_ Andrea is talking about the fact that we can generate IO indirectly and never get throttled for it, which is what blk_cgroup_congested() is meant to address. I added it specifically because some low prio task was just allocating all of the memory on the system and causing a lot of pressure because of swapping, but there was no direct feedback loop there. blk_cgroup_congested() provides that feedback loop. Course I could be wrong too and we're all just talking past each other ;). Thanks, Josef
Re: [RFC PATCH 0/3] cgroup: fsio throttle controller
On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote: > This is a redesign of my old cgroup-io-throttle controller: > https://lwn.net/Articles/330531/ > > I'm resuming this old patch to point out a problem that I think is still > not solved completely. > > = Problem = > > The io.max controller works really well at limiting synchronous I/O > (READs), but a lot of I/O requests are initiated outside the context of > the process that is ultimately responsible for its creation (e.g., > WRITEs). > > Throttling at the block layer in some cases is too late and we may end > up slowing down processes that are not responsible for the I/O that > is being processed at that level. How so? The writeback threads are per-cgroup and have the cgroup stuff set properly. So if you dirty a bunch of pages, they are associated with your cgroup, and then writeback happens and it's done in the writeback thread associated with your cgroup and then that is throttled. Then you are throttled at balance_dirty_pages() because the writeout is taking longer. I introduced the blk_cgroup_congested() stuff for paths that it's not easy to clearly tie IO to the thing generating the IO, such as readahead and such. If you are running into this case that may be something worth using. Course it only works for io.latency now but there's no reason you can't add support to it for io.max or whatever. > > = Proposed solution = > > The main idea of this controller is to split I/O measurement and I/O > throttling: I/O is measured at the block layer for READS, at page cache > (dirty pages) for WRITEs, and processes are limited while they're > generating I/O at the VFS level, based on the measured I/O. > This is what blk_cgroup_congested() is meant to accomplish, I would suggest looking into that route and simply changing the existing io controller you are using to take advantage of that so it will actually throttle things. Then just sprinkle it around the areas where we indirectly generate IO. Thanks, Josef
Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction
On Wed, Jan 09, 2019 at 07:08:09PM +0300, Kirill Tkhai wrote: > Hi, Josef, > > On 09.01.2019 18:49, Josef Bacik wrote: > > On Wed, Jan 09, 2019 at 03:20:18PM +0300, Kirill Tkhai wrote: > >> On nodes without memory overcommit, it's a common situation > >> where memcg exceeds its limit and pages from pagecache are > >> shrunk on reclaim, while the node has a lot of free memory. > >> Further access to the pages requires real device IO, while > >> IO causes time delays, worse power usage, worse throughput > >> for other users of the device, etc. > >> > >> Cleancache is not a good solution for this problem, since > >> it implies copying of page on every cleancache_put_page() > >> and cleancache_get_page(). Also, it requires introduction > >> of internal per-cleancache_ops data structures to manage > >> cached pages and their inodes relationships, which again > >> introduces overhead. > >> > >> This patchset introduces another solution. It introduces > >> a new scheme for evicting memcg pages: > >> > >> 1)__remove_mapping() uncharges unmapped page memcg > >> and leaves page in pagecache on memcg reclaim; > >> > >> 2)putback_lru_page() places page into root_mem_cgroup > >> list, since its memcg is NULL. Page may be evicted > >> on global reclaim (and this will be easy, as > >> the page is not mapped, so the shrinker will shrink it > >> with 100% probability of success); > >> > >> 3)pagecache_get_page() charges page into memcg of > >> a task, which takes it first. > >> > >> Below is a small test, which shows the benefit of the patchset. 
> >> > >> Create memcg with limit 20M (exact value does not matter much): > >> $ mkdir /sys/fs/cgroup/memory/ct > >> $ echo 20M > /sys/fs/cgroup/memory/ct/memory.limit_in_bytes > >> $ echo $$ > /sys/fs/cgroup/memory/ct/tasks > >> > >> Then twice read 1GB file: > >> $ time cat file_1gb > /dev/null > >> > >> Before (2 iterations): > >> 1)0.01user 0.82system 0:11.16elapsed 7%CPU > >> 2)0.01user 0.91system 0:11.16elapsed 8%CPU > >> > >> After (2 iterations): > >> 1)0.01user 0.57system 0:11.31elapsed 5%CPU > >> 2)0.00user 0.28system 0:00.28elapsed 100%CPU > >> > >> With the patch set applied, we have the file pages cached > >> during the second read, so the result is 39 times faster. > >> > >> This may be useful for slow disks, NFS, nodes without > >> overcommit by memory, in case two memcgs access the same > >> files, etc. > >> > > > > This isn't going to work for us (Facebook). The whole reason the hard limit > > exists is to keep different groups from messing up other groups. Page cache > > reclaim is not free; most of our pain and most of the reason we use cgroups > > is to limit the effect of flooding the machine with pagecache from different > > groups. > > I understand the problem. > > > Memory leaks are few and far between, but chef doing a yum > > update in the system container happens regularly. If you talk about > > suddenly > > orphaning these pages to the root container it still creates pressure on the > > main workload, pressure that results in it having to take time from what > > it's > > doing and free up memory instead. > > Could you please clarify the additional pressure which the > patchset introduces? > The number of actions needed to evict a pagecache page remains > almost > the same: we just delay __delete_from_page_cache() to global reclaim. 
Global > reclaim should not introduce much pressure, since it's the iteration on a > single > memcg (we should not dive into hell of children memcg, since root memcg > reclaim > should be successful and free enough pages, shouldn't it?). If we go into global reclaim at all. If we're unable to allocate a page as the most important cgroup we start shrinking ourselves first, right? And then eventually end up in global reclaim, right? So it may be easily enough reclaimed, but we're going to waste a lot of time getting there in the meantime, which means latency that's hard to pin down. And secondly this allows hard-limited cgroups to essentially leak pagecache into the whole system, creating waaay more memory pressure than what I think you intend. Your logic is that we'll exceed our limit, evict some pagecache to the root cgroup, and we avoid an OOM and everything is ok. However what will really happen is some user is going to do dd if=/dev/zero
Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction
On Wed, Jan 09, 2019 at 03:20:18PM +0300, Kirill Tkhai wrote: > On nodes without memory overcommit, it's a common situation > where memcg exceeds its limit and pages from pagecache are > shrunk on reclaim, while the node has a lot of free memory. > Further access to the pages requires real device IO, while > IO causes time delays, worse power usage, worse throughput > for other users of the device, etc. > > Cleancache is not a good solution for this problem, since > it implies copying of page on every cleancache_put_page() > and cleancache_get_page(). Also, it requires introduction > of internal per-cleancache_ops data structures to manage > cached pages and their inodes relationships, which again > introduces overhead. > > This patchset introduces another solution. It introduces > a new scheme for evicting memcg pages: > > 1)__remove_mapping() uncharges unmapped page memcg > and leaves page in pagecache on memcg reclaim; > > 2)putback_lru_page() places page into root_mem_cgroup > list, since its memcg is NULL. Page may be evicted > on global reclaim (and this will be easy, as > the page is not mapped, so the shrinker will shrink it > with 100% probability of success); > > 3)pagecache_get_page() charges page into memcg of > a task, which takes it first. > > Below is a small test, which shows the benefit of the patchset. > > Create memcg with limit 20M (exact value does not matter much): > $ mkdir /sys/fs/cgroup/memory/ct > $ echo 20M > /sys/fs/cgroup/memory/ct/memory.limit_in_bytes > $ echo $$ > /sys/fs/cgroup/memory/ct/tasks > > Then twice read 1GB file: > $ time cat file_1gb > /dev/null > > Before (2 iterations): > 1)0.01user 0.82system 0:11.16elapsed 7%CPU > 2)0.01user 0.91system 0:11.16elapsed 8%CPU > > After (2 iterations): > 1)0.01user 0.57system 0:11.31elapsed 5%CPU > 2)0.00user 0.28system 0:00.28elapsed 100%CPU > > With the patch set applied, we have the file pages cached > during the second read, so the result is 39 times faster. 
> > This may be useful for slow disks, NFS, nodes without > overcommit by memory, in case two memcgs access the same > files, etc. > This isn't going to work for us (Facebook). The whole reason the hard limit exists is to keep different groups from messing up other groups. Page cache reclaim is not free; most of our pain and most of the reason we use cgroups is to limit the effect of flooding the machine with pagecache from different groups. Memory leaks are few and far between, but chef doing a yum update in the system container happens regularly. If you talk about suddenly orphaning these pages to the root container it still creates pressure on the main workload, pressure that results in it having to take time from what it's doing and free up memory instead. Thanks, Josef