Re: [PATCH 0/4] eventfs: Some more minor fixes

2023-11-22 Thread Josef Bacik
On Tue, Nov 21, 2023 at 06:10:03PM -0500, Steven Rostedt wrote:
> Mark Rutland reported some crashes from the latest eventfs updates.
> This fixes most of them.
> 
> He still has one splat that he can trigger but I can not. Still looking
> into that.

Reviewed-by: Josef Bacik 

Thanks,

Josef



Re: [RFC v3 0/2] vfs / btrfs: add support for ustat()

2021-04-15 Thread Josef Bacik

On 4/15/21 1:53 PM, Luis Chamberlain wrote:

On Wed, Aug 23, 2017 at 3:31 PM Jeff Mahoney  wrote:


On 8/15/14 5:29 AM, Al Viro wrote:

On Thu, Aug 14, 2014 at 07:58:56PM -0700, Luis R. Rodriguez wrote:


Christoph had noted that this seemed associated with the problem
that btrfs uses different assignments for st_dev than s_dev,
but much as I'd like to see that changed, based on discussions so
far it's unclear whether this is going to be possible unless a strong
commitment is reached.


Resurrecting a dead thread since we've been carrying this patch anyway
since then.


Explain, please.  Whose commitment and commitment to what, exactly?
Having different ->st_dev values for different files on the same
fs is a bloody bad idea; why does btrfs do that at all?  If nothing else,
it breaks the usual "are those two files on the same fs?" tests...
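
[Editorial illustration: a minimal userspace sketch (not from the thread) of
the usual "are those two files on the same fs?" test that per-subvolume
st_dev values defeat; the two paths are hypothetical.]

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat a, b;

	/* On btrfs, a file and its copy inside a snapshot/subvolume report
	 * different st_dev values even though they live on one mounted
	 * filesystem, so this classic check gives a misleading answer. */
	if (stat("/mnt/btrfs/file", &a) || stat("/mnt/btrfs/snap/file", &b))
		return 1;

	printf("same fs? %s (st_dev %llu vs %llu)\n",
	       a.st_dev == b.st_dev ? "yes" : "no",
	       (unsigned long long)a.st_dev, (unsigned long long)b.st_dev);
	return 0;
}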


It's because btrfs snapshots would have inode number collisions.
Changing the inode numbers for snapshots would negate a big benefit of
btrfs snapshots: the quick creation and lightweight on-disk
representation due to metadata sharing.

The thing is that ustat() used to work.  Your commit 0ee5dc676a5f8
(btrfs: kill magical embedded struct superblock) had a regression:
Since it replaced the superblock with a simple dev_t, it rendered the
device no longer discoverable by user_get_super.  We need a list_head to
attach for searching.

There's an argument that this is hacky.  It's valid.  The only other
feedback I've heard is to use a real superblock for subvolumes to do
this instead.  That doesn't work either, due to things like freeze/thaw
and inode writeback.  Ultimately, what we need is a single file system
with multiple namespaces.  Years ago we just needed different inode
namespaces, but as people have started adopting btrfs for containers, we
need more than that.  I've heard requests for per-subvolume security
contexts.  I'd imagine user namespaces are on someone's wish list.  A
working df can be done with ->d_automount, but the way btrfs handles
having a "canonical" subvolume location has always been a way to avoid
directory loops.  I'd like to just automount subvolumes everywhere
they're referenced.  One solution, for which I have no code yet, is to
have something like a superblock-light that we can hang things like a
security context, a user namespace, and an anonymous dev.  Most file
systems would have just one.  Btrfs would have one per subvolume.

That's a big project with a bunch of discussion.


4 years have gone by and this patch is still being carried around for
btrfs. Other than resolving this ustat() issue for btrfs, are there new
reasons to support doing this effort properly? Are there
other filesystems that would benefit? I'd like to get an idea of the
stakeholders here before considering whether to take this on.



Not really sure why this needs to be addressed; we have statfs(), and what we 
have has worked forever now.  There are a lot of larger things that need to be 
addressed in general to support the volume approach inside file systems, and that 
is going to require a lot of work inside of VFS.  If you feel like tackling that 
work and then wiring up btrfs, by all means have at it, but I'm not seeing an 
urgent need to address this.  Thanks,


Josef
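
[Editorial illustration: a minimal sketch (not from the thread) of the
statfs(2) interface Josef points to as the long-standing alternative to
ustat(2); the mount point path is hypothetical.]

#include <stdio.h>
#include <sys/statfs.h>

int main(void)
{
	struct statfs sfs;

	/* statfs() takes a path rather than a dev_t, so it does not depend
	 * on the filesystem being discoverable by device number. */
	if (statfs("/mnt/btrfs", &sfs))
		return 1;

	printf("block size %ld, blocks %llu, free %llu\n",
	       (long)sfs.f_bsize,
	       (unsigned long long)sfs.f_blocks,
	       (unsigned long long)sfs.f_bfree);
	return 0;
}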



Re: the qemu-nbd process automatically exit with the commit 43347d56c 'livepatch: send a fake signal to all blocking tasks'

2021-04-14 Thread Josef Bacik

On 4/14/21 11:21 AM, xiaojun.zhao...@gmail.com wrote:

On Wed, 14 Apr 2021 13:27:43 +0200 (CEST)
Miroslav Benes  wrote:


Hi,

On Wed, 14 Apr 2021, xiaojun.zhao...@gmail.com wrote:


I found the qemu-nbd process(started with qemu-nbd -t -c /dev/nbd0
nbd.qcow2) will automatically exit when I patched for functions of
the nbd with livepatch.

The relevant nbd source:
static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *bdev)
{
	struct nbd_config *config = nbd->config;
	int ret;

	ret = nbd_start_device(nbd);
	if (ret)
		return ret;

	if (max_part)
		bdev->bd_invalidated = 1;
	mutex_unlock(&nbd->config_lock);
	ret = wait_event_interruptible(config->recv_wq,
				       atomic_read(&config->recv_threads) == 0);
	if (ret)
		sock_shutdown(nbd);
	flush_workqueue(nbd->recv_workq);

	mutex_lock(&nbd->config_lock);
	nbd_bdev_reset(bdev);
	/* user requested, ignore socket errors */
	if (test_bit(NBD_RT_DISCONNECT_REQUESTED, &config->runtime_flags))
		ret = 0;
	if (test_bit(NBD_RT_TIMEDOUT, &config->runtime_flags))
		ret = -ETIMEDOUT;
	return ret;
}


So my understanding is that nbd spawns a number
(config->recv_threads) of workqueue jobs and then waits for them to
finish. It waits interruptibly. Now, any signal would make
wait_event_interruptible() return -ERESTARTSYS. The livepatch fake
signal is no exception there. The error is then propagated back to
the userspace, unless a user requested a disconnection or there is a
timeout set. How does the userspace then react to it? Is
_interruptible there because the userspace sends a signal in case
NBD_RT_DISCONNECT_REQUESTED is set? How does the userspace handle
ordinary signals? This all sounds a bit strange, but I may be missing
something easily.


When nbd waits for atomic_read(&config->recv_threads) == 0, the
klp core will send a fake signal to it and then the qemu-nbd process
exits. The sysfs signal attribute to control this action was removed
in commit 10b3d52790e 'livepatch: Remove signal sysfs attribute'. Are
there other ways to control this action? How?


No, there is no way currently. We send a fake signal automatically.

Regards
Miroslav

I get an IO error on the nbd device when I use a livepatch on the
nbd code, and I guess that a livepatch on other kernel code may cause
the same IO error. For now I have decided to work around this problem
by adding a livepatch for the klp core that disables the automatic
fake signal.



Would wait_event_killable() fix this problem?  I'm not sure any client 
implementations depend on being able to send other signals to the client 
process, so it should be safe from that standpoint.  Not sure if the livepatch 
thing would still get an error at that point, though.  Thanks,


Josef
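
[Editorial illustration: a rough sketch of the change Josef floats above,
against the nbd_start_device_ioctl() wait quoted earlier in the thread;
untested, shown only to make the suggestion concrete.]

	/* wait_event_killable() only returns early for fatal signals, so an
	 * ordinary (or livepatch fake) signal should no longer abort the
	 * wait; whether the livepatch core is satisfied by that is the open
	 * question above. */
	ret = wait_event_killable(config->recv_wq,
				  atomic_read(&config->recv_threads) == 0);
	if (ret)
		sock_shutdown(nbd);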


Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"

2021-03-18 Thread Josef Bacik

On 3/18/21 1:42 AM, Kai-Heng Feng wrote:

On Thu, Mar 18, 2021 at 1:25 AM Josef Bacik  wrote:
[snipped]

"shutdown now" works fine with and without your patch.  Thanks,


Rafael,
Please revert the patch while we are working on it.

Josef,
Can you please test the following patch:



That made it work fine.  Thanks,

Josef


Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"

2021-03-17 Thread Josef Bacik

On 3/17/21 12:14 PM, Kai-Heng Feng wrote:

On Wed, Mar 17, 2021 at 11:19 PM Josef Bacik  wrote:


On 3/16/21 10:50 PM, Kai-Heng Feng wrote:

Hi,

On Wed, Mar 17, 2021 at 10:17 AM Josef Bacik  wrote:


This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

This patch causes a panic when rebooting my Dell Poweredge r440.  I do
not have the full panic log as it's lost at that stage of the reboot and
I do not have a serial console.  Reverting this patch makes my system
able to reboot again.


But this patch also helps many HP laptops, so maybe we should figure
out what's going on on Poweredge r440.
Does it also panic on shutdown?



Sure I'll test whatever to get it fixed, but I just wasted 3 days bisecting and
lost a weekend of performance testing on btrfs because of this regression, so
until you figure out how it broke it needs to be reverted so people don't have
to figure out why reboot suddenly isn't working.


That's unfortunate to hear. However, I've been spending tons of time
on bisecting kernels. To me it's just a normal part of kernel
development so I won't call it "wasted".

Feel free to revert the patch though.



Running "halt" has the same effect with and without your patch, it gets to
"system halted" and just sits there without powering off.  Not entirely sure why
that is, but there's no panic.


What about shutdown? pm_power_off_prepare() is used by shutdown but
it's not used by halt.


"shutdown now" works fine with and without your patch.  Thanks,

Josef


Re: [PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"

2021-03-17 Thread Josef Bacik

On 3/16/21 10:50 PM, Kai-Heng Feng wrote:

Hi,

On Wed, Mar 17, 2021 at 10:17 AM Josef Bacik  wrote:


This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

This patch causes a panic when rebooting my Dell Poweredge r440.  I do
not have the full panic log as it's lost at that stage of the reboot and
I do not have a serial console.  Reverting this patch makes my system
able to reboot again.


But this patch also helps many HP laptops, so maybe we should figure
out what's going on on Poweredge r440.
Does it also panic on shutdown?



Sure I'll test whatever to get it fixed, but I just wasted 3 days bisecting and 
lost a weekend of performance testing on btrfs because of this regression, so 
until you figure out how it broke it needs to be reverted so people don't have 
to figure out why reboot suddenly isn't working.


Running "halt" has the same effect with and without your patch, it gets to 
"system halted" and just sits there without powering off.  Not entirely sure why 
that is, but there's no panic.


The panic itself is lost, but I see there's an NMI and I have the RIP

(gdb) list *('mwait_idle_with_hints.constprop.0'+0x4b)
0x816dabdb is in mwait_idle_with_hints 
(./arch/x86/include/asm/current.h:15).

10
11  DECLARE_PER_CPU(struct task_struct *, current_task);
12
13  static __always_inline struct task_struct *get_current(void)
14  {
15  return this_cpu_read_stable(current_task);
16  }
17
18  #define current get_current()
19

   :	jmp    0x936dac02
   :	nopl   (%rax)
   :	jmp    0x936dabac
   :	nopl   (%rax)
   :	mfence
   :	mov    %gs:0x17bc0,%rax
   :	clflush (%rax)
   :	mfence
   :	xor    %edx,%edx
   :	mov    %rdx,%rcx
   :	mov    %gs:0x17bc0,%rax
   :	monitor %rax,%rcx,%rdx
   :	mov    (%rax),%rax
   :	test   $0x8,%al
   :	jne    0x936dabdb
   :	jmpq   0x936dabd0
   :	verw   0x9f9fec(%rip)	# 0x940d4bbc
   :	mov    $0x1,%ecx
   :	mov    %rdi,%rax
   :	mwait  %rax,%rcx
   :	mov    %gs:0x17bc0,%rax
   :	lock andb $0xdf,0x2(%rax)
   :	lock addl $0x0,-0x4(%rsp)
   :	mov    (%rax),%rax
   :	test   $0x8,%al
   :	je     0x936dac01
   :	andl   $0x7fff,%gs:0x6c93cf7f(%rip)	# 0x17b80
   :	retq
   :	mov    %gs:0x17bc0,%rax
   :	lock orb $0x20,0x2(%rax)
   :	mov    (%rax),%rax
   :	test   $0x8,%al
   :	jne    0x936dabdb
   :	jmpq   0x936dab95
   :	nopl   0x0(%rax)

0x4b is after the mwait, which means we're panicking in the 
current_clr_polling(), where we do clear_thread_flag(TIF_POLLING_NRFLAG).  Thanks,


Josef


[PATCH][RESEND] Revert "PM: ACPI: reboot: Use S5 for reboot"

2021-03-16 Thread Josef Bacik
This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

This patch causes a panic when rebooting my Dell Poweredge r440.  I do
not have the full panic log as it's lost at that stage of the reboot and
I do not have a serial console.  Reverting this patch makes my system
able to reboot again.

Signed-off-by: Josef Bacik 
---
- apologies, I mistyped the lkml list email.

 kernel/reboot.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/reboot.c b/kernel/reboot.c
index eb1b15850761..a6ad5eb2fa73 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -244,8 +244,6 @@ void migrate_to_reboot_cpu(void)
 void kernel_restart(char *cmd)
 {
kernel_restart_prepare(cmd);
-   if (pm_power_off_prepare)
-   pm_power_off_prepare();
migrate_to_reboot_cpu();
syscore_shutdown();
if (!cmd)
-- 
2.26.2



Re: [PATCH v4 02/28] mm: Add an unlock function for PG_private_2/PG_fscache

2021-03-16 Thread Josef Bacik

On 3/16/21 8:43 PM, Linus Torvalds wrote:

[ Adding btrfs people explicitly, maybe they see this on the fs-devel
list, but maybe they don't react .. ]

On Tue, Mar 16, 2021 at 12:07 PM Matthew Wilcox  wrote:


This isn't a problem with this patch per se, but I'm concerned about
private2 and expected page refcounts.


Ugh. You are very right.

It would be good to just change the rules - I get the feeling nobody
actually depended on them anyway because they were _so_ esoteric.


static inline int is_page_cache_freeable(struct page *page)
{
	/*
	 * A freeable page cache page is referenced only by the caller
	 * that isolated the page, the page cache and optional buffer
	 * heads at page->private.
	 */
	int page_cache_pins = thp_nr_pages(page);
	return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}


You're right, that "page_has_private()" is really really nasty.

The comment is, I think, the traditional usage case, which used to be
about page->buffers. Obviously these days it is now about
page->private with PG_private set, pointing to buffers
(attach_page_private() and detach_page_private()).

But as you point out:


#define PAGE_FLAGS_PRIVATE  \
 (1UL << PG_private | 1UL << PG_private_2)

So ... a page with both flags cleared should have a refcount of N.
A page with one or both flags set should have a refcount of N+1.


Could we just remove the PG_private_2 thing in this context entirely,
and make the rule be that

  (a) PG_private means that you have some local private data in
page->private, and that's all that matters for the "freeable" thing.

  (b) PG_private_2 does *not* have the same meaning, and has no bearing
on freeability (and only the refcount matters)

I _think_ the btrfs behavior is to only use PagePrivate2() when it
has a reference to the page, so btrfs doesn't care?

I think fscache is already happy to take the page count when using
PG_private_2 for locking, exactly because I didn't want to have any
confusion about lifetimes. But this "page_has_private()" math ends up
meaning it's confusing anyway.

btrfs people? What are the semantics for PG_private_2? Is it just a
flag, and you really don't want it to have anything to do with any
page lifetime decisions? Or?



Yeah it's just a flag, we use it to tell that the page is part of a range that 
has been allocated for IO.  The lifetime of the flag is independent of the page 
itself, but while it is set the page is generally either dirty or under 
writeback, so either the page goes through truncate and we clear PagePrivate2 
there, or it actually goes through IO and the flag is cleared before we drop the 
page in our endio.  We _always_ have PG_private set on the page as long as we 
own it, and PG_private_2 is only set in this IO related context, so we're safe 
there because of the rules around PG_dirty/PG_writeback.  We don't need it to 
have an extra ref for it being set. 
 Thanks,


Josef



Re: KASAN: use-after-free Read in nbd_genl_connect

2021-02-22 Thread Josef Bacik

On 2/22/21 3:25 AM, syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:f40ddce8 Linux 5.11
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=179e8d22d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=e53d04227c52a0df
dashboard link: https://syzkaller.appspot.com/bug?extid=429d3f82d757c211bff3
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10d190cad0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13dc8a82d0

Bisection is inconclusive: the issue happens on the oldest tested release.

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1128ae60d0
final oops: https://syzkaller.appspot.com/x/report.txt?x=1328ae60d0
console output: https://syzkaller.appspot.com/x/log.txt?x=1528ae60d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+429d3f82d757c211b...@syzkaller.appspotmail.com



#syz test:  git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
nbd-kasan-fix


Re: KASAN: use-after-free Read in nbd_genl_connect

2021-02-22 Thread Josef Bacik

On 2/22/21 4:34 AM, Hillf Danton wrote:

Mon, 22 Feb 2021 00:25:18 -0800

Hello,

syzbot found the following issue on:

HEAD commit:f40ddce8 Linux 5.11
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=179e8d22d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=e53d04227c52a0df
dashboard link: https://syzkaller.appspot.com/bug?extid=429d3f82d757c211bff3
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10d190cad0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13dc8a82d0

Bisection is inconclusive: the issue happens on the oldest tested release.

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1128ae60d0
final oops: https://syzkaller.appspot.com/x/report.txt?x=1328ae60d0
console output: https://syzkaller.appspot.com/x/log.txt?x=1528ae60d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+429d3f82d757c211b...@syzkaller.appspotmail.com

==
BUG: KASAN: use-after-free in instrument_atomic_read 
include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_read 
include/asm-generic/atomic-instrumented.h:27 [inline]
BUG: KASAN: use-after-free in refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76
Read of size 4 at addr 888014ca19a0 by task syz-executor980/8540

CPU: 0 PID: 8540 Comm: syz-executor980 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x107/0x163 lib/dump_stack.c:120
  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
  __kasan_report mm/kasan/report.c:396 [inline]
  kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413
  check_memory_region_inline mm/kasan/generic.c:179 [inline]
  check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
  instrument_atomic_read include/linux/instrumented.h:71 [inline]
  atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
  refcount_dec_not_one+0x71/0x1e0 lib/refcount.c:76
  refcount_dec_and_mutex_lock+0x19/0x140 lib/refcount.c:115
  nbd_put drivers/block/nbd.c:248 [inline]


This put is unbalanced, given line 1978 below.


  nbd_genl_connect+0xee7/0x1560 drivers/block/nbd.c:1980
  genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
  genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
  genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
  netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
  genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
  netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
  netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
  netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919
  sock_sendmsg_nosec net/socket.c:652 [inline]
  sock_sendmsg+0xcf/0x120 net/socket.c:672
  sys_sendmsg+0x6e8/0x810 net/socket.c:2345
  ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
  __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440709
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 
48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 
c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7ffd63e9cc88 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: e3fb RCX: 00440709
RDX:  RSI: 2340 RDI: 0003
RBP:  R08:  R09: 7ffd63e9ce28
R10:  R11: 0246 R12: 7ffd63e9cc9c
R13: 431bde82d7b634db R14: 004ae018 R15: 004004a0

Allocated by task 8536:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
  kasan_set_track mm/kasan/common.c:46 [inline]
  set_alloc_info mm/kasan/common.c:401 [inline]
  kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429
  kmalloc include/linux/slab.h:552 [inline]
  kzalloc include/linux/slab.h:682 [inline]
  nbd_dev_add+0x44/0x8e0 drivers/block/nbd.c:1673
  nbd_genl_connect+0x59c/0x1560 drivers/block/nbd.c:1838
  genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
  genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
  genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
  netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
  genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
  netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
  netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
  netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919
  sock_sendmsg_nosec net/socket.c:652 [inline]
  sock_sendmsg+0xcf/0x120 net/socket.c:672
  sys_sendmsg+0x6e8/0x810 net/socket.c:2345
  ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
  __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  

Re: [PATCH v6 0/2] fix a NULL pointer bug and simplify the code

2021-02-18 Thread Josef Bacik

On 2/18/21 7:26 AM, Sun Ke wrote:

fix a NULL pointer bug and simplify the code

v6: Just add if (nbd->recv_workq) to nbd_disconnect_and_put().
v5: Adjust the title and add “Suggested-by”.
v4: Share exception handling code for if branches and
move put_nbd adjustment to a separate patch.
v3: Do not use unlock and add put_nbd.
v2: Use jump target unlock.

Sun Ke (2):
   nbd: Fix NULL pointer in flush_workqueue
   nbd: share nbd_put and return by goto put_nbd

  drivers/block/nbd.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)



You can add

Reviewed-by: Josef Bacik 

to both of these, thanks,

Josef


Re: [PATCH v5 1/2] nbd: Fix NULL pointer in flush_workqueue

2021-02-10 Thread Josef Bacik

On 2/5/21 1:56 AM, Sun Ke wrote:

Open /dev/nbdX first, the config_refs will be 1 and
the pointers in nbd_device are still null. Disconnect
/dev/nbdX, then reference a null recv_workq. The
protection by config_refs in nbd_genl_disconnect is useless.

[  656.366194] BUG: kernel NULL pointer dereference, address: 0020
[  656.368943] #PF: supervisor write access in kernel mode
[  656.369844] #PF: error_code(0x0002) - not-present page
[  656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0
[  656.371693] Oops: 0002 [#1] SMP
[  656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 
5.11.0-rc5-00040-g76c057c84d28 #1
[  656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  656.375904] RIP: 0010:mutex_lock+0x29/0x60
[  656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 
ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00  48 0f b1 55 
d
[  656.378934] RSP: 0018:c95eb9b0 EFLAGS: 00010246
[  656.379350] RAX:  RBX:  RCX: 
[  656.379915] RDX: 888104cf2600 RSI: aae8f452 RDI: 0020
[  656.380473] RBP: 0020 R08:  R09: 88813bd6b318
[  656.381039] R10: 00c7 R11: fefefefefefefeff R12: 888102710b40
[  656.381599] R13: c95eb9e0 R14: b2930680 R15: 88810770ef00
[  656.382166] FS:  7fdf117ebb40() GS:88813bd4() 
knlGS:
[  656.382806] CS:  0010 DS:  ES:  CR0: 80050033
[  656.383261] CR2: 0020 CR3: 000100c84000 CR4: 06e0
[  656.383819] DR0:  DR1:  DR2: 
[  656.384370] DR3:  DR6: fffe0ff0 DR7: 0400
[  656.384927] Call Trace:
[  656.385111]  flush_workqueue+0x92/0x6c0
[  656.385395]  nbd_disconnect_and_put+0x81/0xd0
[  656.385716]  nbd_genl_disconnect+0x125/0x2a0
[  656.386034]  genl_family_rcv_msg_doit.isra.0+0x102/0x1b0
[  656.386422]  genl_rcv_msg+0xfc/0x2b0
[  656.386685]  ? nbd_ioctl+0x490/0x490
[  656.386954]  ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0
[  656.387354]  netlink_rcv_skb+0x62/0x180
[  656.387638]  genl_rcv+0x34/0x60
[  656.387874]  netlink_unicast+0x26d/0x590
[  656.388162]  netlink_sendmsg+0x398/0x6c0
[  656.388451]  ? netlink_rcv_skb+0x180/0x180
[  656.388750]  sys_sendmsg+0x1da/0x320
[  656.389038]  ? sys_recvmsg+0x130/0x220
[  656.389334]  ___sys_sendmsg+0x8e/0xf0
[  656.389605]  ? ___sys_recvmsg+0xa2/0xf0
[  656.389889]  ? handle_mm_fault+0x1671/0x21d0
[  656.390201]  __sys_sendmsg+0x6d/0xe0
[  656.390464]  __x64_sys_sendmsg+0x23/0x30
[  656.390751]  do_syscall_64+0x45/0x70
[  656.391017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
Suggested-by: Markus Elfring 
Signed-off-by: Sun Ke 
---
v4: Share exception handling code for if branches
v3: Do not use unlock and add put_nbd.
v2: Use jump target unlock.
---
  drivers/block/nbd.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e6ea5d344f87..3c9b3bf3f4c2 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -2014,17 +2014,20 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 	mutex_lock(&nbd_index_mutex);
 	nbd = idr_find(&nbd_index_idr, index);
 	if (!nbd) {
-		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
 		       index);
-		return -EINVAL;
+		goto unlock_index;
 	}
-	if (!refcount_inc_not_zero(&nbd->refs)) {
-		mutex_unlock(&nbd_index_mutex);
+	mutex_lock(&nbd->config_lock);
+	if (!refcount_inc_not_zero(&nbd->refs) || !nbd->recv_workq) {


We can't safely take the ->config_lock if we don't have a ref.  Just add

if (nbd->recv_workq)
flush_workqueue(nbd->recv_workq);

to nbd_disconnect_and_put().  Problem solved, we can't drop it until we drop our 
last config ref, and we're holding a config ref here.  We could probably add it 
in the meantime, but at this point we've disconnected all of our sockets so it 
doesn't matter.  Thanks,


Josef


Re: [PATCH] nbd: Convert to DEFINE_SHOW_ATTRIBUTE

2021-02-10 Thread Josef Bacik

On 2/6/21 2:10 AM, winnd...@163.com wrote:

From: Liao Pingfang 

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liao Pingfang 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice

2021-01-29 Thread Josef Bacik

On 1/27/21 8:57 AM, Michal Rostecki wrote:

From: Michal Rostecki 

Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.

This change fixes that by passing the extent mapping to the
btrfs_get_io_geometry() function as an argument.

v2:
When btrfs_get_chunk_map() returns an error in btrfs_submit_direct():
- Use errno_to_blk_status(PTR_ERR(em)) as the status
- Set em to NULL

Signed-off-by: Michal Rostecki 


This panicked all of my test VMs in their overnight xfstests runs; the panic is 
this:

[ 2449.936502] BTRFS critical (device dm-7): mapping failed logical 1113825280 
bio len 40960 len 24576

[ 2449.937073] [ cut here ]
[ 2449.937329] kernel BUG at fs/btrfs/volumes.c:6450!
[ 2449.937604] invalid opcode:  [#1] SMP NOPTI
[ 2449.937855] CPU: 0 PID: 259045 Comm: kworker/u5:0 Not tainted 5.11.0-rc5+ 
#122
[ 2449.938252] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.13.0-2.fc32 04/01/2014

[ 2449.938713] Workqueue: btrfs-worker-high btrfs_work_helper
[ 2449.939016] RIP: 0010:btrfs_map_bio.cold+0x5a/0x5c
[ 2449.939392] Code: 37 87 ff ff e8 ed d4 8a ff 48 83 c4 18 e9 b5 52 8b ff 49 89 
c8 4c 89 fa 4c 89 f1 48 c7 c6 b0 c0 61 8b 48 89 ef e8 11 87 ff ff <0f> 0b 4c 89 
e7 e8 42 09 86 ff e9 fd 59 8b ff 49 8b 7a 50 44 89 f2

[ 2449.940402] RSP: :9f24c1637d90 EFLAGS: 00010282
[ 2449.940689] RAX: 0057 RBX: 90c78ff716b8 RCX: 
[ 2449.941080] RDX: 90c7fbc27ae0 RSI: 90c7fbc19110 RDI: 90c7fbc19110
[ 2449.941467] RBP: 90c7911d4000 R08:  R09: 
[ 2449.941853] R10: 9f24c1637b48 R11: 8b9723e8 R12: 
[ 2449.942243] R13:  R14: a000 R15: 4263a000
[ 2449.942632] FS:  () GS:90c7fbc0() 
knlGS:

[ 2449.943072] CS:  0010 DS:  ES:  CR0: 80050033
[ 2449.943386] CR2: 5575163c3080 CR3: 00010ad6c004 CR4: 00370ef0
[ 2449.943772] Call Trace:
[ 2449.943915]  ? lock_release+0x1c3/0x290
[ 2449.944135]  run_one_async_done+0x3a/0x60
[ 2449.944360]  btrfs_work_helper+0x136/0x520
[ 2449.944588]  process_one_work+0x26e/0x570
[ 2449.944812]  worker_thread+0x55/0x3c0
[ 2449.945016]  ? process_one_work+0x570/0x570
[ 2449.945250]  kthread+0x137/0x150
[ 2449.945430]  ? __kthread_bind_mask+0x60/0x60
[ 2449.945666]  ret_from_fork+0x1f/0x30

it happens when you run btrfs/060.  Please make sure to run xfstests against 
patches before you submit them upstream.  Thanks,


Josef


Re: [PATCH] btrfs: Simplify the calculation of variables

2021-01-27 Thread Josef Bacik

On 1/27/21 3:11 AM, Abaci Team wrote:

Fix the following coccicheck warnings:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot 
Suggested-by: Jiapeng Zhong 
Signed-off-by: Abaci Team 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [RFC PATCH 00/37] block: introduce bio_init_fields()

2021-01-19 Thread Josef Bacik

On 1/19/21 12:05 AM, Chaitanya Kulkarni wrote:

Hi,

This is a *compile only RFC* which adds a generic helper to initialize
the various fields of the bio, something that is repeated all over the
place in file-systems, the block layer, and drivers.

The new helper allows callers to initialize various members such as
bdev, sector, private, end io callback, io priority, and write hints.

The objective of this RFC is only to start a discussion; it is not
completely tested at all.


It would help to know what you're trying to accomplish here.  I'd echo Mike's 
comments about how it makes it annoying to update things in the future.  In 
addition, there's so many fields that I'm not going to remember what each one is 
without having to look it up, which makes it annoying to use and to review.  If 
it's simply to make sure fields are initialized then you could add debug sanity 
checks to submit_bio().  If it's to clean up duplication, well I'd argue that 
the duplication is much clearer than positional arguments in a giant function 
call.  If you are wanting to change a particular part of the bio to be 
initialized properly, like Dennis's work to make sure the bi_blkg was 
initialized at bi_bdev set time, then a more targeted patch series with a 
specific intent will be more useful and more successful.  Thanks,


Josef
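
[Editorial illustration: a purely hypothetical sketch, not actual kernel code,
of the kind of debug-only sanity check Josef suggests could live near
submit_bio() instead of a positional init helper; names and checks are made up.]

	/* Warn once if a bio is submitted without the fields every caller is
	 * expected to have set up; purely illustrative. */
	static inline void bio_debug_sanity_check(struct bio *bio)
	{
		WARN_ON_ONCE(!bio->bi_bdev);	/* no target device set */
		WARN_ON_ONCE(bio_sectors(bio) && !bio->bi_end_io); /* data but no completion */
	}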


Re: [PATCH] nbd: Respect max_part for all partition scans

2020-12-17 Thread Josef Bacik

On 12/17/20 3:58 AM, Josh Triplett wrote:

The creation path of the NBD device respects max_part and only scans for
partitions if max_part is not 0. However, some other code paths ignore
max_part, and unconditionally scan for partitions. Add a check for
max_part on each partition scan.

Signed-off-by: Josh Triplett 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

2020-12-10 Thread Josef Bacik

On 12/9/20 10:38 PM, Bart Van Assche wrote:

On 12/7/20 10:55 AM, Palmer Dabbelt wrote:

All in all, I've found it a bit hard to figure out what sort of interest
people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit.  I'm sending it out in this early state to try and
figure
out if it's interesting enough to keep going.


Cc-ing Josef and Mike since their nbd contributions make me wonder
whether this new driver could be useful to their use cases?



Sorry gmail+imap sucks and I can't get my email client to get at the original 
thread.  However here is my take.


1) The advantages of using dm-user over NBD that you listed aren't actually 
problems for NBD.  We have NBD working in production where you can hand off the 
sockets for the server without ending in timeouts; it was actually the main 
reason we wrote our own server, so we could use the FD transfer stuff to restart 
the server without impacting any clients that had the device in use.


2) The extra copy is a big deal, in fact we already have too many copies in our 
existing NBD setup and are actively looking for ways to avoid those.


Don't take this as I don't think dm-user is a good idea, but I think at the very 
least it should start with the very best we have to offer, starting with as few 
copies as possible.


If you are using it currently in production then cool, there's clearly a usecase 
for it.  Personally as I get older and grouchier I want less things in the 
kernel, so if this enables us to eventually do everything NBD related in 
userspace with no performance drop then I'd be down.  I don't think you need to 
make that your primary goal, but at least polishing this up so it could 
potentially be abused in the future would make it more compelling for merging. 
Thanks,


Josef


[LSFMMBPF 2021] A status update

2020-12-04 Thread Josef Bacik

Hello,

We on the program committee hope everybody has been able to stay safe and 
healthy during this challenging time, and look forward to being able to see all 
of you in person again when it is safe.


The current plans for LSFMMBPF 2021 are to schedule an in person conference in 
H2 (after June) of 2021.  The tentative plan is to use the same hotel that we 
had planned to use for 2020, as we still have contracts with them.  However 
clearly that is not set in stone.  The Linux Foundation has done a wonderful job 
of working with us to formulate a plan and figure out the logistics that will 
work the best for everybody, I really can't thank them enough for their help.


Once we have a finalized date we will redo the CFP emails, probably coming out 
March time frame.  If you have any questions or concerns please feel free to 
respond to this email, or email me or any of the other PC members privately and 
we will do our best to answer your questions.  Rest assured the general timing 
of the conference is going to take into account the wide variety of schedules 
that we are dealing with, and we will do our best to come up with something that 
works for as many as people as possible.


We hope that you and your families continue to stay safe and healthy.  Thank you 
on behalf of the program committee:


Josef Bacik (Filesystems)
Amir Goldstein (Filesystems)
Martin K. Petersen (Storage)
Omar Sandoval (Storage)
Michal Hocko (MM)
Dan Williams (MM)
Alexei Starovoitov (BPF)
Daniel Borkmann (BPF)


Re: [GIT PULL][PATCH v5 0/9] Update to zstd-1.4.6

2020-11-06 Thread Josef Bacik

On 11/3/20 1:05 AM, Nick Terrell wrote:

From: Nick Terrell 

Please pull from

   g...@github.com:terrelln/linux.git tags/v5-zstd-1.4.6

to get these changes. Alternatively the patchset is included.



Where did we come down on the code formatting question?  Personally I'm of the 
mind that, as long as the consumers themselves adhere to the proper coding style, 
I'm fine with not maintaining the code style in the imported code, given that we 
get the benefit of easily syncing in code from the upstream project.  Thanks,


Josef


Re: [PATCH v5 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-11-06 Thread Josef Bacik

On 11/3/20 1:05 AM, Nick Terrell wrote:

From: Nick Terrell 

Move away from the compatibility wrapper to the zstd-1.4.6 API. This
code is functionally equivalent.

Signed-off-by: Nick Terrell 
---
  fs/btrfs/zstd.c | 48 
  1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index a7367ff573d4..6b466e090cd7 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -16,7 +16,7 @@
  #include 
  #include 
  #include 
-#include 
+#include 
  #include "misc.h"
  #include "compression.h"
  #include "ctree.h"
@@ -159,8 +159,8 @@ static void zstd_calc_ws_mem_sizes(void)
zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT);
size_t level_size =
max_t(size_t,
- ZSTD_CStreamWorkspaceBound(params.cParams),
- ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
+ 
ZSTD_estimateCStreamSize_usingCParams(params.cParams),
+ ZSTD_estimateDStreamSize(ZSTD_BTRFS_MAX_INPUT));
  
  		max_size = max_t(size_t, max_size, level_size);

zstd_ws_mem_sizes[level - 1] = max_size;
@@ -389,13 +389,23 @@ int zstd_compress_pages(struct list_head *ws, struct 
address_space *mapping,
*total_in = 0;
  
  	/* Initialize the stream */

-   stream = ZSTD_initCStream(params, len, workspace->mem,
-   workspace->size);
+   stream = ZSTD_initStaticCStream(workspace->mem, workspace->size);
if (!stream) {
-   pr_warn("BTRFS: ZSTD_initCStream failed\n");
+   pr_warn("BTRFS: ZSTD_initStaticCStream failed\n");
ret = -EIO;
goto out;
}
+   {
+   size_t ret2;
+
+   ret2 = ZSTD_initCStream_advanced(stream, NULL, 0, params, len);
+   if (ZSTD_isError(ret2)) {
+   pr_warn("BTRFS: ZSTD_initCStream_advanced returned 
%s\n",
+   ZSTD_getErrorName(ret2));
+   ret = -EIO;
+   goto out;
+   }
+   }


Please don't do this, you can just add size_t ret2 at the top and not put this 
in a block.  Other than that the code looks fine, once you fix that you can add


Reviewed-by: Josef Bacik 

Thanks,

Josef
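
[Editorial illustration: a sketch of the restructuring Josef asks for --
declaring ret2 alongside the other locals instead of opening a new block;
based only on the hunk quoted above, untested.]

	size_t ret2;	/* declared with the function's other locals */
	...
	stream = ZSTD_initStaticCStream(workspace->mem, workspace->size);
	if (!stream) {
		pr_warn("BTRFS: ZSTD_initStaticCStream failed\n");
		ret = -EIO;
		goto out;
	}

	ret2 = ZSTD_initCStream_advanced(stream, NULL, 0, params, len);
	if (ZSTD_isError(ret2)) {
		pr_warn("BTRFS: ZSTD_initCStream_advanced returned %s\n",
			ZSTD_getErrorName(ret2));
		ret = -EIO;
		goto out;
	}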


Re: [btrfs] 96bed17ad9: fio.write_iops -59.7% regression

2020-11-04 Thread Josef Bacik

On 11/4/20 1:16 AM, kernel test robot wrote:

Greeting,

FYI, we noticed a -59.7% regression of fio.write_iops due to commit:


commit: 96bed17ad9d425ff6958a2e6f87179453a3d76f2 ("btrfs: simplify the logic in 
need_preemptive_flushing")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master


in testcase: fio-basic
on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory
with following parameters:

disk: 1SSD
fs: btrfs
runtime: 300s
nr_task: 8
rw: write
bs: 4k
ioengine: sync
test_size: 256g
cpufreq_governor: performance
ucode: 0x4002f01

test-description: Fio is a tool that will spawn a number of threads or 
processes doing a particular type of I/O action as specified by the user.
test-url: https://github.com/axboe/fio



I generally ignore these reports, but since it's FIO I figured at least the test 
itself was valid.  However once again I'm unable to reproduce the results


linus master:

task_0: (groupid=0, jobs=8): err= 0: pid=38586: Wed Nov  4 08:13:36 2020
  write: IOPS=168k, BW=655MiB/s (687MB/s)(192GiB/31msec); 0 zone resets
clat (usec): min=26, max=786, avg=47.15, stdev= 7.21
 lat (usec): min=26, max=786, avg=47.21, stdev= 7.21
clat percentiles (nsec):
 |  1.00th=[31872],  5.00th=[35584], 10.00th=[37632], 20.00th=[40704],
 | 30.00th=[43264], 40.00th=[45312], 50.00th=[47360], 60.00th=[48896],
 | 70.00th=[50944], 80.00th=[52992], 90.00th=[56064], 95.00th=[59136],
 | 99.00th=[65280], 99.50th=[68096], 99.90th=[74240], 99.95th=[77312],
 | 99.99th=[88576]
   bw (  KiB/s): min=63752, max=112864, per=12.50%, avg=83810.53, 
stdev=3403.48, samples=4792

   iops: min=15938, max=28216, avg=20952.61, stdev=850.87, samples=4792
  lat (usec)   : 50=65.73%, 100=34.27%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  cpu  : usr=2.22%, sys=97.77%, ctx=5054, majf=0, minf=63
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=0,50298940,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=655MiB/s (687MB/s), 655MiB/s-655MiB/s (687MB/s-687MB/s), io=192GiB 
(206GB), run=31-31msec


kdave/for-next-20201104
task_0: (groupid=0, jobs=8): err= 0: pid=6652: Wed Nov  4 08:41:52 2020
  write: IOPS=180k, BW=705MiB/s (739MB/s)(207GiB/31msec); 0 zone resets
clat (usec): min=17, max=10603, avg=43.91, stdev= 9.62
 lat (usec): min=17, max=10603, avg=43.98, stdev= 9.62
clat percentiles (nsec):
 |  1.00th=[25984],  5.00th=[31104], 10.00th=[33536], 20.00th=[37120],
 | 30.00th=[39168], 40.00th=[41216], 50.00th=[43264], 60.00th=[45824],
 | 70.00th=[47872], 80.00th=[50944], 90.00th=[54528], 95.00th=[57600],
 | 99.00th=[64768], 99.50th=[68096], 99.90th=[74240], 99.95th=[78336],
 | 99.99th=[90624]
   bw (  KiB/s): min=66760, max=123160, per=12.50%, avg=90221.11, 
stdev=9052.52, samples=4792

   iops: min=16690, max=30790, avg=22555.24, stdev=2263.14, samples=4792
  lat (usec)   : 20=0.01%, 50=77.24%, 100=22.75%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 20=0.01%
  cpu  : usr=1.67%, sys=98.31%, ctx=4806, majf=0, minf=68
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwts: total=0,54134917,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=705MiB/s (739MB/s), 705MiB/s-705MiB/s (739MB/s-739MB/s), io=207GiB 
(222GB), run=31-31msec


So instead of -60% iops regression, I'm seeing a 7% iops improvement.  The only 
difference is that my machine doesn't have 192 threads, it has 80.  Thanks,


Josef


Re: [PATCH] fs: fix NULL dereference due to data race in prepend_path()

2020-10-14 Thread Josef Bacik

On 10/14/20 4:45 PM, Andrii Nakryiko wrote:

Fix data race in prepend_path() with re-reading mnt->mnt_ns twice without
holding the lock. is_mounted() does check for NULL, but is_anon_ns(mnt->mnt_ns)
might re-read the pointer again which could be NULL already, if in between
reads one of kern_unmount()/kern_unmount_array()/umount_tree() sets mnt->mnt_ns
to NULL.

This is seen in production with the following stack trace:

[22942.418012] BUG: kernel NULL pointer dereference, address: 0048
...
[22942.976884] RIP: 0010:prepend_path.isra.4+0x1ce/0x2e0
[22943.037706] Code: 89 c6 e9 0d ff ff ff 49 8b 85 c0 00 00 00 48 85 c0 0f 84 9d 00 
00 00 48 3d 00 f0 ff ff 0f 87 91 00 00 00 49 8b 86 e0 00 00 00 <48> 83 78 48 00 
0f 94 c0 0f b6 c0 83 c0 01 e9 3b ff ff ff 39 0d 29
[22943.264141] RSP: 0018:c90020d6fd98 EFLAGS: 00010283
[22943.327058] RAX:  RBX:  RCX: 0007e5ee
[22943.413041] RDX: 889fb56ac0c0 RSI: ffd05dc8147e RDI: 88b1f845ab7b
[22943.499015] RBP: 889fbf8100c0 R08: c90020d6fe30 R09: c90020d6fe2c
[22943.584992] R10: c90020d6fe2c R11: ea00095836c0 R12: c90020d6fe30
[22943.670968] R13: 88b7d336bea0 R14: 88b7d336be80 R15: 88aeb78db980
[22943.756944] FS:  7f228447e980() GS:889fc00c() 
knlGS:
[22943.854448] CS:  0010 DS:  ES:  CR0: 80050033
[22943.923653] CR2: 0048 CR3: 001ed235e001 CR4: 007606e0
[22944.009630] DR0:  DR1:  DR2: 
[22944.095604] DR3:  DR6: fffe0ff0 DR7: 0400
[22944.181581] PKRU: 5554
[22944.214100] Call Trace:
[22944.243485]  d_path+0xe6/0x150
[22944.280202]  proc_pid_readlink+0x8f/0x100
[22944.328449]  vfs_readlink+0xf8/0x110
[22944.371456]  ? touch_atime+0x33/0xd0
[22944.414466]  do_readlinkat+0xfd/0x120
[22944.458522]  __x64_sys_readlinkat+0x1a/0x20
[22944.508868]  do_syscall_64+0x42/0x110
[22944.552928]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: f2683bd8d5bd ("[PATCH] fix d_absolute_path() interplay with fsmount()")
Signed-off-by: Andrii Nakryiko 
---
  fs/d_path.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/d_path.c b/fs/d_path.c
index 0f1fc1743302..a69e2cd36e6e 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -102,6 +102,8 @@ static int prepend_path(const struct path *path,
  
  		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {

struct mount *parent = READ_ONCE(mnt->mnt_parent);
+   struct mnt_namespace *mnt_ns;
+
/* Escaped? */
if (dentry != vfsmnt->mnt_root) {
bptr = *buffer;
@@ -116,7 +118,9 @@ static int prepend_path(const struct path *path,
vfsmnt = &mnt->mnt;
continue;
}
-   if (is_mounted(vfsmnt) && !is_anon_ns(mnt->mnt_ns))
+   mnt_ns = READ_ONCE(mnt->mnt_ns);
+   /* open-coded is_mounted() to use local mnt_ns */
+   if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
error = 1;  // absolute root
else


I had to go look at this code carefully to make sure that mnt == 
real_mount(vfsmnt), which it is.  I was also afraid that, if mnt->mnt_ns could 
change in between checks, we were just trading a possible NULL deref for a UAF, 
but we're under RCU here so we're good there as well.  You can add


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] btrfs: ref-verify: Fix memleak in add_extent_data_ref

2020-10-09 Thread Josef Bacik

On 8/27/20 3:43 AM, Dinghao Liu wrote:

When lookup_root_entry() fails, ref should be freed
just like when insert_ref_entry() fails.

Signed-off-by: Dinghao Liu 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 6/7] btrfs: Promote to unsigned long long before shifting

2020-10-09 Thread Josef Bacik

On 10/4/20 2:04 PM, Matthew Wilcox (Oracle) wrote:

On 32-bit systems, this shift will overflow for files larger than 4GB.

Cc: sta...@vger.kernel.org
Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
Signed-off-by: Matthew Wilcox (Oracle) 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 5/7] btrfs: Promote to unsigned long long before shifting

2020-10-09 Thread Josef Bacik

On 10/4/20 2:04 PM, Matthew Wilcox (Oracle) wrote:

On 32-bit systems, this shift will overflow for files larger than 4GB.

Cc: sta...@vger.kernel.org
Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and 
release")
Signed-off-by: Matthew Wilcox (Oracle) 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] btrfs: fix memdup.cocci warnings

2020-09-22 Thread Josef Bacik

On 9/22/20 6:21 AM, Julia Lawall wrote:

From: kernel test robot 

fs/btrfs/send.c:3854:8-15: WARNING opportunity for kmemdup

  Use kmemdup rather than duplicating its implementation

Generated by: scripts/coccinelle/api/memdup.cocci

Fixes: 28314eb24e6c ("btrfs: send, recompute reference path after orphanization of a 
directory")
Signed-off-by: kernel test robot 
Signed-off-by: Julia Lawall 


Reviewed-by: Josef Bacik 

Thanks,

Josef



Re: [PATCH] btrfs: Fix potential null pointer deref

2020-09-21 Thread Josef Bacik

On 9/21/20 3:12 PM, Alex Dewar wrote:

In btrfs_destroy_inode(), the variable root may be NULL, but the check
for this takes place after its value has already been dereferenced to
access its fs_info member. Move the dereference operation to later in
the function.

Fixes: a6dbd429d8dd ("Btrfs: fix panic when trying to destroy a newly 
allocated")
Addresses-Coverity: CID 1497103: Null pointer dereferences (REVERSE_INULL)
Signed-off-by: Alex Dewar 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: remove revalidate_disk()

2020-09-01 Thread Josef Bacik

On 9/1/20 11:57 AM, Christoph Hellwig wrote:

Hi Jens,

this series removes the revalidate_disk() function, which has been a
really odd duck in the last years.  The prime reason why most people
use it is because it propagates a size change from the gendisk to
the block_device structure.  But it also calls into the rather ill
defined ->revalidate_disk method which is rather useless for the
callers.  So this adds a new helper to just propagate the size, and
cleans up all kinds of mess around this area.  Follow on patches
will eventuall kill of ->revalidate_disk entirely, but ther are a lot
more patches needed for that.



I applied and built everything on Jens's for-next, patch #2 was fuzzy but it 
applied.


I checked through everything; the only thing that was strange to me is not 
calling revalidate_disk_size() in nvdimm, but since it's during attach, as you 
point out, it doesn't matter.  You can add


Reviewed-by: Josef Bacik 

To the series, thanks,

Josef


Re: [PATCH] btrfs: check return value of filemap_fdatawrite_range()

2020-08-21 Thread Josef Bacik

On 8/21/20 8:41 AM, Alex Dewar wrote:

In btrfs_dio_imap_begin(), filemap_fdatawrite_range() is called without
checking the return value. Add a check to catch errors.

Fixes: c0aaf9b7a114f ("btrfs: switch to iomap_dio_rw() for dio")
Addresses-Coverity: ("Unused value")
Signed-off-by: Alex Dewar 
---
  fs/btrfs/inode.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7b57aaa1f9acc..38fde20b4a81b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7347,9 +7347,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, 
loff_t start,
 * outstanding dirty pages are on disk.
 */
if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-_I(inode)->runtime_flags))
+_I(inode)->runtime_flags)) {
ret = filemap_fdatawrite_range(inode->i_mapping, start,
   start + length - 1);
+   if (ret)
+   return ret;
+   }
  
  	dio_data = kzalloc(sizeof(*dio_data), GFP_NOFS);

if (!dio_data)



Had to check to make sure there's no cleanup that's needed; there isn't, so 
you can add


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH][v2] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 5:10 PM, David Laight wrote:

From: Josef Bacik

Sent: 13 August 2020 18:19

...

We wouldn't even need the extra +1 part, since we're only copying in how much
the user wants anyway, we could just go ahead and convert this to

left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data);

and be fine, right?  Or am I misunderstanding what you're looking for?  Thanks,


Doesn't that need to be scnprintf()?
IIRC snprintf() returns the number of bytes that would have been
written were the buffer infinite size?
(I suspect this is an 'accidental' return value from the original
SYSV? userspace implementation that just dumped characters that
wouldn't fit in the buffer somewhere.)



Yeah, if you look at the patches I just sent you'll notice I used scnprintf() 
everywhere.  Thanks,


Josef
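
[Editorial illustration: a minimal userspace sketch (not from the thread) of
the return-value difference David describes; the buffer size and value are
made up.]

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[8];
	/* "0x12345\n" needs 9 bytes including the NUL, so it does not fit:
	 * snprintf() still returns 8 (the would-be length), while the
	 * kernel's scnprintf() would return the 7 characters actually
	 * stored. */
	int n = snprintf(buf, sizeof(buf), "0x%04x\n", 0x12345);

	printf("snprintf returned %d, buffer holds %zu chars\n",
	       n, strlen(buf));
	return 0;
}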


[PATCH 5/6] parport: rework procfs handlers to take advantage of the new buffer

2020-08-13 Thread Josef Bacik
The buffer coming from higher up the stack has an extra byte to handle
the NULL terminator in the string.  Instead of using a temporary buffer
to sprintf into and then copying into the buffer, just scnprintf
directly into the buffer and update lenp as appropriate.

Signed-off-by: Josef Bacik 
---
 drivers/parport/procfs.c | 108 +--
 1 file changed, 36 insertions(+), 72 deletions(-)

diff --git a/drivers/parport/procfs.c b/drivers/parport/procfs.c
index d740eba3c099..453d035ad5f6 100644
--- a/drivers/parport/procfs.c
+++ b/drivers/parport/procfs.c
@@ -37,9 +37,8 @@ static int do_active_device(struct ctl_table *table, int 
write,
  void *result, size_t *lenp, loff_t *ppos)
 {
struct parport *port = (struct parport *)table->extra1;
-   char buffer[256];
struct pardevice *dev;
-   int len = 0;
+   size_t ret = 0;
 
if (write)  /* can't happen anyway */
return -EACCES;
@@ -48,24 +47,19 @@ static int do_active_device(struct ctl_table *table, int 
write,
*lenp = 0;
return 0;
}
-   
+
for (dev = port->devices; dev ; dev = dev->next) {
if(dev == port->cad) {
-   len += sprintf(buffer, "%s\n", dev->name);
+   ret += scnprintf(result + ret, *lenp - ret, "%s\n",
+dev->name);
}
}
 
-   if(!len) {
-   len += sprintf(buffer, "%s\n", "none");
-   }
-
-   if (len > *lenp)
-   len = *lenp;
-   else
-   *lenp = len;
+   if (!ret)
+   ret = scnprintf(result, *lenp, "%s\n", "none");
 
-   *ppos += len;
-   memcpy(result, buffer, len);
+   *lenp = ret;
+   *ppos += ret;
return 0;
 }
 
@@ -75,8 +69,7 @@ static int do_autoprobe(struct ctl_table *table, int write,
 {
struct parport_device_info *info = table->extra2;
const char *str;
-   char buffer[256];
-   int len = 0;
+   size_t ret = 0;
 
if (write) /* permissions stop this */
return -EACCES;
@@ -85,30 +78,24 @@ static int do_autoprobe(struct ctl_table *table, int write,
*lenp = 0;
return 0;
}
-   
+
if ((str = info->class_name) != NULL)
-   len += sprintf (buffer + len, "CLASS:%s;\n", str);
+   ret += scnprintf(result + ret, *lenp - ret, "CLASS:%s;\n", str);
 
if ((str = info->model) != NULL)
-   len += sprintf (buffer + len, "MODEL:%s;\n", str);
+   ret += scnprintf(result + ret, *lenp - ret, "MODEL:%s;\n", str);
 
if ((str = info->mfr) != NULL)
-   len += sprintf (buffer + len, "MANUFACTURER:%s;\n", str);
+   ret += scnprintf(result + ret, *lenp - ret, 
"MANUFACTURER:%s;\n", str);
 
if ((str = info->description) != NULL)
-   len += sprintf (buffer + len, "DESCRIPTION:%s;\n", str);
+   ret += scnprintf(result + ret, *lenp - ret, 
"DESCRIPTION:%s;\n", str);
 
if ((str = info->cmdset) != NULL)
-   len += sprintf (buffer + len, "COMMAND SET:%s;\n", str);
-
-   if (len > *lenp)
-   len = *lenp;
-   else
-   *lenp = len;
+   ret += scnprintf(result + ret, *lenp - ret, "COMMAND 
SET:%s;\n", str);
 
-   *ppos += len;
-
-   memcpy(result, buffer, len);
+   *lenp = ret;
+   *ppos += ret;
return 0;
 }
 #endif /* IEEE1284.3 support. */
@@ -117,8 +104,7 @@ static int do_hardware_base_addr(struct ctl_table *table, 
int write,
 void *result, size_t *lenp, loff_t *ppos)
 {
struct parport *port = (struct parport *)table->extra1;
-   char buffer[20];
-   int len = 0;
+   size_t ret;
 
if (*ppos) {
*lenp = 0;
@@ -128,15 +114,10 @@ static int do_hardware_base_addr(struct ctl_table *table, 
int write,
if (write) /* permissions prevent this anyway */
return -EACCES;
 
-   len += sprintf (buffer, "%lu\t%lu\n", port->base, port->base_hi);
-
-   if (len > *lenp)
-   len = *lenp;
-   else
-   *lenp = len;
-
-   *ppos += len;
-   memcpy(result, buffer, len);
+   ret = scnprintf(result, *lenp, "%lu\t%lu\n", port->base,
+   port->base_hi);
+   *lenp = ret;
+   *ppos += ret;
return 0;
 }
 
@@ -144,8 +125,7 @@ static int do_hardware_irq(struct ctl_table *table, int 
write,
   void *result, size_t *lenp, loff_t *ppos)
 {
struct parport *port = (struct parport *)table->

[PATCH 6/6] sunrpc: rework proc handlers to take advantage of the new buffer

2020-08-13 Thread Josef Bacik
Now that we're allocating an extra slot for the NULL terminated string,
use scnprintf() and write directly into the buffer.

Signed-off-by: Josef Bacik 
---
 net/sunrpc/sysctl.c| 10 ++
 net/sunrpc/xprtrdma/svc_rdma.c | 16 ++--
 2 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/sunrpc/sysctl.c b/net/sunrpc/sysctl.c
index 999eee1ed61c..31ed530d9846 100644
--- a/net/sunrpc/sysctl.c
+++ b/net/sunrpc/sysctl.c
@@ -117,14 +117,8 @@ proc_dodebug(struct ctl_table *table, int write, void 
*buffer, size_t *lenp,
if (strcmp(table->procname, "rpc_debug") == 0)
rpc_show_tasks(_net);
} else {
-   len = sprintf(tmpbuf, "0x%04x", *(unsigned int *) table->data);
-   if (len > left)
-   len = left;
-   memcpy(buffer, tmpbuf, len);
-   if ((left -= len) > 0) {
-   *((char *)buffer + len) = '\n';
-   left--;
-   }
+   len = scnprintf(buffer, *lenp, "0x%04x\n", *(unsigned int *) 
table->data);
+   left -= len;
}
 
 done:
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 526da5d4710b..9b3a113598af 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -90,20 +90,8 @@ static int read_reset_stat(struct ctl_table *table, int 
write,
if (write)
atomic_set(stat, 0);
else {
-   char str_buf[32];
-   int len = snprintf(str_buf, 32, "%d\n", atomic_read(stat));
-   if (len >= 32)
-   return -EFAULT;
-   len = strlen(str_buf);
-   if (*ppos > len) {
-   *lenp = 0;
-   return 0;
-   }
-   len -= *ppos;
-   if (len > *lenp)
-   len = *lenp;
-   if (len)
-   memcpy(buffer, str_buf, len);
+   size_t len = scnprintf(buffer, *lenp, "%d\n",
+  atomic_read(stat));
*lenp = len;
*ppos += len;
}
-- 
2.24.1



[PATCH 4/6] sysctl: make proc_put_long() use scnprintf

2020-08-13 Thread Josef Bacik
Now that we're passing down a kernel buffer with enough space to account
for an extra NULL terminator, go ahead and use scnprintf() to print out
a long in proc_put_long().  count here includes NULL terminator slot in
the buffer, so we will get the correct behavior we're looking for.

Signed-off-by: Josef Bacik 
---
 kernel/sysctl.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 287862f91717..d8cc8737f58f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -484,6 +484,7 @@ static int proc_get_long(char **buf, size_t *size,
 
return 0;
 }
+#undef TMPBUFLEN
 
 /**
  * proc_put_long - converts an integer to a decimal ASCII formatted string
@@ -498,18 +499,12 @@ static int proc_get_long(char **buf, size_t *size,
  */
 static void proc_put_long(void **buf, size_t *size, unsigned long val, bool 
neg)
 {
-   int len;
-   char tmp[TMPBUFLEN], *p = tmp;
+   size_t ret;
 
-   sprintf(p, "%s%lu", neg ? "-" : "", val);
-   len = strlen(tmp);
-   if (len > *size)
-   len = *size;
-   memcpy(*buf, tmp, len);
-   *size -= len;
-   *buf += len;
+   ret = scnprintf(*buf, *size, "%s%lu", neg ? "-" : "", val);
+   *size -= ret;
+   *buf += ret;
 }
-#undef TMPBUFLEN
 
 static void proc_put_char(void **buf, size_t *size, char c)
 {
-- 
2.24.1



[PATCH 0/6] Some buffer management fixes for proc

2020-08-13 Thread Josef Bacik
This initialy started with

  [PATCH 1/6] proc: use vmalloc for our kernel buffer

Which came about because we were getting page alloc failures when cat tried to
do a read with a 64kib buffer, triggering an order 4 allocation.  We need to
switch to kvmalloc for this buffer to avoid these high order allocations.  Then
Christoph suggested renaming vmemdup_user to kvmemdup_user, so then we have this

  [PATCH 2/6] tree-wide: rename vmemdup_user to kvmemdup_user

And then finally Viro noticed that if we allocate an extra byte for the NULL
terminator then we can use scnprintf() in a few places, and thus the next set of
patches

  [PATCH 3/6] proc: allocate count + 1 for our read buffer
  [PATCH 4/6] sysctl: make proc_put_long() use scnprintf
  [PATCH 5/6] parport: rework procfs handlers to take advantage of the
  [PATCH 6/6] sunrpc: rework proc handlers to take advantage of the new

There's one case that I didn't convert, _proc_do_string, and that's because it's
one of the few places that takes into account ppos, and so we'll skip forward in
the string we're provided with from the caller.  In this case it makes sense to
just leave it the way it is.  I'm pretty sure I caught all the other people who
directly mess with the buffer, but there are around 800 ->proc_handlers, and my 
eyes started to glaze over after a while.
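
For reference, a minimal sketch (not part of this series; the handler name
proc_do_myvalue and the int it prints are hypothetical) of what a converted
read-side handler looks like once the core allocates count + 1 bytes and the
handler can format straight into the kernel buffer with scnprintf():

	/*
	 * Hypothetical ->proc_handler illustrating the pattern patches 5 and 6
	 * apply: scnprintf() returns the bytes actually written, NUL excluded,
	 * and never more than *lenp, so no manual truncation bookkeeping.
	 */
	static int proc_do_myvalue(struct ctl_table *table, int write,
				   void *buffer, size_t *lenp, loff_t *ppos)
	{
		if (write)
			return -EPERM;		/* read-only example */

		if (*ppos) {			/* nothing more after the first read */
			*lenp = 0;
			return 0;
		}

		*lenp = scnprintf(buffer, *lenp, "%d\n", *(int *)table->data);
		*ppos += *lenp;
		return 0;
	}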

Josef



[PATCH 1/6] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik
Since

  sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.
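
As an aside, a tiny userspace calculation (illustrative only, and assuming the
usual 4 KiB page size) of why the 64 KiB buffer above is an order-4 request:

	#include <stdio.h>

	int main(void)
	{
		size_t page_size = 4096;	/* assumed 4 KiB pages */
		size_t request = 64 * 1024;	/* the 64 KiB read buffer from 'cat' */
		size_t pages = (request + page_size - 1) / page_size;
		int order = 0;

		while (((size_t)1 << order) < pages)
			order++;

		/* prints: 65536 bytes -> 16 contiguous pages -> order-4 */
		printf("%zu bytes -> %zu contiguous pages -> order-%d\n",
		       request, pages, order);
		return 0;
	}

kvmalloc() avoids having to find that many physically contiguous pages by
falling back to vmalloc() for large requests.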

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
 fs/proc/proc_sysctl.c  |  6 +++---
 include/linux/string.h |  1 +
 mm/util.c  | 27 +++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..8e19bad83b45 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
 
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = kvmemdup_user_nul(ubuf, count);
if (IS_ERR(kbuf)) {
error = PTR_ERR(kbuf);
goto out;
}
} else {
-   kbuf = kzalloc(count, GFP_KERNEL);
+   kbuf = kvzalloc(count, GFP_KERNEL);
if (!kbuf)
goto out;
}
@@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
 
error = count;
 out_free_buf:
-   kfree(kbuf);
+   kvfree(kbuf);
 out:
sysctl_head_finish(head);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 9b7a0632e87a..21bb6d3d88c4 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -12,6 +12,7 @@
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
 extern void *vmemdup_user(const void __user *, size_t);
+extern void *kvmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 5ef378a2a038..cf454d57d3e2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -208,6 +208,33 @@ void *vmemdup_user(const void __user *src, size_t len)
 }
 EXPORT_SYMBOL(vmemdup_user);
 
+/**
+ * kvmemdup_user_nul - duplicate memory region from user space and 
NUL-terminate
+ *
+ * @src: source address in user space
+ * @len: number of bytes to copy
+ *
+ * Return: an ERR_PTR() on failure.  Result may be not
+ * physically contiguous.  Use kvfree() to free.
+ */
+void *kvmemdup_user_nul(const void __user *src, size_t len)
+{
+   char *p;
+
+   p = kvmalloc(len + 1, GFP_USER);
+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   if (copy_from_user(p, src, len)) {
+   kvfree(p);
+   return ERR_PTR(-EFAULT);
+   }
+   p[len] = '\0';
+
+   return p;
+}
+EXPORT_SYMBOL(kvmemdup_user_nul);
+
 /**
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate
-- 
2.24.1



[PATCH 3/6] proc: allocate count + 1 for our read buffer

2020-08-13 Thread Josef Bacik
Al suggested that if we allocate enough space to add in the '\0'
character at the end of our strings, we could just use scnprintf() in
our ->proc_handler functions without having to be fancy about keeping
track of space.  There are a lot of these handlers, so the follow ups
will be separate, but start with allocating the extra byte to handle the
null termination of strings.
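
To illustrate the property being relied on here, a small userspace sketch; the
scnprintf_demo() shim below only mimics the kernel's scnprintf(), and the count
of 7 with the 0xf0ad value are made-up examples:

	#include <stdarg.h>
	#include <stdio.h>

	/*
	 * Stand-in for the kernel's scnprintf(): returns the number of chars
	 * actually written to buf, NOT counting the trailing NUL, and never
	 * more than size - 1.  Plain snprintf() instead returns what it
	 * *would* have written, which is what made the old bookkeeping fiddly.
	 */
	static int scnprintf_demo(char *buf, size_t size, const char *fmt, ...)
	{
		va_list args;
		int i;

		va_start(args, fmt);
		i = vsnprintf(buf, size, fmt, args);
		va_end(args);

		if (i >= (int)size)
			i = size ? (int)size - 1 : 0;
		return i;
	}

	int main(void)
	{
		/* userspace asked for count = 7, so we allocate 7 + 1 bytes */
		char buf[8];
		int ret = scnprintf_demo(buf, sizeof(buf), "0x%04x\n", 0xf0ad);

		/* ret is 7 (all of "0xf0ad\n"), never 8, so copying ret bytes
		 * back to userspace can never exceed the requested count */
		printf("wrote %d bytes: %s", ret, buf);
		return 0;
	}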

Signed-off-by: Josef Bacik 
---
 fs/proc/proc_sysctl.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 8e19bad83b45..446e7a949025 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -548,6 +548,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
void *kbuf;
ssize_t error;
+   size_t orig_count = count;
 
if (IS_ERR(head))
return PTR_ERR(head);
@@ -577,9 +578,23 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
}
} else {
-   kbuf = kvzalloc(count, GFP_KERNEL);
+   /*
+* To make our lives easier in ->proc_handler, we allocate an
+* extra byte to allow us to use scnprintf() for handling the
+* buffer output.  This works properly because scnprintf() will
+* only return the number of bytes that it was able to write
+* out, _NOT_ including the NULL byte.  This means the handlers
+* will only ever return a maximum of count as what they've
+* copied.
+*
+* HOWEVER, we do not assume that ->proc_handlers are without
+* bugs, so further down we'll do an extra check to make sure
+* that count isn't larger than the orig_count.
+*/
+   kbuf = kvzalloc(count + 1, GFP_KERNEL);
if (!kbuf)
goto out;
+   count += 1;
}
 
error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
@@ -593,6 +608,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out_free_buf;
 
if (!write) {
+   /*
+* This shouldn't happen, but those are the last words before
+* somebody adds a security vulnerability, so just make sure
+* that count isn't larger than orig_count.
+*/
+   if (count > orig_count)
+   count = orig_count;
error = -EFAULT;
if (copy_to_user(ubuf, kbuf, count))
goto out_free_buf;
-- 
2.24.1



[PATCH 2/6] tree-wide: rename vmemdup_user to kvmemdup_user

2020-08-13 Thread Josef Bacik
This helper uses kvmalloc, not vmalloc, so rename it to kvmemdup_user to
make it clear we're using kvmalloc() and will need to use kvfree().
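
A minimal sketch of the pairing the new name is meant to signal (hypothetical
caller, not taken from this series): the returned buffer may be vmalloc-backed,
so it has to go back through kvfree(), never kfree().

	/* Hypothetical caller of the renamed helper. */
	static int copy_and_use(const void __user *ubuf, size_t len)
	{
		void *buf = kvmemdup_user(ubuf, len);

		if (IS_ERR(buf))
			return PTR_ERR(buf);

		/* ... operate on buf ... */

		kvfree(buf);	/* kmalloc- or vmalloc-backed, kvfree() handles both */
		return 0;
	}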

Signed-off-by: Josef Bacik 
---
 arch/x86/kvm/cpuid.c   | 6 +++---
 drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 +-
 drivers/tty/vt/consolemap.c| 2 +-
 include/linux/string.h | 2 +-
 mm/util.c  | 6 +++---
 sound/core/control.c   | 4 ++--
 virt/kvm/kvm_main.c| 4 ++--
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3fd6eec202d7..22834ea499ee 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -200,9 +200,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
goto out;
if (cpuid->nent) {
-   cpuid_entries = vmemdup_user(entries,
-array_size(sizeof(struct 
kvm_cpuid_entry),
-   cpuid->nent));
+   cpuid_entries = kvmemdup_user(entries,
+ array_size(sizeof(struct 
kvm_cpuid_entry),
+cpuid->nent));
if (IS_ERR(cpuid_entries)) {
r = PTR_ERR(cpuid_entries);
goto out;
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c 
b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 7a2430e34e00..c2f973aa3680 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -147,7 +147,7 @@ static int virtio_gpu_execbuffer_ioctl(struct drm_device 
*dev, void *data,
bo_handles = NULL;
}
 
-   buf = vmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size);
+   buf = kvmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_unused_fd;
diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 5947b54d92be..2cffa8b3c74b 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -542,7 +542,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct 
unipair __user *list)
if (!ct)
return 0;
 
-   unilist = vmemdup_user(list, array_size(sizeof(struct unipair), ct));
+   unilist = kvmemdup_user(list, array_size(sizeof(struct unipair), ct));
if (IS_ERR(unilist))
return PTR_ERR(unilist);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 21bb6d3d88c4..a6f7218124a0 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -11,7 +11,7 @@
 
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
-extern void *vmemdup_user(const void __user *, size_t);
+extern void *kvmemdup_user(const void __user *, size_t);
 extern void *kvmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
diff --git a/mm/util.c b/mm/util.c
index cf454d57d3e2..f434634b6ba3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -183,7 +183,7 @@ void *memdup_user(const void __user *src, size_t len)
 EXPORT_SYMBOL(memdup_user);
 
 /**
- * vmemdup_user - duplicate memory region from user space
+ * kvmemdup_user - duplicate memory region from user space
  *
  * @src: source address in user space
  * @len: number of bytes to copy
@@ -191,7 +191,7 @@ EXPORT_SYMBOL(memdup_user);
  * Return: an ERR_PTR() on failure.  Result may be not
  * physically contiguous.  Use kvfree() to free.
  */
-void *vmemdup_user(const void __user *src, size_t len)
+void *kvmemdup_user(const void __user *src, size_t len)
 {
void *p;
 
@@ -206,7 +206,7 @@ void *vmemdup_user(const void __user *src, size_t len)
 
return p;
 }
-EXPORT_SYMBOL(vmemdup_user);
+EXPORT_SYMBOL(kvmemdup_user);
 
 /**
  * kvmemdup_user_nul - duplicate memory region from user space and 
NUL-terminate
diff --git a/sound/core/control.c b/sound/core/control.c
index aa0c0cf182af..b712f4d261de 100644
--- a/sound/core/control.c
+++ b/sound/core/control.c
@@ -1297,7 +1297,7 @@ static int replace_user_tlv(struct snd_kcontrol *kctl, 
unsigned int __user *buf,
if (size > 1024 * 128)  /* sane value */
return -EINVAL;
 
-   container = vmemdup_user(buf, size);
+   container = kvmemdup_user(buf, size);
if (IS_ERR(container))
return PTR_ERR(container);
 
@@ -1365,7 +1365,7 @@ static int snd_ctl_elem_init_enum_names(struct 
user_element *ue)
if (ue->info.value.enumerated.names_length > 64 * 1024)
return -EINVAL;
 
-   names = vmemdup_user((const void __user *)user_ptrval,
+   names = kvmemdup_user((const void __user *)user_ptrval,
ue

Re: [PATCH][v2] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 1:31 PM, Al Viro wrote:

On Thu, Aug 13, 2020 at 01:19:18PM -0400, Josef Bacik wrote:


in sunrpc proc_dodebug() turns into
left -= snprintf(buffer, left, "0x%04x\n",
 *(unsigned int *) table->data);
and that's not the only example.

 left + 1, that is.



We wouldn't even need the extra +1 part, since we're only copying in how
much the user wants anyway, we could just go ahead and convert this to

left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data);

and be fine, right?  Or am I misunderstanding what you're looking for?  Thanks,


snprintf() always produces a NUL-terminated string.  And if you are passing 7 as
len, you want 0xf0ad\n to be copied to user.  For that you need 8 passed to
snprintf, and 8-byte buffer given to it.
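
A quick userspace check of the sizes being discussed (illustrative only):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char small[8], big[8];

		/* size 7: snprintf() reserves one byte for the NUL, so only
		 * "0xf0ad" (6 chars) fits and the trailing '\n' is dropped */
		snprintf(small, 7, "0x%04x\n", 0xf0ad);

		/* size 8 with an 8-byte buffer: the full "0xf0ad\n" (7 chars) fits */
		snprintf(big, 8, "0x%04x\n", 0xf0ad);

		printf("size 7 kept %zu chars, size 8 kept %zu chars\n",
		       strlen(small), strlen(big));	/* prints 6 and 7 */
		return 0;
	}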



Right, gotcha.  I'll rig that up and see how it looks.  I'd recommend looking 
through what I do with a fine tooth comb, I'm obviously not batting 1000 today. 
Thanks,


Josef


Re: [PATCH][v2] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 12:20 PM, Al Viro wrote:

On Thu, Aug 13, 2020 at 05:41:17PM +0200, Christoph Hellwig wrote:

On Thu, Aug 13, 2020 at 11:40:00AM -0400, Josef Bacik wrote:

On 8/13/20 11:37 AM, Christoph Hellwig wrote:

On Thu, Aug 13, 2020 at 11:33:56AM -0400, Josef Bacik wrote:

Since

sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
v1->v2:
- Make vmemdup_user_nul actually do the right thing...sorry about that.

   fs/proc/proc_sysctl.c  |  6 +++---
   include/linux/string.h |  1 +
   mm/util.c  | 27 +++
   3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..207ac6e6e028 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = vmemdup_user_nul(ubuf, count);


Given that this can also do a kmalloc and thus needs to be paired
with kvfree shouldn't it be kvmemdup_user_nul?



There's an existing vmemdup_user that does kvmalloc, so I followed the
existing naming convention.  Do you want me to change them both?  Thanks,


I personally would, and given that it only has a few users it might
even be feasible.


FWIW, how about following or combining that with "allocate count + 1 bytes on
the read side"?  Allows some nice cleanups - e.g.
 len = sprintf(tmpbuf, "0x%04x", *(unsigned int *) table->data);
 if (len > left)
 len = left;
 memcpy(buffer, tmpbuf, len);
 if ((left -= len) > 0) {
 *((char *)buffer + len) = '\n';
 left--;
 }
in sunrpc proc_dodebug() turns into
left -= snprintf(buffer, left, "0x%04x\n",
 *(unsigned int *) table->data);
and that's not the only example.



We wouldn't even need the extra +1 part, since we're only copying in how much 
the user wants anyway, we could just go ahead and convert this to


left -= snprintf(buffer, left, "0x%04x\n", *(unsigned int *) table->data);

and be fine, right?  Or am I misunderstanding what you're looking for?  Thanks,

Josef


Re: [PATCH] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 12:19 PM, David Laight wrote:

From: Josef Bacik

Sent: 13 August 2020 15:53

   sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.


What happens if I run 'dd bs=16M ...' ?

David



/* don't even try if the size is too large */
error = -ENOMEM;
if (count >= KMALLOC_MAX_SIZE)
goto out;

is above this code, thanks,

Josef


[PATCH 2/2] tree-wide: rename vmemdup_user to kvmemdup_user

2020-08-13 Thread Josef Bacik
This helper uses kvmalloc, not vmalloc, so rename it to kvmemdup_user to
make it clear we're using kvmalloc() and will need to use kvfree().

Signed-off-by: Josef Bacik 
---
 arch/x86/kvm/cpuid.c   | 6 +++---
 drivers/gpu/drm/virtio/virtgpu_ioctl.c | 2 +-
 drivers/tty/vt/consolemap.c| 2 +-
 include/linux/string.h | 2 +-
 mm/util.c  | 6 +++---
 sound/core/control.c   | 4 ++--
 virt/kvm/kvm_main.c| 4 ++--
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3fd6eec202d7..22834ea499ee 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -200,9 +200,9 @@ int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
goto out;
if (cpuid->nent) {
-   cpuid_entries = vmemdup_user(entries,
-array_size(sizeof(struct 
kvm_cpuid_entry),
-   cpuid->nent));
+   cpuid_entries = kvmemdup_user(entries,
+ array_size(sizeof(struct 
kvm_cpuid_entry),
+cpuid->nent));
if (IS_ERR(cpuid_entries)) {
r = PTR_ERR(cpuid_entries);
goto out;
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c 
b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 7a2430e34e00..c2f973aa3680 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -147,7 +147,7 @@ static int virtio_gpu_execbuffer_ioctl(struct drm_device 
*dev, void *data,
bo_handles = NULL;
}
 
-   buf = vmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size);
+   buf = kvmemdup_user(u64_to_user_ptr(exbuf->command), exbuf->size);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_unused_fd;
diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 5947b54d92be..2cffa8b3c74b 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -542,7 +542,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct 
unipair __user *list)
if (!ct)
return 0;
 
-   unilist = vmemdup_user(list, array_size(sizeof(struct unipair), ct));
+   unilist = kvmemdup_user(list, array_size(sizeof(struct unipair), ct));
if (IS_ERR(unilist))
return PTR_ERR(unilist);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 21bb6d3d88c4..a6f7218124a0 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -11,7 +11,7 @@
 
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
-extern void *vmemdup_user(const void __user *, size_t);
+extern void *kvmemdup_user(const void __user *, size_t);
 extern void *kvmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
diff --git a/mm/util.c b/mm/util.c
index cf454d57d3e2..f434634b6ba3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -183,7 +183,7 @@ void *memdup_user(const void __user *src, size_t len)
 EXPORT_SYMBOL(memdup_user);
 
 /**
- * vmemdup_user - duplicate memory region from user space
+ * kvmemdup_user - duplicate memory region from user space
  *
  * @src: source address in user space
  * @len: number of bytes to copy
@@ -191,7 +191,7 @@ EXPORT_SYMBOL(memdup_user);
  * Return: an ERR_PTR() on failure.  Result may be not
  * physically contiguous.  Use kvfree() to free.
  */
-void *vmemdup_user(const void __user *src, size_t len)
+void *kvmemdup_user(const void __user *src, size_t len)
 {
void *p;
 
@@ -206,7 +206,7 @@ void *vmemdup_user(const void __user *src, size_t len)
 
return p;
 }
-EXPORT_SYMBOL(vmemdup_user);
+EXPORT_SYMBOL(kvmemdup_user);
 
 /**
  * kvmemdup_user_nul - duplicate memory region from user space and 
NUL-terminate
diff --git a/sound/core/control.c b/sound/core/control.c
index aa0c0cf182af..b712f4d261de 100644
--- a/sound/core/control.c
+++ b/sound/core/control.c
@@ -1297,7 +1297,7 @@ static int replace_user_tlv(struct snd_kcontrol *kctl, 
unsigned int __user *buf,
if (size > 1024 * 128)  /* sane value */
return -EINVAL;
 
-   container = vmemdup_user(buf, size);
+   container = kvmemdup_user(buf, size);
if (IS_ERR(container))
return PTR_ERR(container);
 
@@ -1365,7 +1365,7 @@ static int snd_ctl_elem_init_enum_names(struct 
user_element *ue)
if (ue->info.value.enumerated.names_length > 64 * 1024)
return -EINVAL;
 
-   names = vmemdup_user((const void __user *)user_ptrval,
+   names = kvmemdup_user((const void __user *)user_ptrval,
ue

[PATCH 1/2][v3] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik
Since

  sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
v2->v3:
- Rename vmemdup_user_nul to kvmemdup_user_nul.
v1->v2:
- Make vmemdup_user_nul actually do the right thing...sorry about that.

 fs/proc/proc_sysctl.c  |  6 +++---
 include/linux/string.h |  1 +
 mm/util.c  | 27 +++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..8e19bad83b45 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
 
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = kvmemdup_user_nul(ubuf, count);
if (IS_ERR(kbuf)) {
error = PTR_ERR(kbuf);
goto out;
}
} else {
-   kbuf = kzalloc(count, GFP_KERNEL);
+   kbuf = kvzalloc(count, GFP_KERNEL);
if (!kbuf)
goto out;
}
@@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
 
error = count;
 out_free_buf:
-   kfree(kbuf);
+   kvfree(kbuf);
 out:
sysctl_head_finish(head);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 9b7a0632e87a..21bb6d3d88c4 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -12,6 +12,7 @@
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
 extern void *vmemdup_user(const void __user *, size_t);
+extern void *kvmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 5ef378a2a038..cf454d57d3e2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -208,6 +208,33 @@ void *vmemdup_user(const void __user *src, size_t len)
 }
 EXPORT_SYMBOL(vmemdup_user);
 
+/**
+ * kvmemdup_user_nul - duplicate memory region from user space and 
NUL-terminate
+ *
+ * @src: source address in user space
+ * @len: number of bytes to copy
+ *
+ * Return: an ERR_PTR() on failure.  Result may be not
+ * physically contiguous.  Use kvfree() to free.
+ */
+void *kvmemdup_user_nul(const void __user *src, size_t len)
+{
+   char *p;
+
+   p = kvmalloc(len + 1, GFP_USER);
+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   if (copy_from_user(p, src, len)) {
+   kvfree(p);
+   return ERR_PTR(-EFAULT);
+   }
+   p[len] = '\0';
+
+   return p;
+}
+EXPORT_SYMBOL(kvmemdup_user_nul);
+
 /**
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate
-- 
2.24.1



Re: [PATCH][v2] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 11:37 AM, Christoph Hellwig wrote:

On Thu, Aug 13, 2020 at 11:33:56AM -0400, Josef Bacik wrote:

Since

   sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
v1->v2:
- Make vmemdup_user_nul actually do the right thing...sorry about that.

  fs/proc/proc_sysctl.c  |  6 +++---
  include/linux/string.h |  1 +
  mm/util.c  | 27 +++
  3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..207ac6e6e028 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
  
  	if (write) {

-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = vmemdup_user_nul(ubuf, count);


Given that this can also do a kmalloc and thus needs to be paired
with kvfree shouldn't it be kvmemdup_user_nul?



There's an existing vmemdup_user that does kvmalloc, so I followed the existing 
naming convention.  Do you want me to change them both?  Thanks,


Josef


[PATCH][v2] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik
Since

  sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
v1->v2:
- Make vmemdup_user_nul actually do the right thing...sorry about that.

 fs/proc/proc_sysctl.c  |  6 +++---
 include/linux/string.h |  1 +
 mm/util.c  | 27 +++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..207ac6e6e028 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
 
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = vmemdup_user_nul(ubuf, count);
if (IS_ERR(kbuf)) {
error = PTR_ERR(kbuf);
goto out;
}
} else {
-   kbuf = kzalloc(count, GFP_KERNEL);
+   kbuf = kvzalloc(count, GFP_KERNEL);
if (!kbuf)
goto out;
}
@@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
 
error = count;
 out_free_buf:
-   kfree(kbuf);
+   kvfree(kbuf);
 out:
sysctl_head_finish(head);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 9b7a0632e87a..aee3689fb865 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -12,6 +12,7 @@
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
 extern void *vmemdup_user(const void __user *, size_t);
+extern void *vmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 5ef378a2a038..9d0ad7aafc27 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -208,6 +208,33 @@ void *vmemdup_user(const void __user *src, size_t len)
 }
 EXPORT_SYMBOL(vmemdup_user);
 
+/**
+ * vmemdup_user_nul - duplicate memory region from user space and NUL-terminate
+ *
+ * @src: source address in user space
+ * @len: number of bytes to copy
+ *
+ * Return: an ERR_PTR() on failure.  Result may be not
+ * physically contiguous.  Use kvfree() to free.
+ */
+void *vmemdup_user_nul(const void __user *src, size_t len)
+{
+   char *p;
+
+   p = kvmalloc(len + 1, GFP_USER);
+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   if (copy_from_user(p, src, len)) {
+   kvfree(p);
+   return ERR_PTR(-EFAULT);
+   }
+   p[len] = '\0';
+
+   return p;
+}
+EXPORT_SYMBOL(vmemdup_user_nul);
+
 /**
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate
-- 
2.24.1



Re: [PATCH] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik

On 8/13/20 10:59 AM, Matthew Wilcox wrote:

On Thu, Aug 13, 2020 at 10:53:05AM -0400, Josef Bacik wrote:

+/**
+ * vmemdup_user - duplicate memory region from user space and NUL-terminate


vmemdup_user_nul()


+void *vmemdup_user_nul(const void __user *src, size_t len)
+{
+   void *p;
+
+   p = kvmalloc(len, GFP_USER);


len+1, shirley?


+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   if (copy_from_user(p, src, len)) {
+   kvfree(p);
+   return ERR_PTR(-EFAULT);
+   }


I think you forgot

 p[len] = '\0';



Sweet lord I need more sleep, my bad.  Thanks,

Josef


[PATCH] proc: use vmalloc for our kernel buffer

2020-08-13 Thread Josef Bacik
Since

  sysctl: pass kernel pointers to ->proc_handler

we have been pre-allocating a buffer to copy the data from the proc
handlers into, and then copying that to userspace.  The problem is this
just blind kmalloc()'s the buffer size passed in from the read, which in
the case of our 'cat' binary was 64kib.  Order-4 allocations are not
awesome, and since we can potentially allocate up to our maximum order,
use vmalloc for these buffers.

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Josef Bacik 
---
 fs/proc/proc_sysctl.c  |  6 +++---
 include/linux/string.h |  1 +
 mm/util.c  | 26 ++
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea..207ac6e6e028 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -571,13 +571,13 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
goto out;
 
if (write) {
-   kbuf = memdup_user_nul(ubuf, count);
+   kbuf = vmemdup_user_nul(ubuf, count);
if (IS_ERR(kbuf)) {
error = PTR_ERR(kbuf);
goto out;
}
} else {
-   kbuf = kzalloc(count, GFP_KERNEL);
+   kbuf = kvzalloc(count, GFP_KERNEL);
if (!kbuf)
goto out;
}
@@ -600,7 +600,7 @@ static ssize_t proc_sys_call_handler(struct file *filp, 
void __user *ubuf,
 
error = count;
 out_free_buf:
-   kfree(kbuf);
+   kvfree(kbuf);
 out:
sysctl_head_finish(head);
 
diff --git a/include/linux/string.h b/include/linux/string.h
index 9b7a0632e87a..aee3689fb865 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -12,6 +12,7 @@
 extern char *strndup_user(const char __user *, long);
 extern void *memdup_user(const void __user *, size_t);
 extern void *vmemdup_user(const void __user *, size_t);
+extern void *vmemdup_user_nul(const void __user *, size_t);
 extern void *memdup_user_nul(const void __user *, size_t);
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 5ef378a2a038..4de3b4b0f358 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -208,6 +208,32 @@ void *vmemdup_user(const void __user *src, size_t len)
 }
 EXPORT_SYMBOL(vmemdup_user);
 
+/**
+ * vmemdup_user - duplicate memory region from user space and NUL-terminate
+ *
+ * @src: source address in user space
+ * @len: number of bytes to copy
+ *
+ * Return: an ERR_PTR() on failure.  Result may be not
+ * physically contiguous.  Use kvfree() to free.
+ */
+void *vmemdup_user_nul(const void __user *src, size_t len)
+{
+   void *p;
+
+   p = kvmalloc(len, GFP_USER);
+   if (!p)
+   return ERR_PTR(-ENOMEM);
+
+   if (copy_from_user(p, src, len)) {
+   kvfree(p);
+   return ERR_PTR(-EFAULT);
+   }
+
+   return p;
+}
+EXPORT_SYMBOL(vmemdup_user_nul);
+
 /**
  * strndup_user - duplicate an existing string from user space
  * @s: The string to duplicate
-- 
2.24.1



Re: Inverted mount options completely broken (iversion,relatime)

2020-07-29 Thread Josef Bacik

On 7/29/20 2:41 PM, Eric Sandeen wrote:

On 7/29/20 11:32 AM, Josef Bacik wrote:

Hello,

Eric reported a problem to me where we were clearing SB_I_VERSION on remount of a btrfs 
file system.  After digging through I discovered it's because we expect the proper flags 
that we want to be passed in via the mount() syscall, and because we didn't have 
"iversion" in our show_options entry the mount binary (from util-linux) wasn't 
setting MS_I_VERSION for the remount, and thus the VFS was clearing SB_I_VERSION from our 
s_flags.

No big deal, I'll fix show_mount.  Except Eric then noticed that mount -o 
noiversion didn't do anything, we still get iversion set.  That's because btrfs 
just defaults to having SB_I_VERSION set.  Furthermore -o noiversion doesn't 
get sent into mount, it's handled by the mount binary itself, and it does this 
by not having MS_I_VERSION set in the mount flags.


This was beaten^Wdiscussed to death in an earlier thread,
[PATCH] fs: i_version mntopt gets visible through /proc/mounts

https://lore.kernel.org/linux-fsdevel/20200616202123.12656-1-msys.miz...@gmail.com/

tl;dr: hch doesn't think [no]iversion should be exposed as an option /at all/
so exposing it in /proc/mounts in show_mnt_opts for mount(8)'s benefit was
nacked.


This happens as well for -o relatime, it's the default and so if you do mount 
-o norelatime it won't do anything, you still get relatime behavior.


I think that's a different issue.


The only time this changes is if you do mount -o remount,norelatime.


Hm, not on xfs:

# mount -o loop,norelatime xfsfile  mnt
# grep loop /proc/mounts
/dev/loop0 /tmp/mnt xfs 
rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

# mount -o remount,norelatime mnt
# grep loop /proc/mounts
/dev/loop0 /tmp/mnt xfs 
rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0



Oops you're right, I'm blind.  Same happens for btrfs, so using -o norelatime 
simply does nothing because it's considered a kernel wide default.




Are there other oddities besides iversion and relatime?


It doesn't look like it, I checked a few others of the MS_INVERT variety, these 
appear to be the only ones.  I really don't want to have this discussion again 
in the future tho when we introduce MS_SOME_NEW_AWESOME.  Thanks,


Josef



Inverted mount options completely broken (iversion,relatime)

2020-07-29 Thread Josef Bacik

Hello,

Eric reported a problem to me where we were clearing SB_I_VERSION on remount of 
a btrfs file system.  After digging through I discovered it's because we expect 
the proper flags that we want to be passed in via the mount() syscall, and 
because we didn't have "iversion" in our show_options entry the mount binary 
(from util-linux) wasn't setting MS_I_VERSION for the remount, and thus the VFS 
was clearing SB_I_VERSION from our s_flags.


No big deal, I'll fix show_mount.  Except Eric then noticed that mount -o 
noiversion didn't do anything, we still get iversion set.  That's because btrfs 
just defaults to having SB_I_VERSION set.  Furthermore -o noiversion doesn't get 
sent into mount, it's handled by the mount binary itself, and it does this by 
not having MS_I_VERSION set in the mount flags.


This happens as well for -o relatime, it's the default and so if you do mount -o 
norelatime it won't do anything, you still get relatime behavior.  The only time 
this changes is if you do mount -o remount,norelatime.


So my question is, what do we do here?  I know Christoph has the strong opinion 
that we just don't expose I_VERSION at all, which frankly I'm ok with.  However 
more what I'm asking is what do we do with these weird inverted flags that we 
all just kind of ignore on mount?  The current setup is just broken if we want 
to allow overriding the defaults at mount time.  Are we ok with it just being 
broken?  Are we ok with things like mount -o noiversion not working because the 
file system has decided that I_VERSION (or relatime) is the default, and we're 
going to ignore what the user asks for unless we're remounting?  Thanks,


Josef


[PATCH] ftrace: fix ftrace_trace_task return value

2020-07-24 Thread Josef Bacik
I was attempting to use pid filtering with function_graph, but it wasn't
allowing anything to make it through.  Turns out ftrace_trace_task
returns false if ftrace_ignore_pid is not-empty, which isn't correct
anymore.  We're now setting it to FTRACE_PID_IGNORE if we need to ignore
that pid, otherwise it's set to the pid (which is weird considering the
name) or to FTRACE_PID_TRACE.  Fix the check to check for !=
FTRACE_PID_IGNORE.  With this we can now use function_graph with pid
filtering.
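
For illustration only (not from the patch), a tiny userspace model of the three
states the per-CPU value can now hold, showing why the old "!value" test
rejects everything once pid filtering is in use; the pid 1234 is made up:

	#include <stdio.h>

	#define FTRACE_PID_IGNORE	-1
	#define FTRACE_PID_TRACE	-2

	/* the old test traced only when the value was 0 */
	static int old_check(long v) { return !v; }
	/* the fixed test traces everything except FTRACE_PID_IGNORE */
	static int new_check(long v) { return v != FTRACE_PID_IGNORE; }

	int main(void)
	{
		long states[] = { FTRACE_PID_IGNORE, FTRACE_PID_TRACE, 1234 };
		const char *names[] = { "FTRACE_PID_IGNORE", "FTRACE_PID_TRACE",
					"pid 1234" };

		for (int i = 0; i < 3; i++)
			printf("%-18s old check traces: %d, fixed check traces: %d\n",
			       names[i], old_check(states[i]), new_check(states[i]));
		return 0;
	}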

Fixes: 717e3f5ebc82 ("ftrace: Make function trace pid filtering a bit more 
exact")
Signed-off-by: Josef Bacik 
---
 kernel/trace/ftrace.c | 3 ---
 kernel/trace/trace.h  | 7 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 1903b80db6eb..7d879fae3777 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -139,9 +139,6 @@ static inline void ftrace_ops_init(struct ftrace_ops *ops)
 #endif
 }
 
-#define FTRACE_PID_IGNORE  -1
-#define FTRACE_PID_TRACE   -2
-
 static void ftrace_pid_func(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *op, struct pt_regs *regs)
 {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 13db4000af3f..1531ec565cb5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1103,6 +1103,10 @@ print_graph_function_flags(struct trace_iterator *iter, 
u32 flags)
 extern struct list_head ftrace_pids;
 
 #ifdef CONFIG_FUNCTION_TRACER
+
+#define FTRACE_PID_IGNORE  -1
+#define FTRACE_PID_TRACE   -2
+
 struct ftrace_func_command {
struct list_headlist;
char*name;
@@ -1114,7 +1118,8 @@ struct ftrace_func_command {
 extern bool ftrace_filter_param __initdata;
 static inline int ftrace_trace_task(struct trace_array *tr)
 {
-   return !this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid);
+   return this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid) !=
+   FTRACE_PID_IGNORE;
 }
 extern int ftrace_is_dead(void);
 int ftrace_create_function_files(struct trace_array *tr,
-- 
2.24.1



Re: [PATCH v2] btrfs: fix mount failure caused by race with umount

2020-07-10 Thread Josef Bacik

On 7/10/20 1:23 PM, Boris Burkov wrote:

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                        btrfs_mount_root
                                          btrfs_scan_one_device
                                            fs_devices = device->fs_devices
                                            fs_info->fs_devices = fs_devices
                                          // fs_devices->opened makes this a no-op
                                          btrfs_open_devices(fs_devices, mode, fs_type)
                                          s = sget(fs_type, test, set, flags, fs_info);
                                            find sb in s_instances
                                            grab_super(sb);
                                              down_write(&sb->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
    return 0;
                                            retry lookup
                                            don't find sb in s_instances (deleted by CPU0)
                                            s = alloc_super
                                            return s;
                                          btrfs_fill_super(s, fs_devices, data)
                                            open_ctree // fs_devices total_rw_bytes improperly set!
                                              btrfs_read_chunk_tree
                                                read_one_dev // increment total_rw_bytes again!!
                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we observe that if we have already filled the device, the
state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can
use that to avoid filling it a second time for no reason and,
critically, avoid double counting in total_rw_bytes. One gotcha is that
read_one_chunk also sets this bit, which happens before read_one_dev (in
read_sys_array), so we must remove that setting of the bit as well, for
the state bit to truly correspond to the device struct being filled from
disk.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

for i in $(seq 0 500)
do
   dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo

does the trick for me.

A final note is that this fix actually breaks the fstest btrfs/163, but
having investigated it, I believe that is due to a subtle flaw in how
btrfs replace works when used on a seed device. The replace target device
never gets a correct dev_item with the sprout fsid written out. This
causes several problems, but for the sake of btrfs/163, read_one_chunk
marking the device with IN_FS_METADATA was wallpapering over it, which
this patch breaks. I will be sending a subsequent fix for the seed replace
issue which will also fix btrfs/163.

Signed-off-by: Boris Burkov 


Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] btrfs: fix mount failure caused by race with umount

2020-07-09 Thread Josef Bacik

On 7/9/20 8:44 PM, Boris Burkov wrote:

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                        btrfs_mount_root
                                          btrfs_scan_one_device
                                            fs_devices = device->fs_devices
                                            fs_info->fs_devices = fs_devices
                                          // fs_devices->opened makes this a no-op
                                          btrfs_open_devices(fs_devices, mode, fs_type)
                                          s = sget(fs_type, test, set, flags, fs_info);
                                            find sb in s_instances
                                            grab_super(sb);
                                              down_write(&sb->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
    return 0;
                                            retry lookup
                                            don't find sb in s_instances (deleted by CPU0)
                                            s = alloc_super
                                            return s;
                                          btrfs_fill_super(s, fs_devices, data)
                                            open_ctree // fs_devices total_rw_bytes improperly set!
                                              btrfs_read_chunk_tree
                                                read_one_dev // increment total_rw_bytes again!!
                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we observe that if we have already filled the device, the
state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can
use that to avoid filling it a second time for no reason and,
critically, avoid double counting in total_rw_bytes. One gotcha is that
read_one_chunk also sets this bit, which happens before read_one_dev (in
read_sys_array), so we must remove that setting of the bit as well, for
the state bit to truly correspond to the device struct being filled from
disk.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

for i in $(seq 0 500)
do
   dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo

does the trick for me.

A final note is that this fix actually breaks the fstest btrfs/163, but
having investigated it, I believe that is due to a subtle flaw in how
btrfs replace works when used on a seed device. The replace target device
never gets a correct dev_item with the sprout fsid written out. This
causes several problems, but for the sake of btrfs/163, read_one_chunk
marking the device with IN_FS_METADATA was wallpapering over it, which
this patch breaks. I will be sending a subsequent fix for the seed replace
issue which will also fix btrfs/163.

Signed-off-by: Boris Burkov 
---
  fs/btrfs/volumes.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..1d9bd1bbf893 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct 
extent_buffer *leaf,
}
btrfs_report_missing_device(fs_info, devid, uuid, 
false);
}
-   set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-   &(map->stripes[i].dev->dev_state));
-
}
  
write_lock(&map_tree->lock);

@@ -6815,6 +6812,15 @@ static int 

Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Josef Bacik

On 6/17/20 2:11 PM, Filipe Manana wrote:

On Wed, Jun 17, 2020 at 6:43 PM Chris Mason  wrote:


On 17 Jun 2020, at 13:20, Filipe Manana wrote:


On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:


---
  fs/btrfs/extent_io.c | 45

  1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int
write_one_eb(struct extent_buffer *eb,
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 num_pages = num_extent_pages(eb);
 atomic_set(&eb->io_pages, num_pages);
+   /*
+* It is possible for releasepage to clear the TREE_REF bit
before we
+* set io_pages. See check_buffer_tree_ref for a more
detailed comment.
+*/
+   check_buffer_tree_ref(eb);


This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?


This was Josef’s idea, but I really like the symmetry.  You set
io_pages, you do the tree_ref dance.  Everyone fiddling with the write
back bit right now correctly clears writeback after doing the atomic_dec
on io_pages, but the race is tiny and prone to getting exposed again by
shifting code around.  Tree ref checks around io_pages are the most
reliable way to prevent this bug from coming back again later.


Ok, but that still doesn't answer my question.
Is there an actual race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we
also call lock_extent_buffer_for_io(),
which clears the dirty bit and sets the writeback bit on the eb while
holding its ref_locks spin_lock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the ref_locks spin_lock,
so at any time it should return true, as either the dirty or the
writeback bit is set.

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk
in the write path?
All it mentions is a race between reading and releasing pages, there's
nothing mentioned about races with writeback.



I think maybe we make that bit a separate patch, and leave the panic fix on its 
own.  I suggested this because I have lofty ideas of killing the refs_lock, and 
this would at least keep us consistent in our usage of TREE_REF to save us from 
freeing stuff that may be currently under IO.


I'm super not happy with our reference handling coupled with releasepage, but 
these are the kind of hoops we're going to have to jump through until we have 
some better infrastructure in place to handle metadata.  Thanks,


Josef


Re: [btrfs] 3ae92b3782: xfstests.generic.269.fail

2019-09-03 Thread Josef Bacik
On Tue, Sep 03, 2019 at 04:06:33PM +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 3ae92b3782182d282a92573abe95c96d34ca6e73 ("btrfs: change the minimum 
> global reserve size")
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git 
> master
> 
> in testcase: xfstests
> with following parameters:
> 
>   disk: 4HDD
>   fs: btrfs
>   test: generic-group13
> 
> test-description: xfstests is a regression test suite for xfs and other filesystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 

It would help if you could capture generic/269.full, but this is likely a
problem with fsck that I fixed a few weeks ago, where we see that the nbytes of
an inode is wrong but there's an orphan item for it, so it doesn't matter.  This
patch just made it more likely for us to have a still-being-iput'ed inode after a
transaction commit.  Thanks,

Josef


Re: io.latency controller apparently not working

2019-08-16 Thread Josef Bacik
On Fri, Aug 16, 2019 at 12:57:41PM +0200, Paolo Valente wrote:
> Hi,
> I happened to test the io.latency controller, to make a comparison
> between this controller and BFQ.  But io.latency seems not to work,
> i.e., not to reduce latency compared with what happens with no I/O
> control at all.  Here is a summary of the results for one of the
> workloads I tested, on three different devices (latencies in ms):
> 
>              no I/O control   io.latency   BFQ
> NVMe SSD     1.9              1.9          0.07
> SATA SSD     39               56           0.7
> HDD          4500             4500         11
> 
> I have put all details on hardware, OS, scenarios and results in the
> attached pdf.  For your convenience, I'm pasting the source file too.
> 

Do you have the fio jobs you use for this?  I just tested on Jens's most recent
tree and io.latency appears to be doing what it's supposed to be doing.  We've
also started testing 5.2 in production and it's still working in production as
well.  The only thing I've touched recently was around wakeups and shouldn't
have broken everything.  I'm not sure why it's not working for you, but a fio
script will help me narrow down what's going on.  Thanks,

Josef


Re: [PATCH v2] nbd: replace kill_bdev() with __invalidate_device() again

2019-07-31 Thread Josef Bacik
On Wed, Jul 31, 2019 at 08:13:10PM +0800, SunKe wrote:
> From: Munehisa Kamata 
> 
> Commit abbbdf12497d ("replace kill_bdev() with __invalidate_device()")
> once did this, but 29eaadc03649 ("nbd: stop using the bdev everywhere")
> resurrected kill_bdev() and it has been there since then. So buffer_head
> mappings still get killed on a server disconnection, and we can still
> hit the BUG_ON on a filesystem on the top of the nbd device.
> 
>   EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts: (null)
>   block nbd0: Receive control failed (result -32)
>   block nbd0: shutting down sockets
>   print_req_error: I/O error, dev nbd0, sector 66264 flags 3000
>   EXT4-fs warning (device nbd0): htree_dirblock_to_tree:979: inode #2: lblock 
> 0: comm ls: error -5 reading directory block
>   print_req_error: I/O error, dev nbd0, sector 2264 flags 3000
>   EXT4-fs error (device nbd0): __ext4_get_inode_loc:4690: inode #2: block 
> 283: comm ls: unable to read itable block
>   EXT4-fs error (device nbd0) in ext4_reserve_inode_write:5894: IO failure
>   [ cut here ]
>   kernel BUG at fs/buffer.c:3057!
>   invalid opcode:  [#1] SMP PTI
>   CPU: 7 PID: 40045 Comm: jbd2/nbd0-8 Not tainted 5.1.0-rc3+ #4
>   Hardware name: Amazon EC2 m5.12xlarge/, BIOS 1.0 10/16/2017
>   RIP: 0010:submit_bh_wbc+0x18b/0x190
>   ...
>   Call Trace:
>jbd2_write_superblock+0xf1/0x230 [jbd2]
>? account_entity_enqueue+0xc5/0xf0
>jbd2_journal_update_sb_log_tail+0x94/0xe0 [jbd2]
>jbd2_journal_commit_transaction+0x12f/0x1d20 [jbd2]
>? __switch_to_asm+0x40/0x70
>...
>? lock_timer_base+0x67/0x80
>kjournald2+0x121/0x360 [jbd2]
>? remove_wait_queue+0x60/0x60
>kthread+0xf8/0x130
>? commit_timeout+0x10/0x10 [jbd2]
>? kthread_bind+0x10/0x10
>ret_from_fork+0x35/0x40
> 
> With __invalidate_device(), I no longer hit the BUG_ON with sync or
> unmount on the disconnected device.
> 

Jeeze I swear I see this same patch go by every 6 months or so, not sure what
happens to it.  Anyway

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 04/11] trace-cmd: add global functions for live tracing

2019-07-09 Thread Josef Bacik
Yup go for it, thanks,

Josef

Sent from my iPhone

> On Jul 9, 2019, at 9:09 AM, Steven Rostedt  wrote:
> 
> On Fri, 20 Nov 2015 15:57:26 -0500
> Josef Bacik  wrote:
> 
>> We need a few functions to disable/enable tracing as well as add events to be
>> enabled on the first instance, this patch turns a couple of these local
>> functions into library functions.  Thanks,
> 
> Hi Josef,
> 
> Not sure you still use this, as it's not really a library function
> anymore. But we are currently cleaning up the trace-cmd code to create
> a real library, and doing it in baby steps. The
> tracecmd_enable_events() function is causing some issues and it was
> added by you. Are you OK if we remove it. At least temporarily until we
> separate out the "enabling" part into the library?
> 
> Thanks!
> 
> -- Steve
> 
> 
>> 
>> Signed-off-by: Josef Bacik 
>> ---
>> trace-cmd.h|  5 +
>> trace-record.c | 45 +++--
>> 2 files changed, 32 insertions(+), 18 deletions(-)
>> 
>> diff --git a/trace-cmd.h b/trace-cmd.h
>> index b4fa7fd..9a9ca30 100644
>> --- a/trace-cmd.h
>> +++ b/trace-cmd.h
>> @@ -268,6 +268,11 @@ int tracecmd_start_recording(struct tracecmd_recorder 
>> *recorder, unsigned long s
>> void tracecmd_stop_recording(struct tracecmd_recorder *recorder);
>> void tracecmd_stat_cpu(struct trace_seq *s, int cpu);
>> long tracecmd_flush_recording(struct tracecmd_recorder *recorder);
>> +int tracecmd_add_event(const char *event_str, int stack);
>> +void tracecmd_enable_events(void);
>> +void tracecmd_disable_all_tracing(int disable_tracer);
>> +void tracecmd_disable_tracing(void);
>> +void tracecmd_enable_tracing(void);
>> 
>> /* --- Plugin handling --- */
>> extern struct pevent_plugin_option trace_ftrace_options[];
>> diff --git a/trace-record.c b/trace-record.c
>> index 417b701..7c471ab 100644
>> --- a/trace-record.c
>> +++ b/trace-record.c
>> @@ -841,7 +841,6 @@ static void update_ftrace_pids(int reset)
>> 
>> static void update_event_filters(struct buffer_instance *instance);
>> static void update_pid_event_filters(struct buffer_instance *instance);
>> -static void enable_tracing(void);
>> 
>> /**
>>  * make_pid_filter - create a filter string to all pids against @field
>> @@ -1106,7 +1105,7 @@ static void run_cmd(enum trace_type type, int argc, 
>> char **argv)
>>if (!pid) {
>>/* child */
>>update_task_filter();
>> -enable_tracing();
>> +tracecmd_enable_tracing();
>>enable_ptrace();
>>/*
>> * If we are using stderr for stdout, switch
>> @@ -1795,7 +1794,7 @@ static int read_tracing_on(struct buffer_instance 
>> *instance)
>>return ret;
>> }
>> 
>> -static void enable_tracing(void)
>> +void tracecmd_enable_tracing(void)
>> {
>>struct buffer_instance *instance;
>> 
>> @@ -1808,7 +1807,7 @@ static void enable_tracing(void)
>>reset_max_latency();
>> }
>> 
>> -static void disable_tracing(void)
>> +void tracecmd_disable_tracing(void)
>> {
>>struct buffer_instance *instance;
>> 
>> @@ -1816,9 +1815,9 @@ static void disable_tracing(void)
>>write_tracing_on(instance, 0);
>> }
>> 
>> -static void disable_all(int disable_tracer)
>> +void tracecmd_disable_all_tracing(int disable_tracer)
>> {
>> -disable_tracing();
>> +tracecmd_disable_tracing();
>> 
>>if (disable_tracer) {
>>disable_func_stack_trace();
>> @@ -1991,6 +1990,11 @@ static void enable_events(struct buffer_instance 
>> *instance)
>>}
>> }
>> 
>> +void tracecmd_enable_events(void)
>> +{
>> +enable_events(first_instance);
>> +}
>> +
>> static void set_clock(struct buffer_instance *instance)
>> {
>>char *path;
>> @@ -3074,15 +3078,15 @@ static char *get_date_to_ts(void)
>>}
>> 
>>for (i = 0; i < date2ts_tries; i++) {
>> -disable_tracing();
>> +tracecmd_disable_tracing();
>>clear_trace();
>> -enable_tracing();
>> +tracecmd_enable_tracing();
>> 
>>gettimeofday(, NULL);
>>write(tfd, STAMP, 5);
>>gettimeofday(, NULL);
>> 
>> -disable_tracing();
>> +tracecmd_disable_tracing();
>>ts = find_time_stamp(pevent);
>>if (!ts)
>>continue;
>> @@ -3699,6 +3703,11 @@ profile_a

Re: [PATCH 10/10] sched,fair: flatten hierarchical runqueues

2019-07-01 Thread Josef Bacik
On Fri, Jun 28, 2019 at 04:49:13PM -0400, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
> 
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
> 
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
> 
> Signed-off-by: Rik van Riel 
> ---
>  include/linux/sched.h |   2 +
>  kernel/sched/fair.c   | 452 +++---
>  kernel/sched/pelt.c   |   6 +-
>  kernel/sched/pelt.h   |   2 +-
>  kernel/sched/sched.h  |   2 +-
>  5 files changed, 171 insertions(+), 293 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84a6cc6f5c47..901c710363e7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -453,6 +453,8 @@ struct sched_entity {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   int depth;
>   unsigned long   enqueued_h_load;
> + unsigned long   enqueued_h_weight;
> + struct load_weight  h_load;
>   struct sched_entity *parent;
>   /* rq on which this entity is (to be) queued: */
>   struct cfs_rq   *cfs_rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fea8849cc12..c31d3da081fb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -450,6 +450,19 @@ find_matching_se(struct sched_entity **se, struct 
> sched_entity **pse)
>   }
>  }
>  
> +/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */
> +static void enqueue_entity_cfs_rqs(struct sched_entity *se)
> +{
> + SCHED_WARN_ON(!entity_is_task(se));
> +
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
> +
> + if (list_add_leaf_cfs_rq(cfs_rq))
> + break;
> + }
> +}
> +
>  #else/* !CONFIG_FAIR_GROUP_SCHED */
>  
>  static inline struct task_struct *task_of(struct sched_entity *se)
> @@ -672,8 +685,14 @@ int sched_proc_update_handler(struct ctl_table *table, 
> int write,
>   */
>  static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
>  {
> - if (unlikely(se->load.weight != NICE_0_LOAD))
> + if (task_se_in_cgroup(se)) {
> + unsigned long h_weight = task_se_h_weight(se);
> + if (h_weight != se->h_load.weight)
> + update_load_set(&se->h_load, h_weight);
> + delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
> + } else if (unlikely(se->load.weight != NICE_0_LOAD)) {
> delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> + }
>  
>   return delta;
>  }
> @@ -687,22 +706,16 @@ static inline u64 calc_delta_fair(u64 delta, struct 
> sched_entity *se)
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>   u64 slice = sysctl_sched_latency;
> + struct load_weight *load = &cfs_rq->load;
> + struct load_weight lw;
>  
> - for_each_sched_entity(se) {
> - struct load_weight *load;
> - struct load_weight lw;
> -
> - cfs_rq = cfs_rq_of(se);
> - load = &cfs_rq->load;
> + if (unlikely(!se->on_rq)) {
> + lw = cfs_rq->load;
>  
> - if (unlikely(!se->on_rq)) {
> - lw = cfs_rq->load;
> -
> - update_load_add(&lw, se->load.weight);
> - load = &lw;
> - }
> - slice = __calc_delta(slice, se->load.weight, load);
> + update_load_add(&lw, task_se_h_weight(se));
> + load = &lw;
>   }
> + slice = __calc_delta(slice, task_se_h_weight(se), load);
>  
>   /*
>* To avoid cache thrashing, run at least sysctl_sched_min_granularity.
> @@ -2703,16 +2716,28 @@ static inline void update_scan_period(struct 
> task_struct *p, int new_cpu)
>  static void
>  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> - update_load_add(&cfs_rq->load, se->load.weight);
> - if (!parent_entity(se))
> + struct rq *rq;
> +
> + if (task_se_in_cgroup(se)) {
> + struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
> + unsigned long h_weight;
> +
> + update_load_add(&cgroup_rq->load, se->load.weight);
> + cgroup_rq->nr_running++;
> +
> + /* Add the hierarchical weight to the CPU rq */
> + h_weight = task_se_h_weight(se);
> + se->enqueued_h_weight = h_weight;
> + update_load_add(&rq_of(cfs_rq)->load, h_weight);

Ok I think this is where we need to handle the newly enqueued 

Re: [PATCH 09/10] sched,fair: add helper functions for flattened runqueue

2019-07-01 Thread Josef Bacik
On Fri, Jun 28, 2019 at 04:49:12PM -0400, Rik van Riel wrote:
> Add helper functions to make the flattened runqueue patch a little smaller.
> 
> The task_se_h_weight function is similar to task_se_h_load, but scales the
> task weight by the group weight, without taking the task's duty cycle into
> account.
> 
> The task_se_in_cgroup helper is functionally identical to parent_entity,
> but directly calling a function with that name obscures what the other
> code is trying to use it for, and would make the code harder to understand.
> 
> Signed-off-by: Rik van Riel 
> ---
>  kernel/sched/fair.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a751e7a9b228..6fea8849cc12 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -243,6 +243,7 @@ static u64 __calc_delta(u64 delta_exec, unsigned long 
> weight, struct load_weight
>  
>  const struct sched_class fair_sched_class;
>  static unsigned long task_se_h_load(struct sched_entity *se);
> +static unsigned long task_se_h_weight(struct sched_entity *se);
>  
>  /**
>   * CFS operations on generic schedulable entities:
> @@ -411,6 +412,12 @@ static inline struct sched_entity *parent_entity(struct 
> sched_entity *se)
>   return se->parent;
>  }
>  
> +/* Is this (task) sched_entity in a non-root cgroup? */
> +static inline bool task_se_in_cgroup(struct sched_entity *se)
> +{
> + return parent_entity(se);
> +}
> +
>  static void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> @@ -7819,6 +7826,20 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>   }
>  }
>  
> +static unsigned long task_se_h_weight(struct sched_entity *se)
> +{
> + struct cfs_rq *cfs_rq;
> +
> + if (!task_se_in_cgroup(se))
> + return se->load.weight;
> +
> + cfs_rq = group_cfs_rq_of_parent(se);
> + update_cfs_rq_h_load(cfs_rq);
> +
> + /* Reduce the load.weight by the h_load of the group the task is in. */
> + return (cfs_rq->h_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT;

This should be

scale_load_down(cfs_rq->h_load * se->load.weight);

Thanks,

Josef
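
For context, scale_load_down() is the fixed-point helper from kernel/sched/sched.h; paraphrased from memory (the exact definition varies by kernel version), it looks roughly like this. Using the helper instead of an open-coded shift keeps 32-bit builds, where the extra resolution does not exist, correct as well:

/*
 * Paraphrased sketch of the load-weight fixed-point helpers in
 * kernel/sched/sched.h: 64-bit kernels carry SCHED_FIXEDPOINT_SHIFT
 * extra bits of resolution in load weights, 32-bit kernels do not.
 */
#ifdef CONFIG_64BIT
# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define scale_load(w)		(w)
# define scale_load_down(w)	(w)
#endif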


Re: [PATCH 02/10] sched: change /proc/sched_debug fields

2019-07-01 Thread Josef Bacik
On Fri, Jun 28, 2019 at 04:49:05PM -0400, Rik van Riel wrote:
> Remove some fields from /proc/sched_debug that are removed from
> sched_entity in a subsequent patch, and add h_load, which comes in
> very handy to debug CPU controller weight distribution.
> 
> Signed-off-by: Rik van Riel 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 01/10] sched: introduce task_se_h_load helper

2019-07-01 Thread Josef Bacik
On Fri, Jun 28, 2019 at 04:49:04PM -0400, Rik van Riel wrote:
> Sometimes the hierarchical load of a sched_entity needs to be calculated.
> Rename task_h_load to task_se_h_load, and directly pass a sched_entity to
> that function.
> 
> Move the function declaration up above where it will be used later.
> 
> No functional changes.
> 
> Signed-off-by: Rik van Riel 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 03/10] sched,fair: redefine runnable_load_avg as the sum of task_h_load

2019-07-01 Thread Josef Bacik
On Mon, Jul 01, 2019 at 12:47:35PM -0400, Rik van Riel wrote:
> On Mon, 2019-07-01 at 12:29 -0400, Josef Bacik wrote:
> > On Fri, Jun 28, 2019 at 04:49:06PM -0400, Rik van Riel wrote:
> > > The runnable_load magic is used to quickly propagate information
> > > about
> > > runnable tasks up the hierarchy of runqueues. The runnable_load_avg
> > > is
> > > mostly used for the load balancing code, which only examines the
> > > value at
> > > the root cfs_rq.
> > > 
> > > Redefine the root cfs_rq runnable_load_avg to be the sum of
> > > task_h_loads
> > > of the runnable tasks. This works because the hierarchical
> > > runnable_load of
> > > a task is already equal to the task_se_h_load today. This provides
> > > enough
> > > information to the load balancer.
> > > 
> > > The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> > > used for anything, so don't bother calculating those.
> > > 
> > > This removes one of the things that the code currently traverses
> > > the
> > > cgroup hierarchy for, and getting rid of it brings us one step
> > > closer
> > > to a flat runqueue for the CPU controller.
> > > 
> > 
> > My memory on this stuff is very hazy, but IIRC we had the
> > runnable_sum and the
> > runnable_avg separated out because you could have the avg lag behind
> > the sum.
> > So like you enqueue a bunch of new entities who's avg may have
> > decayed a bunch
> > and so their overall load is not felt on the CPU until they start
> > running, and
> > now you've overloaded that CPU.  The sum was there to make sure new
> > things
> > coming onto the CPU added actual load to the queue instead of looking
> > like there
> > was no load.
> > 
> > Is this going to be a problem now with this new code?
> 
> That is a good question!
> 
> On the one hand, you may well be right.
> 
> On the other hand, while I see the old code calculating
> runnable_sum, I don't really see it _using_ it to drive
> scheduling decisions.
> 
> It would be easy to define the CPU cfs_rq->runnable_load_sum
> as being the sum of task_se_h_weight() of each runnable task
> on the CPU (for example), but what would we use it for?
> 
> What am I missing?

It's super subtle, but you're right in that we don't really need the
runnable_avg per se, but what you do is you kill calc_group_runnable, which used
to do this

load_avg = max(cfs_rq->avg.load_avg,
   scale_load_down(cfs_rq->load.weight));

runnable = max(cfs_rq->avg.runnable_load_avg,
   scale_load_down(cfs_rq->runnable_weight));

so we'd account for this weirdness of adding a previously idle process to a new
CPU and overloading the CPU because we'd add a bunch of these 0 weight looking
tasks that suddenly all wake up and are on the same CPU.  So we used the
runnable_weight to account for what was actually happening, and the max of
load_avg and the weight to figure out what the potential load would be.

What you've done here is change the weighting stuff to be completely based on
load avg, which is problematic for the reasons above.  Did you fix this later on
in your patches?  If so then just tell me to keep reading and I'll do that ;).
Thanks,

Josef
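
To make the concern concrete, a worked example (numbers purely illustrative, not from any report in this thread):

/*
 * A task with load.weight == 1024 sleeps long enough for its load_avg to
 * decay to ~0.  Wake 16 such tasks on one CPU at the same time:
 *
 *   sum of load.weight  = 16 * 1024  -> the CPU is clearly overcommitted
 *   sum of load_avg     ~ 0          -> the CPU still looks nearly idle
 *                                       until the averages catch up
 *
 * The max(load_avg, scale_load_down(weight)) logic quoted above makes the
 * newly enqueued weight visible immediately instead of waiting for the
 * averages to ramp back up.
 */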


Re: [PATCH 03/10] sched,fair: redefine runnable_load_avg as the sum of task_h_load

2019-07-01 Thread Josef Bacik
On Fri, Jun 28, 2019 at 04:49:06PM -0400, Rik van Riel wrote:
> The runnable_load magic is used to quickly propagate information about
> runnable tasks up the hierarchy of runqueues. The runnable_load_avg is
> mostly used for the load balancing code, which only examines the value at
> the root cfs_rq.
> 
> Redefine the root cfs_rq runnable_load_avg to be the sum of task_h_loads
> of the runnable tasks. This works because the hierarchical runnable_load of
> a task is already equal to the task_se_h_load today. This provides enough
> information to the load balancer.
> 
> The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> used for anything, so don't bother calculating those.
> 
> This removes one of the things that the code currently traverses the
> cgroup hierarchy for, and getting rid of it brings us one step closer
> to a flat runqueue for the CPU controller.
> 

My memory on this stuff is very hazy, but IIRC we had the runnable_sum and the
runnable_avg separated out because you could have the avg lag behind the sum.
So like you enqueue a bunch of new entities whose avg may have decayed a bunch
and so their overall load is not felt on the CPU until they start running, and
now you've overloaded that CPU.  The sum was there to make sure new things
coming onto the CPU added actual load to the queue instead of looking like there
was no load.

Is this going to be a problem now with this new code?



> +static inline void
> +update_runnable_load_avg(struct sched_entity *se)
> +{
> + struct cfs_rq *root_cfs_rq = &cfs_rq_of(se)->rq->cfs;
> + long new_h_load, delta;
> +
> + SCHED_WARN_ON(!entity_is_task(se));
> +
> + if (!se->on_rq)
> + return;
>  
> - sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
> - sub_positive(&cfs_rq->avg.runnable_load_sum,
> -  se_runnable(se) * se->avg.runnable_load_sum);
> + new_h_load = task_se_h_load(se);
> + delta = new_h_load - se->enqueued_h_load;
> + root_cfs_rq->avg.runnable_load_avg += delta;

Should we use add_positive here?  Thanks,

Josef
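
For reference, a paraphrased sketch of the add_positive() helper in kernel/sched/fair.c that the question refers to: it applies a signed delta to an unsigned average and clamps at zero on underflow, so a stale negative delta cannot wrap the value around:

/* Paraphrased from kernel/sched/fair.c; not the exact upstream macro. */
#define add_positive(_ptr, _val) do {				\
	typeof(_ptr) ptr = (_ptr);				\
	typeof(_val) val = (_val);				\
	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
								\
	res = var + val;					\
	if (val < 0 && res > var)	/* underflow wrapped */	\
		res = 0;					\
	WRITE_ONCE(*ptr, res);					\
} while (0)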


Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-06-25 Thread Josef Bacik
On Wed, Jun 26, 2019 at 10:39:36AM +0800, Rong Chen wrote:
> On 6/25/19 10:22 PM, Josef Bacik wrote:
> > On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote:
> > > "Huang, Ying"  writes:
> > > 
> > > > "Huang, Ying"  writes:
> > > > 
> > > > > Hi, Josef,
> > > > > 
> > > > > kernel test robot  writes:
> > > > > 
> > > > > > Greeting,
> > > > > > 
> > > > > > FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to 
> > > > > > commit:
> > > > > > 
> > > > > > 
> > > > > > commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
> > > > > > delalloc metadata differently")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 
> > > > > > master
> > > > > > 
> > > > > > in testcase: aim7
> > > > > > on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 
> > > > > > 3.00GHz with 384G memory
> > > > > > with following parameters:
> > > > > > 
> > > > > > disk: 4BRD_12G
> > > > > > md: RAID0
> > > > > > fs: btrfs
> > > > > > test: disk_rr
> > > > > > load: 1500
> > > > > > cpufreq_governor: performance
> > > > > > 
> > > > > > test-description: AIM7 is a traditional UNIX system level benchmark
> > > > > > suite which is used to test and measure the performance of multiuser
> > > > > > system.
> > > > > > test-url: 
> > > > > > https://sourceforge.net/projects/aimbench/files/aim-suite7/
> > > > > Here's another regression, do you have time to take a look at this?
> > > > Ping
> > > Ping again ...
> > > 
> > Finally got time to look at this but I can't get the reproducer to work
> > 
> > root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml
> > Traceback (most recent call last):
> >  11: from /root/lkp-tests/bin/run-local:18:in `'
> >  10: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   9: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   8: from /root/lkp-tests/lib/yaml.rb:5:in `'
> >   7: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   6: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   5: from /root/lkp-tests/lib/common.rb:9:in `'
> >   4: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   3: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> >   2: from /root/lkp-tests/lib/array_ext.rb:3:in `'
> >   1: from 
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
> > /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': 
> > cannot load such file -- active_support/core_ext/enumerable (LoadError)
> 
> Hi Josef,
> 
> I tried the latest lkp-tests, and didn't have the problem. Could you please
> update the lkp-tests repo and run "lkp install" again?
> 

I updated it this morning, and I just updated it now, my tree is on

2c5b1a95b08dbe81bba64419c482a877a3b424ac

lkp install says everything is installed except

No match for argument: libipc-run-perl

and it still doesn't run properly.  Thanks,

Josef


Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-06-25 Thread Josef Bacik
On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote:
> "Huang, Ying"  writes:
> 
> > "Huang, Ying"  writes:
> >
> >> Hi, Josef,
> >>
> >> kernel test robot  writes:
> >>
> >>> Greeting,
> >>>
> >>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit:
> >>>
> >>>
> >>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
> >>> delalloc metadata differently")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>
> >>> in testcase: aim7
> >>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz 
> >>> with 384G memory
> >>> with following parameters:
> >>>
> >>>   disk: 4BRD_12G
> >>>   md: RAID0
> >>>   fs: btrfs
> >>>   test: disk_rr
> >>>   load: 1500
> >>>   cpufreq_governor: performance
> >>>
> >>> test-description: AIM7 is a traditional UNIX system level benchmark
> >>> suite which is used to test and measure the performance of multiuser
> >>> system.
> >>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
> >>
> >> Here's another regression, do you have time to take a look at this?
> >
> > Ping
> 
> Ping again ...
> 

Finally got time to look at this but I can't get the reproducer to work

root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml
Traceback (most recent call last):
11: from /root/lkp-tests/bin/run-local:18:in `'
10: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 9: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 8: from /root/lkp-tests/lib/yaml.rb:5:in `'
 7: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 6: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 5: from /root/lkp-tests/lib/common.rb:9:in `'
 4: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 3: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
 2: from /root/lkp-tests/lib/array_ext.rb:3:in `'
 1: from /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in 
`require'
/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': cannot 
load such file -- active_support/core_ext/enumerable (LoadError)

I can't even figure out from the job file or anything how I'm supposed to run
AIM7 itself.  This is on a Fedora 30 box, so it's relatively recent.  Thanks,

Josef


Re: [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG

2019-06-18 Thread Josef Bacik
On Tue, Jun 18, 2019 at 07:42:46AM +, Naohiro Aota wrote:
> On 2019/06/13 23:09, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:14PM +0900, Naohiro Aota wrote:
> >> If the btrfs volume has mirrored block groups, it unconditionally makes
> >> un-mirrored block groups read only. When we have mirrored block groups, but
> >> don't have writable block groups, this will drop all writable block groups.
> >> So, check if we have at least one writable mirrored block group before
> >> setting un-mirrored block groups read only.
> >>
> > 
> > I don't understand why you want this.  Thanks,
> > 
> > Josef
> > 
> 
> This is necessary to handle e.g. btrfs/124 case.
> 
> When we mount degraded RAID1 FS and write to it, and then
> re-mount with full device, the write pointers of corresponding
> zones of written BG differ.  The patch 07 mark such block group
> as "wp_broken" and make it read only.  In this situation, we only
> have read only RAID1 BGs because of "wp_broken" and un-mirrored BGs
> are also marked read only, because we have RAID1 BGs.
> As a result, all the BGs are now read only, so that we
> cannot even start the rebalance to fix the situation.

Ah ok, please add this explanation to the changelog.  Thanks,

Josef


Re: [PATCH 3/8] blkcg: implement REQ_CGROUP_PUNT

2019-06-14 Thread Josef Bacik
On Thu, Jun 13, 2019 at 05:33:45PM -0700, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup.  This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
> to a dedicated per-blkcg work item to avoid such priority inversions.
> 
> This will be used to fix priority inversions in btrfs compression and
> should be generally useful as we grow filesystem support for
> comprehensive IO control.
> 
> Signed-off-by: Tejun Heo 
> Cc: Chris Mason 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 2/8] blkcg, writeback: Implement wbc_blkcg_css()

2019-06-14 Thread Josef Bacik
On Thu, Jun 13, 2019 at 05:33:44PM -0700, Tejun Heo wrote:
> Add a helper to determine the target blkcg from wbc.
> 
> Signed-off-by: Tejun Heo 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 1/8] blkcg, writeback: Add wbc->no_wbc_acct

2019-06-14 Thread Josef Bacik
On Thu, Jun 13, 2019 at 05:33:43PM -0700, Tejun Heo wrote:
> When writeback IOs are bounced through async layers, the IOs should
> only be accounted against the wbc from the original bdi writeback to
> avoid confusing cgroup inode ownership arbitration.  Add
> wbc->no_wbc_acct to allow disabling wbc accounting.  This will be used
> make btfs compression work well with cgroup IO control.
> 
> Signed-off-by: Tejun Heo 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 0/2] Fix misuse of blk_rq_stats in blk-iolatency

2019-06-14 Thread Josef Bacik
On Fri, Jun 14, 2019 at 02:44:11PM +0300, Pavel Begunkov (Silence) wrote:
> From: Pavel Begunkov 
> 
> There are implicit assumptions about struct blk_rq_stats, which make
> it very easy to misuse. The first patch fixes consequences, and the
> second employs type-system to prevent recurrences.
> 
> 
> Pavel Begunkov (2):
>   blk-iolatency: Fix zero mean in previous stats
>   blk-stats: Introduce explicit stat staging buffers
> 

I don't have a problem with this, but it's up to Jens I suppose

Acked-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 2/2] nbd: add support for nbd as root device

2019-06-13 Thread Josef Bacik
On Wed, Jun 12, 2019 at 07:31:44PM +0300, roman.stratiie...@globallogic.com 
wrote:
> From: Roman Stratiienko 
> 
> Adding support to nbd to use it as a root device. This code essentially
> provides a minimal nbd-client implementation within the kernel. It opens
> a socket and makes the negotiation with the server. Afterwards it passes
> the socket to the normal nbd-code to handle the connection.
> 
> The arguments for the server are passed via kernel command line.
> The kernel command line has the format
> 'nbdroot=[:]/'.
> SERVER_IP is optional. If it is not available it will use the
> root_server_addr transmitted through DHCP.
> 
> Based on those arguments, the connection to the server is established
> and is connected to the nbd0 device. The rootdevice therefore is
> root=/dev/nbd0.
> 
> Patch was initialy posted by Markus Pargmann 
> and can be found at https://lore.kernel.org/patchwork/patch/532556/
> 
> Change-Id: I78f7313918bf31b9dc01a74a42f0f068bede312c
> Signed-off-by: Roman Stratiienko 
> Reviewed-by: Aleksandr Bulyshchenko 

Just throw nbd-client in your initramfs.  Every nbd server has its own
handshake protocol; embedding one particular server's handshake protocol into the
kernel isn't the answer here.  Thanks,

Josef


Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices

2019-06-13 Thread Josef Bacik
On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
> avoid costly run-time zone report commands to test the device zones type
> during block allocation, attach the seqzones bitmap to the device structure
> to indicate if a zone is sequential or accept random writes.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone.
> 
> Signed-off-by: Damien Le Moal 
> Signed-off-by: Naohiro Aota 
> ---
>  fs/btrfs/volumes.c | 143 +
>  fs/btrfs/volumes.h |  33 +++
>  2 files changed, 176 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1c2a6e4b39da..b673178718e3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path,
>   return ret;
>  }
>  
> +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> +  struct blk_zone **zones,
> +  unsigned int *nr_zones, gfp_t gfp_mask)
> +{
> + struct blk_zone *z = *zones;
> + int ret;
> +
> + if (!z) {
> + z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
> + if (!z)
> + return -ENOMEM;
> + }
> +
> + ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
> +   z, nr_zones, gfp_mask);
> + if (ret != 0) {
> + btrfs_err(device->fs_info, "Get zone at %llu failed %d\n",
> +   pos, ret);
> + return ret;
> + }
> +
> + *zones = z;
> +
> + return 0;
> +}
> +
> +static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device)
> +{
> + kfree(device->seq_zones);
> + kfree(device->empty_zones);
> + device->seq_zones = NULL;
> + device->empty_zones = NULL;
> + device->nr_zones = 0;
> + device->zone_size = 0;
> + device->zone_size_shift = 0;
> +}
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +struct blk_zone *zone, gfp_t gfp_mask)
> +{
> + unsigned int nr_zones = 1;
> + int ret;
> +
> + ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
> + if (ret != 0 || !nr_zones)
> + return ret ? ret : -EIO;
> +
> + return 0;
> +}
> +
> +int btrfs_get_dev_zonetypes(struct btrfs_device *device)
> +{
> + struct block_device *bdev = device->bdev;
> + sector_t nr_sectors = bdev->bd_part->nr_sects;
> + sector_t sector = 0;
> + struct blk_zone *zones = NULL;
> + unsigned int i, n = 0, nr_zones;
> + int ret;
> +
> + device->zone_size = 0;
> + device->zone_size_shift = 0;
> + device->nr_zones = 0;
> + device->seq_zones = NULL;
> + device->empty_zones = NULL;
> +
> + if (!bdev_is_zoned(bdev))
> + return 0;
> +
> + device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> + device->zone_size_shift = ilog2(device->zone_size);
> + device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> + if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> + device->nr_zones++;
> +
> + device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> + sizeof(*device->seq_zones), GFP_KERNEL);
> + if (!device->seq_zones)
> + return -ENOMEM;
> +
> + device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> +   sizeof(*device->empty_zones), GFP_KERNEL);
> + if (!device->empty_zones)
> + return -ENOMEM;
> +
> +#define BTRFS_REPORT_NR_ZONES   4096
> +
> + /* Get zones type */
> + while (sector < nr_sectors) {
> + nr_zones = BTRFS_REPORT_NR_ZONES;
> + ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
> + &zones, &nr_zones, GFP_KERNEL);
> + if (ret != 0 || !nr_zones) {
> + if (!ret)
> + ret = -EIO;
> + goto out;
> + }
> +
> + for (i = 0; i < nr_zones; i++) {
> + if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> + set_bit(n, device->seq_zones);
> + if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> + set_bit(n, device->empty_zones);
> + sector = zones[i].start + zones[i].len;
> + n++;
> + }
> + }
> +
> + if (n != device->nr_zones) {
> + btrfs_err(device->fs_info,
> +   "Inconsistent number of zones (%u / %u)\n", n,
> +   device->nr_zones);
> + ret = -EIO;
> + 

Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode

2019-06-13 Thread Josef Bacik
On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.
> 
> Signed-off-by: Damien Le Moal 
> Signed-off-by: Naohiro Aota 
> ---
>  fs/btrfs/disk-io.c | 11 +++
>  fs/btrfs/disk-io.h |  1 +
>  fs/btrfs/extent-tree.c |  4 
>  fs/btrfs/scrub.c   |  2 ++
>  4 files changed, 18 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7c1404c76768..ddbb02906042 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct 
> block_device *bdev)
>   return latest;
>  }
>  
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
> +{
> + /* any address is good on a regular (zone_size == 0) device */
> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */

This is not how you do multi-line comments in the kernel.  Thanks,

Josef
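
For reference, the kernel style being asked for is a single block comment rather than two stacked one-liners, e.g.:

/*
 * Any address is good on a regular (zone_size == 0) device.
 * On a zoned device, only non-SEQUENTIAL-WRITE-REQUIRED zones can hold
 * a super block, since it must be updatable in place.
 */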


Re: [PATCH 11/19] btrfs: introduce submit buffer

2019-06-13 Thread Josef Bacik
On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> Sequential allocation is not enough to maintain sequential delivery of
> write IOs to the device. Various features (async compress, async checksum,
> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> buffer to sort WRITE bios belonging to a block group and sort them out
> sequentially in increasing block address to achieve sequential write
> sequences with __btrfs_map_bio().
> 
> Signed-off-by: Naohiro Aota 

I hate everything about this.  Can't we just use the plugging infrastructure for
this and then make sure it re-orders the bios before submitting them?  Also
what's to prevent the block layer scheduler from re-arranging these io's?
Thanks,

Josef
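
For context, the plugging interface referred to above is the blk_start_plug()/blk_finish_plug() pair, which batches the bios a task submits and hands them to the driver together. A minimal sketch, where the pre-sorted bio list and the helper name are placeholders rather than anything in the btrfs patch:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Submit an already-sorted batch of bios under a plug (illustrative only). */
static void submit_sorted_bios(struct bio_list *sorted)
{
	struct blk_plug plug;
	struct bio *bio;

	blk_start_plug(&plug);
	while ((bio = bio_list_pop(sorted)))
		submit_bio(bio);
	blk_finish_plug(&plug);
}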


Re: [PATCH 16/19] btrfs: wait existing extents before truncating

2019-06-13 Thread Josef Bacik
On Fri, Jun 07, 2019 at 10:10:22PM +0900, Naohiro Aota wrote:
> When truncating a file, file buffers which have already been allocated but
> not yet written may be truncated.  Truncating these buffers could cause
> breakage of a sequential write pattern in a block group if the truncated
> blocks are for example followed by blocks allocated to another file. To
> avoid this problem, always wait for write out of all unwritten buffers
> before proceeding with the truncate execution.
> 
> Signed-off-by: Naohiro Aota 
> ---
>  fs/btrfs/inode.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 89542c19d09e..4e8c7921462f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5137,6 +5137,17 @@ static int btrfs_setsize(struct inode *inode, struct 
> iattr *attr)
>   btrfs_end_write_no_snapshotting(root);
>   btrfs_end_transaction(trans);
>   } else {
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> + if (btrfs_fs_incompat(fs_info, HMZONED)) {
> + u64 sectormask = fs_info->sectorsize - 1;
> +
> + ret = btrfs_wait_ordered_range(inode,
> +newsize & (~sectormask),
> +(u64)-1);

Use ALIGN().  Thanks,

Josef
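
For context, the open-coded newsize & (~sectormask) rounds down to a sector boundary, so the helper-based form of the same call would presumably look something like this (round_down() being the round-down counterpart of ALIGN()):

	ret = btrfs_wait_ordered_range(inode,
				       round_down(newsize, fs_info->sectorsize),
				       (u64)-1);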


Re: [PATCH 18/19] btrfs: support dev-replace in HMZONED mode

2019-06-13 Thread Josef Bacik
On Fri, Jun 07, 2019 at 10:10:24PM +0900, Naohiro Aota wrote:
> Currently, dev-replace copy all the device extents on source device to the
> target device, and it also clones new incoming write I/Os from users to the
> source device into the target device.
> 
> Cloning incoming IOs can break the sequential write rule in the target
> device. When write is mapped in the middle of block group, that I/O is
> directed in the middle of a zone of target device, which breaks the
> sequential write rule.
> 
> However, the cloning function cannot be simply disabled since incoming I/Os
> targeting already copied device extents must be cloned so that the I/O is
> executed on the target device.
> 
> We cannot use dev_replace->cursor_{left,right} to determine whether bio
> is going to not yet copied region.  Since we have time gap between
> finishing btrfs_scrub_dev() and rewriting the mapping tree in
> btrfs_dev_replace_finishing(), we can have newly allocated device extent
> which is never cloned (by handle_ops_on_dev_replace) nor copied (by the
> dev-replace process).
> 
> So the point is to copy only already existing device extents. This patch
> introduce mark_block_group_to_copy() to mark existing block group as a
> target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
> check the flag to do their job.
> 
> This patch also handles empty region between used extents. Since
> dev-replace is smart to copy only used extents on source device, we have to
> fill the gap to honor the sequential write rule in the target device.
> 
> Signed-off-by: Naohiro Aota 
> ---
>  fs/btrfs/ctree.h   |   1 +
>  fs/btrfs/dev-replace.c |  96 +++
>  fs/btrfs/extent-tree.c |  32 +++-
>  fs/btrfs/scrub.c   | 169 +
>  fs/btrfs/volumes.c |  27 ++-
>  5 files changed, 319 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index dad8ea5c3b99..a0be2b96117a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -639,6 +639,7 @@ struct btrfs_block_group_cache {
>   unsigned int has_caching_ctl:1;
>   unsigned int removed:1;
>   unsigned int wp_broken:1;
> + unsigned int to_copy:1;
>  
>   int disk_cache_state;
>  
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index fbe5ea2a04ed..5011b5ce0e75 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -263,6 +263,13 @@ static int btrfs_init_dev_replace_tgtdev(struct 
> btrfs_fs_info *fs_info,
>   device->dev_stats_valid = 1;
>   set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>   device->fs_devices = fs_info->fs_devices;
> + if (bdev_is_zoned(bdev)) {
> + ret = btrfs_get_dev_zonetypes(device);
> + if (ret) {
> + mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> + goto error;
> + }
> + }
> list_add(&device->dev_list, &fs_info->fs_devices->devices);
>   fs_info->fs_devices->num_devices++;
>   fs_info->fs_devices->open_devices++;
> @@ -396,6 +403,88 @@ static char* btrfs_dev_name(struct btrfs_device *device)
>   return rcu_str_deref(device->name);
>  }
>  
> +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
> + struct btrfs_device *src_dev)
> +{
> + struct btrfs_path *path;
> + struct btrfs_key key;
> + struct btrfs_key found_key;
> + struct btrfs_root *root = fs_info->dev_root;
> + struct btrfs_dev_extent *dev_extent = NULL;
> + struct btrfs_block_group_cache *cache;
> + struct extent_buffer *l;
> + int slot;
> + int ret;
> + u64 chunk_offset, length;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> + path->reada = READA_FORWARD;
> + path->search_commit_root = 1;
> + path->skip_locking = 1;
> +
> + key.objectid = src_dev->devid;
> + key.offset = 0ull;
> + key.type = BTRFS_DEV_EXTENT_KEY;
> +
> + while (1) {
> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> + if (ret < 0)
> + break;
> + if (ret > 0) {
> + if (path->slots[0] >=
> + btrfs_header_nritems(path->nodes[0])) {
> + ret = btrfs_next_leaf(root, path);
> + if (ret < 0)
> + break;
> + if (ret > 0) {
> + ret = 0;
> + break;
> + }
> + } else {
> + ret = 0;
> + }
> + }
> +
> + l = path->nodes[0];
> + slot = path->slots[0];
> +
> + btrfs_item_key_to_cpu(l, &found_key, slot);
> +
> + if (found_key.objectid != src_dev->devid)
> +

Re: [LKP] [btrfs] 302167c50b: fio.write_bw_MBps -12.4% regression

2019-05-24 Thread Josef Bacik
On Fri, May 24, 2019 at 03:46:17PM +0800, Huang, Ying wrote:
> "Huang, Ying"  writes:
> 
> > "Huang, Ying"  writes:
> >
> >> Hi, Josef,
> >>
> >> kernel test robot  writes:
> >>
> >>> Greeting,
> >>>
> >>> FYI, we noticed a -12.4% regression of fio.write_bw_MBps due to commit:
> >>>
> >>>
> >>> commit: 302167c50b32e7fccc98994a91d40ddbbab04e52 ("btrfs: don't end
> >>> the transaction for delayed refs in throttle")
> >>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git 
> >>> pending-fixes
> >>>
> >>> in testcase: fio-basic
> >>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 
> >>> with 64G memory
> >>> with following parameters:
> >>>
> >>>   runtime: 300s
> >>>   nr_task: 8t
> >>>   disk: 1SSD
> >>>   fs: btrfs
> >>>   rw: randwrite
> >>>   bs: 4k
> >>>   ioengine: sync
> >>>   test_size: 400g
> >>>   cpufreq_governor: performance
> >>>   ucode: 0xb2e
> >>>
> >>> test-description: Fio is a tool that will spawn a number of threads
> >>> or processes doing a particular type of I/O action as specified by
> >>> the user.
> >>> test-url: https://github.com/axboe/fio
> >>>
> >>>
> >>
> >> Do you have time to take a look at this regression?
> >
> > Ping
> 
> Ping again.
> 

This happens because now we rely more on on-demand flushing than the catchup
flushing that happened before.  This is just one case where it's slightly worse,
overall this change provides better latencies, and even in this result it
provided better completion latencies because we're not randomly flushing at the
end of a transaction.  It does appear to be costing writes in that they will
spend more time flushing than before, so you get slightly lower throughput on
pure small write workloads.  I can't actually see the slowdown locally.

This patch is here to stay, it just shows we need to continue to refine the
flushing code to be less spikey/painful.  Thanks,

Josef


Re: [PATCH] mm: filemap: correct the comment about VM_FAULT_RETRY

2019-05-15 Thread Josef Bacik
On Fri, Apr 26, 2019 at 07:22:11AM +0800, Yang Shi wrote:
> The commit 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking
> operations") changed when mmap_sem is dropped during filemap page fault
> and when returning VM_FAULT_RETRY.
> 
> Correct the comment to reflect the change.
> 
> Cc: Josef Bacik 
> Signed-off-by: Yang Shi 
> ---

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] btrfs: fix a NULL pointer dereference

2019-03-14 Thread Josef Bacik
On Thu, Mar 14, 2019 at 09:54:07AM +0200, Nikolay Borisov wrote:
> 
> 
> On 14.03.19 г. 9:50 ч., Kangjie Lu wrote:
> > btrfs_lookup_block_group may fail and return NULL. The fix goes
> > to out when it fails to avoid NULL pointer dereference.
> 
> Actually no, in this case btrfs_lookup_block_group must never fail
> because if we have an allocated eb then it must have been allocated from
> a bg.
> 

Agreed, we only get to btrfs_free_tree_block() if we are actually deleting the
extent buffer.  We would have had to read in the extent buffer first to get
here, which would have failed if there was no block group.  We can't get into
this situation with a specifically crafted file system to exploit this as we'd
bail out well before we could get to btrfs_free_tree_block().  Adding an
ASSERT() makes sure developers aren't doing anything stupid, but again we'd have
to be doing something _super_ stupid to hit it.  Thanks,

Josef
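
A sketch of the kind of assertion being discussed, for illustration only (not the actual patch):

	cache = btrfs_lookup_block_group(fs_info, buf->start);
	/* Cannot fail here: the extent buffer was read from this block group. */
	ASSERT(cache);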


Re: [PATCH v2 0/3] blkcg: sync() isolation

2019-03-08 Thread Josef Bacik
On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> = Problem =
> 
> When sync() is executed from a high-priority cgroup, the process is forced to
> wait the completion of the entire outstanding writeback I/O, even the I/O that
> was originally generated by low-priority cgroups potentially.
> 
> This may cause massive latencies to random processes (even those running in 
> the
> root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> priority inversion problem.
> 
> This topic has been previously discussed here:
> https://patchwork.kernel.org/patch/10804489/
> 

Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
this some more offline.

We don't want cgroup to become the arbiter of correctness/behavior here.  We
just want it to be isolating things.

For you that means you can drop the per-cgroup flag stuff, and only do the
priority boosting for multiple sync(2) waiters.  That is a real priority
inversion that needs to be fixed.  io.latency and io.max are capable of noticing
that a low priority group is going above their configured limits and putting
pressure elsewhere accordingly.

Tejun said he'd rather see the sync(2) isolation be done at the namespace level.
That way if you have fs namespacing you are already isolated to your namespace.
If you feel like tackling that then hooray, but that's a separate dragon to slay
so don't feel like you have to right now.

This way we keep cgroup doing its job, controlling resources.  Then we allow
namespacing to do its thing, isolating resources.  Thanks,

Josef


Re: [PATCH v2 2/3] blkcg: introduce io.sync_isolation

2019-03-07 Thread Josef Bacik
On Thu, Mar 07, 2019 at 07:08:33PM +0100, Andrea Righi wrote:
> Add a flag to the blkcg cgroups to make sync()'ers in a cgroup only be
> allowed to write out pages that have been dirtied by the cgroup itself.
> 
> This flag is disabled by default (meaning that we are not changing the
> previous behavior by default).
> 
> When this flag is enabled any cgroup can write out only dirty pages that
> belong to the cgroup itself (except for the root cgroup that would still
> be able to write out all pages globally).
> 
> Signed-off-by: Andrea Righi 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH v2 1/3] blkcg: prevent priority inversion problem during sync()

2019-03-07 Thread Josef Bacik
On Thu, Mar 07, 2019 at 07:08:32PM +0100, Andrea Righi wrote:
> Prevent priority inversion problem when a high-priority blkcg issues a
> sync() and it is forced to wait the completion of all the writeback I/O
> generated by any other low-priority blkcg, causing massive latencies to
> processes that shouldn't be I/O-throttled at all.
> 
> The idea is to save a list of blkcg's that are waiting for writeback:
> every time a sync() is executed the current blkcg is added to the list.
> 
> Then, when I/O is throttled, if there's a blkcg waiting for writeback
> different than the current blkcg, no throttling is applied (we can
> probably refine this logic later, i.e., a better policy could be to
> adjust the throttling I/O rate using the blkcg with the highest speed
> from the list of waiters - priority inheritance, kinda).
> 
> Signed-off-by: Andrea Righi 
> ---
>  block/blk-cgroup.c   | 131 +++
>  block/blk-throttle.c |  11 ++-
>  fs/fs-writeback.c|   5 ++
>  fs/sync.c|   8 +-
>  include/linux/backing-dev-defs.h |   2 +
>  include/linux/blk-cgroup.h   |  23 ++
>  mm/backing-dev.c |   2 +
>  7 files changed, 178 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 2bed5725aa03..4305e78d1bb2 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
>  };
>  EXPORT_SYMBOL_GPL(io_cgrp_subsys);
>  
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +struct blkcg_wb_sleeper {
> + struct backing_dev_info *bdi;
> + struct blkcg *blkcg;
> + refcount_t refcnt;
> + struct list_head node;
> +};
> +
> +static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
> +static LIST_HEAD(blkcg_wb_sleeper_list);
> +
> +static struct blkcg_wb_sleeper *
> +blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
> +{
> + struct blkcg_wb_sleeper *bws;
> +
> + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> + if (bws->blkcg == blkcg && bws->bdi == bdi)
> + return bws;
> + return NULL;
> +}
> +
> +static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
> +{
> + list_add(&bws->node, &blkcg_wb_sleeper_list);
> +}
> +
> +static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
> +{
> + list_del_init(&bws->node);
> +}
> +
> +/**
> + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
> + * @blkcg: current blkcg cgroup
> + * @bdi: block device to check
> + *
> + * Return true if any other blkcg different than the current one is waiting 
> for
> + * writeback on the target block device, false otherwise.
> + */
> +bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info 
> *bdi)
> +{
> + struct blkcg_wb_sleeper *bws;
> + bool ret = false;
> +
> + spin_lock(&blkcg_wb_sleeper_lock);
> + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> + if (bws->bdi == bdi && bws->blkcg != blkcg) {
> + ret = true;
> + break;
> + }
> + spin_unlock(&blkcg_wb_sleeper_lock);
> +
> + return ret;
> +}

No global lock please, add something to the bdi I think?  Also have a fast path
of

if (list_empty(&blkcg_wb_sleeper_list))
   return false;

we don't need to be super accurate here.  Thanks,

Josef
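
A minimal sketch of that direction, assuming a hypothetical per-bdi list; wb_sleeper_list and wb_sleeper_lock are made-up field names, not existing backing_dev_info members:

bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
{
	struct blkcg_wb_sleeper *bws;
	bool ret = false;

	/* unlocked fast path: nobody is waiting on this bdi */
	if (list_empty(&bdi->wb_sleeper_list))
		return false;

	spin_lock(&bdi->wb_sleeper_lock);
	list_for_each_entry(bws, &bdi->wb_sleeper_list, node)
		if (bws->blkcg != blkcg) {
			ret = true;
			break;
		}
	spin_unlock(&bdi->wb_sleeper_lock);

	return ret;
}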


Re: [PATCH v2 3/3] blkcg: implement sync() isolation

2019-03-07 Thread Josef Bacik
On Thu, Mar 07, 2019 at 07:08:34PM +0100, Andrea Righi wrote:
> Keep track of the inodes that have been dirtied by each blkcg cgroup and
> make sure that a blkcg issuing a sync() can trigger the writeback + wait
> of only those pages that belong to the cgroup itself.
> 
> This behavior is applied only when io.sync_isolation is enabled in the
> cgroup, otherwise the old behavior is applied: sync() triggers the
> writeback of any dirty page.
> 
> Signed-off-by: Andrea Righi 
> ---
>  block/blk-cgroup.c | 47 ++
>  fs/fs-writeback.c  | 52 +++---
>  fs/inode.c |  1 +
>  include/linux/blk-cgroup.h | 22 
>  include/linux/fs.h |  4 +++
>  mm/page-writeback.c|  1 +
>  6 files changed, 124 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 4305e78d1bb2..7d3b26ba4575 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info 
> *bdi)
> + spin_unlock(&blkcg_wb_sleeper_lock);
>   rcu_read_unlock();
>  }
> +
> +/**
> + * blkcg_set_mapping_dirty - set owner of a dirty mapping
> + * @mapping: target address space
> + *
> + * Set the current blkcg as the owner of the address space @mapping (the 
> first
> + * blkcg that dirties @mapping becomes the owner).
> + */
> +void blkcg_set_mapping_dirty(struct address_space *mapping)
> +{
> + struct blkcg *curr_blkcg, *blkcg;
> +
> + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
> + mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> + return;
> +
> + rcu_read_lock();
> + curr_blkcg = blkcg_from_current();
> + blkcg = blkcg_from_mapping(mapping);
> + if (curr_blkcg != blkcg) {
> + if (blkcg)
> + css_put(&blkcg->css);
> + css_get(&curr_blkcg->css);
> + rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
> + }
> + rcu_read_unlock();
> +}
> +
> +/**
> + * blkcg_set_mapping_clean - clear the owner of a dirty mapping
> + * @mapping: target address space
> + *
> + * Unset the owner of @mapping when it becomes clean.
> + */
> +
> +void blkcg_set_mapping_clean(struct address_space *mapping)
> +{
> + struct blkcg *blkcg;
> +
> + rcu_read_lock();
> + blkcg = rcu_dereference(mapping->i_blkcg);
> + if (blkcg) {
> + css_put(&blkcg->css);
> + RCU_INIT_POINTER(mapping->i_blkcg, NULL);
> + }
> + rcu_read_unlock();
> +}
>  #endif
>  

Why do we need this?  We already have the inode_attach_wb(), which has the
blkcg_css embedded in it for whoever dirtied the inode first.  Can we not just
use that?  Thanks,

Josef


Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

2019-02-11 Thread Josef Bacik
On Mon, Feb 11, 2019 at 09:40:29PM +0100, Andrea Righi wrote:
> On Mon, Feb 11, 2019 at 10:39:34AM -0500, Josef Bacik wrote:
> > On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> > > This is an attempt to mitigate the priority inversion problem of a
> > > high-priority blkcg issuing a sync() and being forced to wait the
> > > completion of all the writeback I/O generated by any other low-priority
> > > blkcg, causing massive latencies to processes that shouldn't be
> > > I/O-throttled at all.
> > > 
> > > The idea is to save a list of blkcg's that are waiting for writeback:
> > > every time a sync() is executed the current blkcg is added to the list.
> > > 
> > > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > > different than the current blkcg, no throttling is applied (we can
> > > probably refine this logic later, i.e., a better policy could be to
> > > adjust the throttling I/O rate using the blkcg with the highest speed
> > > from the list of waiters - priority inheritance, kinda).
> > > 
> > > This topic has been discussed here:
> > > https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/
> > > 
> > > But we didn't come up with any definitive solution.
> > > 
> > > This patch is not a definitive solution either, but it's an attempt to
> > > continue addressing this issue and handling the priority inversion
> > > problem with sync() in a better way.
> > > 
> > > Signed-off-by: Andrea Righi 
> > 
> > Talked with Tejun about this some and we agreed the following is probably 
> > the
> > best way forward
> 
> First of all thanks for the update!
> 
> > 
> > 1) Track the submitter of the wb work to the writeback code.
> 
> Are we going to track the cgroup that originated the dirty pages (or
> maybe dirty inodes) or do you have any idea in particular?
> 

The guy doing the sync(), so that way we can accomplish #3.  But really this is
an implementation detail, however you want to accomplish it is fine by me.

> > 2) Sync() defaults to the root cg, and and it writes all the things as the 
> > root
> >cg.
> 
> OK.
> 
> > 3) Add a flag to the cgroups that would make sync()'ers in that group only 
> > be
> >allowed to write out things that belong to its group.
> 
> So, IIUC, when this flag is enabled a cgroup that is doing sync() would
> trigger the writeback of the pages that belong to that cgroup only and
> it waits only for these pages to be sync-ed, right? In this case
> writeback can still go at cgroup's speed.
> 
> Instead when the flag is disabled, sync() would trigger writeback I/O
> globally, as usual, and it goes at full speed (root cgroup's speed).
> 
> Am I understanding correctly?
> 

Yup that's exactly it.  Thanks,

Josef


Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

2019-02-11 Thread Josef Bacik
On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> This is an attempt to mitigate the priority inversion problem of a
> high-priority blkcg issuing a sync() and being forced to wait the
> completion of all the writeback I/O generated by any other low-priority
> blkcg, causing massive latencies to processes that shouldn't be
> I/O-throttled at all.
> 
> The idea is to save a list of blkcg's that are waiting for writeback:
> every time a sync() is executed the current blkcg is added to the list.
> 
> Then, when I/O is throttled, if there's a blkcg waiting for writeback
> different than the current blkcg, no throttling is applied (we can
> probably refine this logic later, i.e., a better policy could be to
> adjust the throttling I/O rate using the blkcg with the highest speed
> from the list of waiters - priority inheritance, kinda).
> 
> This topic has been discussed here:
> https://lwn.net/ml/cgroups/20190118103127.325-1-righi.and...@gmail.com/
> 
> But we didn't come up with any definitive solution.
> 
> This patch is not a definitive solution either, but it's an attempt to
> continue addressing this issue and handling the priority inversion
> problem with sync() in a better way.
> 
> Signed-off-by: Andrea Righi 

Talked with Tejun about this some and we agreed the following is probably the
best way forward

1) Track the submitter of the wb work to the writeback code.
2) Sync() defaults to the root cg, and it writes all the things as the root
   cg.
3) Add a flag to the cgroups that would make sync()'ers in that group only be
   allowed to write out things that belong to its group.

This way we avoid the priority inversion of having things like systemd or random
logged in user doing sync() and having to wait, and we keep low prio cgroups
from causing big IO storms by syncing out stuff and getting upgraded to root
priority just to avoid the inversion.

Obviously by default we want this flag to be off since its such a big change,
but people/setups really worried about this behavior (Facebook for instance
would likely use this flag) can go ahead and set it and be sure we're getting
good isolation and still avoiding the priority inversion associated with running
sync from a high priority context.  Thanks,

Josef
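
A purely conceptual sketch of how those three points could fit together; every identifier below is hypothetical, not an existing writeback or cgroup API:

struct wb_sync_work {
	struct cgroup_subsys_state *submitter;	/* (1) track who issued sync() */
};

static struct cgroup_subsys_state *wb_sync_issue_as(struct wb_sync_work *work)
{
	/* (3) opt-in flag: only write back the submitter's own dirty data */
	if (work->submitter && cgroup_sync_isolation_enabled(work->submitter))
		return work->submitter;

	/* (2) default: issue everything as the root cgroup */
	return NULL;	/* NULL meaning "root" in this sketch */
}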


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-29 Thread Josef Bacik
On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote:
> On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > > Hi Vivek,
> > > 
> > > sorry for the late reply.
> > > 
> > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > > 
> > > > [..]
> > > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > > if we want to provide sync() isolation among cgroups or not.
> > > > > 
> > > > > According to the manpage:
> > > > > 
> > > > >sync()  causes  all  pending  modifications  to filesystem 
> > > > > metadata and cached file data to be
> > > > >written to the underlying filesystems.
> > > > > 
> > > > > And:
> > > > >According to the standard specification (e.g., POSIX.1-2001), 
> > > > > sync() schedules the writes, but
> > > > >may  return  before  the actual writing is done.  However 
> > > > > Linux waits for I/O completions, and
> > > > >thus sync() or syncfs() provide the same guarantees as fsync 
> > > > > called on every file in the  sys‐
> > > > >tem or filesystem respectively.
> > > > > 
> > > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > > specific cgroup should wait for I/O completions only for the writes 
> > > > > that
> > > > > have been generated by that cgroup?
> > > > 
> > > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > > that solve problem, assuming its doable?
> > > 
> > > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > > 
> > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > > happening at the speed of the slow cgroup.
> > 
> > Hi Andrea,
> > 
> > But that's true only for I/O which has already been submitted to block
> > layer, right? Any new I/O yet to be submitted could still be attributed
> > to faster cgroup requesting sync.
> 
> Right. If we could bump up the new I/O yet to be submitted I think we
> could effectively prevent the priority inversion problem (the ongoing
> writeback I/O should be negligible).
> 
> > 
> > Until and unless cgroups limits are absurdly low, it should not take very
> > long for already submitted I/O to finish. If yes, then in practice, it
> > might not be a big problem?
> 
> I was actually doing my tests with a very low limit (1MB/s both for rbps
> and wbps), but this shows the problem very well I think.
> 
> Here's what I'm doing:
> 
>  [ slow cgroup (1Mbps read/write) ]
> 
>$ cat /sys/fs/cgroup/unified/cg1/io.max
>259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
>$ cat /proc/self/cgroup
>0::/cg1
> 
>$ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer 
> --time_based --runtime=30
> 
>  [ fast cgroup (root cgroup, no limitation) ]
> 
># cat /proc/self/cgroup
>0::/
> 
># time sync
>real   9m32,618s
>user   0m0,000s
>sys0m0,018s
> 
> With this simple test I can easily trigger hung task timeout warnings
> and make the whole system totally sluggish (even the processes running
> in the root cgroup).
> 
> When fio ends, writeback is still taking forever to complete, as you can
> see by the insane amount that sync takes to complete.
> 

Yeah sync() needs to be treated differently, but it's kind of special too.  We
don't want a slow group running sync() to back up a fast group doing sync(), because
we make all of the IO go based on the submitting cgroup.  The problem here is we don't know
who's more important until we get to the blk cgroup layer, and even then
sometimes we can't tell (different hierarchies would make this tricky with
io.weight or io.latency).

We could treat it like REQ_META and just let everything go through and back
charge.  This feels like a way for the slow group to cheat though, unless we
just throttle the shit out of him before returning to user space.  I'll have to
think about this some more.  Thanks,

Josef


Re: [PATCH 07/11] btrfs: move to fn pointers for get/put workspaces

2019-01-29 Thread Josef Bacik
On Tue, Jan 29, 2019 at 01:17:17PM -0500, Josef Bacik wrote:
> On Mon, Jan 28, 2019 at 04:24:33PM -0500, Dennis Zhou wrote:
> > The previous patch added generic helpers for get_workspace() and
> > put_workspace(). Now, we can migrate ownership of the workspace_manager
> > to be in the compression type code as the compression code itself
> > doesn't care beyond being able to get a workspace. The init/cleanup
> > and get/put methods are abstracted so each compression algorithm can
> > decide how they want to manage their workspaces.
> > 
> > Signed-off-by: Dennis Zhou 
> 
> We're doing this to have special handling for extra workspaces to be free'd at
> some point in the future if they are unused.  This is fine by me, but why not
> just add a shrinker and let it be handled by memory pressure?  Then we avoid 
> all
> this abstraction and allow for zstd to have its shrinker for its extra
> workspaces.  You can even use the list_lru stuff to make it super simple, then
> you don't have to worry about all the infrastructure.  Thanks,
> 

Nevermind, I missed that you also change the get side to lookup the workspace
for the compression level instead of cycling through the idle_ws list.  In that
case this is fine by me.

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 08/11] btrfs: plumb level through the compression interface

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:34PM -0500, Dennis Zhou wrote:
> Zlib compression supports multiple levels, but doesn't require changing
> in how a workspace itself is created and managed. Zstd introduces a
> different memory requirement such that higher levels of compression
> require more memory. This requires changes in how the alloc()/get()
> methods work for zstd. This patch plumbs compression level through the
> interface as a parameter in preparation for zstd compression levels.
> This gives the compression types opportunity to create/manage based on
> the compression level.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 07/11] btrfs: move to fn pointers for get/put workspaces

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:33PM -0500, Dennis Zhou wrote:
> The previous patch added generic helpers for get_workspace() and
> put_workspace(). Now, we can migrate ownership of the workspace_manager
> to be in the compression type code as the compression code itself
> doesn't care beyond being able to get a workspace. The init/cleanup
> and get/put methods are abstracted so each compression algorithm can
> decide how they want to manage their workspaces.
> 
> Signed-off-by: Dennis Zhou 

We're doing this to have special handling for extra workspaces to be free'd at
some point in the future if they are unused.  This is fine by me, but why not
just add a shrinker and let it be handled by memory pressure?  Then we avoid all
this abstraction and allow for zstd to have its shrinker for its extra
workspaces.  You can even use the list_lru stuff to make it super simple, then
you don't have to worry about all the infrastructure.  Thanks,

Josef
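
For reference, the shrinker-plus-list_lru approach suggested above could look
roughly like the sketch below.  It assumes an idle-workspace list_lru
(zstd_idle_ws) and a zstd_free_workspace() stand-in for the real free routine;
the callback signatures match kernels of this era, but this is illustrative,
not the actual btrfs code.

/* Sketch: reclaim idle zstd workspaces under memory pressure. */
static struct list_lru zstd_idle_ws;

static enum lru_status zstd_ws_isolate(struct list_head *item,
                                       struct list_lru_one *list,
                                       spinlock_t *lock, void *cb_arg)
{
        struct list_head *dispose = cb_arg;

        /* move the workspace to a private list; free it outside the lock */
        list_lru_isolate_move(list, item, dispose);
        return LRU_REMOVED;
}

static unsigned long zstd_ws_count(struct shrinker *s, struct shrink_control *sc)
{
        return list_lru_shrink_count(&zstd_idle_ws, sc);
}

static unsigned long zstd_ws_scan(struct shrinker *s, struct shrink_control *sc)
{
        LIST_HEAD(dispose);
        unsigned long freed;

        freed = list_lru_shrink_walk(&zstd_idle_ws, sc, zstd_ws_isolate, &dispose);
        while (!list_empty(&dispose)) {
                struct list_head *ws = dispose.next;

                list_del(ws);
                zstd_free_workspace(ws);        /* hypothetical free routine */
        }
        return freed;
}

static struct shrinker zstd_ws_shrinker = {
        .count_objects  = zstd_ws_count,
        .scan_objects   = zstd_ws_scan,
        .seeks          = DEFAULT_SEEKS,
};

/* Registration would pair list_lru_init(&zstd_idle_ws) with
 * register_shrinker(&zstd_ws_shrinker) at init time. */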


Re: [PATCH 06/11] btrfs: add compression interface in (get/put)_workspace()

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:32PM -0500, Dennis Zhou wrote:
> There are two levels of workspace management. First, alloc()/free()
> which are responsible for actually creating and destroying workspaces.
> Second, at a higher level, get()/put() which is the compression code
> asking for a workspace from a workspace_manager.
> 
> The compression code shouldn't really care how it gets a workspace, but
> that it got a workspace. This adds get_workspace() and put_workspace()
> to be the higher level interface which is responsible for indexing into
> the appropriate compression type. It also introduces
> btrfs_put_workspace() and btrfs_get_workspace() to be the generic
> implementations of the higher interface.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef
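
For context, the two-level interface described in the patch looks roughly like
the following; the member names mirror the patch descriptions in this series
and are not necessarily the final upstream structure:

/* Sketch of the per-type compression ops after this series. */
struct btrfs_compress_op {
        /* lower level: actually create / destroy a workspace */
        struct list_head *(*alloc_workspace)(unsigned int level);
        void (*free_workspace)(struct list_head *ws);

        /* higher level: hand a workspace out of / back to the manager */
        struct list_head *(*get_workspace)(unsigned int level);
        void (*put_workspace)(struct list_head *ws);

        /* per-type workspace_manager setup and teardown */
        void (*init_workspace_manager)(void);
        void (*cleanup_workspace_manager)(void);
};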


Re: [PATCH 05/11] btrfs: add helper methods for workspace manager init and cleanup

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:31PM -0500, Dennis Zhou wrote:
> Workspace manager init and cleanup code is open coded inside a for loop
> over the compression types. This forces each compression type to rely on
> the same workspace manager implementation. This patch creates helper
> methods that will be the generic implementation for btrfs workspace
> management.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 04/11] btrfs: unify compression ops with workspace_manager

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:30PM -0500, Dennis Zhou wrote:
> Make the workspace_manager own the interface operations rather than
> managing index-paired arrays for the workspace_manager and compression
> operations.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 03/11] btrfs: manage heuristic workspace as index 0

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:29PM -0500, Dennis Zhou wrote:
> While the heuristic workspaces aren't really compression workspaces,
> they use the same interface for managing them. So rather than branching,
> let's just handle them once again as the index 0 compression type.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 02/11] btrfs: rename workspaces_list to workspace_manager

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:28PM -0500, Dennis Zhou wrote:
> This is in preparation for zstd compression levels. As each level will
> require different sized workspaces, workspaces_list is no longer a
> really fitting name.
> 
> Signed-off-by: Dennis Zhou 
> ---

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 01/11] btrfs: add macros for compression type and level

2019-01-29 Thread Josef Bacik
On Mon, Jan 28, 2019 at 04:24:27PM -0500, Dennis Zhou wrote:
> It is very easy to miss places that rely on a certain bitshifting for
> deciphering the type_level overloading. Make macros handle this instead.
> 
> Signed-off-by: Dennis Zhou 

Reviewed-by: Josef Bacik 

Thanks,

Josef
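
Illustrative only: the kind of helpers the description implies, packing the
compression type and level into one value.  The exact field widths and names
here are assumptions, not the patch itself:

/* Assumed layout: low bits hold the type, the remaining bits hold the level. */
#define BTRFS_COMPRESS_TYPE_BITS        4
#define BTRFS_COMPRESS_TYPE_MASK        ((1U << BTRFS_COMPRESS_TYPE_BITS) - 1)

static inline unsigned int btrfs_compress_type(unsigned int type_level)
{
        return type_level & BTRFS_COMPRESS_TYPE_MASK;
}

static inline unsigned int btrfs_compress_level(unsigned int type_level)
{
        return type_level >> BTRFS_COMPRESS_TYPE_BITS;
}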


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Josef Bacik
On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > This is a redesign of my old cgroup-io-throttle controller:
> > > https://lwn.net/Articles/330531/
> > > 
> > > I'm resuming this old patch to point out a problem that I think is still
> > > not solved completely.
> > > 
> > > = Problem =
> > > 
> > > The io.max controller works really well at limiting synchronous I/O
> > > (READs), but a lot of I/O requests are initiated outside the context of
> > > the process that is ultimately responsible for its creation (e.g.,
> > > WRITEs).
> > > 
> > > Throttling at the block layer in some cases is too late and we may end
> > > up slowing down processes that are not responsible for the I/O that
> > > is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> 
> Right, writeback is per-cgroup and slowing down writeback affects only
> that specific cgroup, but there are cases where other processes from
> other cgroups may need to wait on that writeback to complete before
> doing I/O (for example an fsync() to a file shared among different
> cgroups). In this case we may end up blocking cgroups that shouldn't be
> blocked, that looks like a priority-inversion problem. This is the
> problem that I'm trying to address.

Well this case is a misconfiguration, you shouldn't be sharing files between
cgroups.  But even if you are, fsync() is synchronous, we should be getting the
context from the process itself and thus should have its own rules applied.
There's nothing we can do for outstanding IO, but that shouldn't be that much.
That would need to be dealt with on a per-controller basis.

> 
> > 
> > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > clearly tie IO to the thing generating the IO, such as readahead and such.  If
> > you are running into this case that may be something worth using.  Course it
> > only works for io.latency now but there's no reason you can't add support to it
> > for io.max or whatever.
> 
> IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> memcg), something like this: if the cgroup is already congested don't
> generate extra I/O due to readahead. Am I right?

Yeah, but that's just how it's currently used, it can be used any which way we
feel like.

> 
> > 
> > > 
> > > = Proposed solution =
> > > 
> > > The main idea of this controller is to split I/O measurement and I/O
> > > throttling: I/O is measured at the block layer for READS, at page cache
> > > (dirty pages) for WRITEs, and processes are limited while they're
> > > generating I/O at the VFS level, based on the measured I/O.
> > > 
> > 
> > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > looking into that route and simply changing the existing io controller you are
> > using to take advantage of that so it will actually throttle things.  Then just
> > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Absolutely, I can probably use blk_cgroup_congested() as a method to
> determine when a cgroup should be throttled (instead of doing my own
> I/O measuring), but to prevent the "slow writeback slowing down other
> cgroups" issue I still need to apply throttling when pages are dirtied
> in page cache.

Again, this is just a fuckup from a configuration standpoint.  The argument
could be made that sync() is probably broken here, but I think the right
solution is to just pass the cgroup context along with the writeback
information and use that if it's set instead.  Thanks,

Josef
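
As a rough illustration of "pass the cgroup context along with the writeback
information": writeback bios can already be tagged with the cgroup that owns
the dirtied inode via the wbc helpers, and a throttling decision could then
consult that association rather than the submitter's cgroup.  wbc_init_bio()
is the real helper; the wrapper around it is made up:

/* Sketch: tag a writeback bio with the owning cgroup before submission. */
static void submit_owned_wb_bio(struct writeback_control *wbc, struct bio *bio)
{
        wbc_init_bio(wbc, bio);         /* associate bio with wbc's cgroup */
        submit_bio(bio);
}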


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Josef Bacik
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > On 18 Jan 2019, at 17:35, Josef Bacik  wrote:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 

In that case the io controller isn't doing its job and needs to be fixed (or
reconfigured).  io.latency guards against this, I assume io.max would keep this
from happening if it were configured properly.

> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 

I _think_ Andrea is talking about the fact that we can generate IO indirectly
and never get throttled for it, which is what blk_cgroup_congested() is meant to
address.  I added it specifically because some low prio task was just allocating
all of the memory on the system and causing a lot of pressure because of
swapping, but there was no direct feedback loop there.  blk_cgroup_congested()
provides that feedback loop.

Course I could be wrong too and we're all just talking past each other ;).
Thanks,

Josef


Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

2019-01-18 Thread Josef Bacik
On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
> 
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
> 
> = Problem =
> 
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
> 
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.

How so?  The writeback threads are per-cgroup and have the cgroup stuff set
properly.  So if you dirty a bunch of pages, they are associated with your
cgroup, and then writeback happens and it's done in the writeback thread
associated with your cgroup and then that is throttled.  Then you are throttled
at balance_dirty_pages() because the writeout is taking longer.

I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
clearly tie IO to the thing generating the IO, such as readahead and such.  If
you are running into this case that may be something worth using.  Course it
only works for io.latency now but there's no reason you can't add support to it
for io.max or whatever.

> 
> = Proposed solution =
> 
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
> 

This is what blk_cgroup_congested() is meant to accomplish, I would suggest
looking into that route and simply changing the existing io controller you are
using to take advantage of that so it will actually throttle things.  Then just
sprinkle it around the areas where we indirectly generate IO.  Thanks,

Josef
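
The usage pattern described above looks roughly like the sketch below: before
issuing IO that is only loosely tied to the task generating it (readahead
here), check whether the current task's blkcg is already congested and skip
the optional work.  blk_cgroup_congested() and page_cache_sync_readahead() are
real interfaces; the wrapper function is a simplified stand-in, not code from
mm/.

/* Sketch: skip speculative readahead when the task's cgroup is congested. */
static void maybe_readahead(struct address_space *mapping,
                            struct file_ra_state *ra, struct file *file,
                            pgoff_t index, unsigned long req_count)
{
        /* already over the IO budget: don't generate extra, optional IO */
        if (blk_cgroup_congested())
                return;

        page_cache_sync_readahead(mapping, ra, file, index, req_count);
}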


Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction

2019-01-09 Thread Josef Bacik
On Wed, Jan 09, 2019 at 07:08:09PM +0300, Kirill Tkhai wrote:
> Hi, Josef,
> 
> On 09.01.2019 18:49, Josef Bacik wrote:
> > On Wed, Jan 09, 2019 at 03:20:18PM +0300, Kirill Tkhai wrote:
> >> On nodes without memory overcommit, it's a common situation
> >> when memcg exceeds its limit and pages from pagecache are
> >> shrunk on reclaim, while the node has a lot of free memory.
> >> Further access to the pages requires real device IO, while
> >> IO causes time delays, worse power usage, worse throughput
> >> for other users of the device, etc.
> >>
> >> Cleancache is not a good solution for this problem, since
> >> it implies copying of page on every cleancache_put_page()
> >> and cleancache_get_page(). Also, it requires introduction
> >> of internal per-cleancache_ops data structures to manage
> >> cached pages and their inodes relationships, which again
> >> introduces overhead.
> >>
> >> This patchset introduces another solution. It introduces
> >> a new scheme for evicting memcg pages:
> >>
> >>   1)__remove_mapping() uncharges unmapped page memcg
> >> and leaves page in pagecache on memcg reclaim;
> >>
> >>   2)putback_lru_page() places page into root_mem_cgroup
> >> list, since its memcg is NULL. Page may be evicted
> >> on global reclaim (and this will be easily, as
> >> page is not mapped, so shrinker will shrink it
> >> with 100% probability of success);
> >>
> >>   3)pagecache_get_page() charges page into memcg of
> >> a task, which takes it first.
> >>
> >> Below is small test, which shows profit of the patchset.
> >>
> >> Create memcg with limit 20M (exact value does not matter much):
> >>   $ mkdir /sys/fs/cgroup/memory/ct
> >>   $ echo 20M > /sys/fs/cgroup/memory/ct/memory.limit_in_bytes
> >>   $ echo $$ > /sys/fs/cgroup/memory/ct/tasks
> >>
> >> Then twice read 1GB file:
> >>   $ time cat file_1gb > /dev/null
> >>
> >> Before (2 iterations):
> >>   1)0.01user 0.82system 0:11.16elapsed 7%CPU
> >>   2)0.01user 0.91system 0:11.16elapsed 8%CPU
> >>
> >> After (2 iterations):
> >>   1)0.01user 0.57system 0:11.31elapsed 5%CPU
> >>   2)0.00user 0.28system 0:00.28elapsed 100%CPU
> >>
> >> With the patch set applied, file pages are cached
> >> during the second read, so the result is 39 times faster.
> >>
> >> This may be useful for slow disks, NFS, nodes without
> >> overcommit by memory, in case of two memcg access the same
> >> files, etc.
> >>
> > 
> > This isn't going to work for us (Facebook).  The whole reason the hard limit
> > exists is to keep different groups from messing up other groups.  Page cache
> > reclaim is not free, most of our pain and most of the reason we use cgroups
> > is to limit the effect of flooding the machine with pagecache from different
> > groups.
> 
> I understand the problem.
> 
> > Memory leaks happen few and far between, but chef doing a yum
> > update in the system container happens regularly.  If you talk about suddenly
> > orphaning these pages to the root container it still creates pressure on the
> > main workload, pressure that results in it having to take time from what it's
> > doing and free up memory instead.
> 
> Could you please clarify the additional pressure that the patchset introduces?
> The number of actions needed to evict a pagecache page remains almost
> the same: we just delay __delete_from_page_cache() to global reclaim. Global
> reclaim should not introduce much pressure, since it's an iteration over a single
> memcg (we should not dive into the hell of child memcgs, since root memcg reclaim
> should be successful and free enough pages, shouldn't we?).

If we go into global reclaim at all.  If we're unable to allocate a page as the
most important cgroup we start shrinking ourselves first right?  And then
eventually end up in global reclaim, right?  So it may be easily enough
reclaimed, but we're going to waste a lot of time getting there in the meantime,
which means latency that's hard to pin down.

And secondly this allows hard limited cgroups to essentially leak pagecache into
the whole system, creating waaay more memory pressure than what I think you
intend.  Your logic is that we'll exceed our limit, evict some pagecache to the
root cgroup, and we avoid a OOM and everything is ok.  However what will really
happen is some user is going to do dd if=/dev/zero

Re: [PATCH RFC 0/3] mm: Reduce IO by improving algorithm of memcg pagecache pages eviction

2019-01-09 Thread Josef Bacik
On Wed, Jan 09, 2019 at 03:20:18PM +0300, Kirill Tkhai wrote:
> On nodes without memory overcommit, it's a common situation
> when memcg exceeds its limit and pages from pagecache are
> shrunk on reclaim, while the node has a lot of free memory.
> Further access to the pages requires real device IO, while
> IO causes time delays, worse power usage, worse throughput
> for other users of the device, etc.
> 
> Cleancache is not a good solution for this problem, since
> it implies copying of page on every cleancache_put_page()
> and cleancache_get_page(). Also, it requires introduction
> of internal per-cleancache_ops data structures to manage
> cached pages and their inodes relationships, which again
> introduces overhead.
> 
> This patchset introduces another solution. It introduces
> a new scheme for evicting memcg pages:
> 
>   1)__remove_mapping() uncharges unmapped page memcg
> and leaves page in pagecache on memcg reclaim;
> 
>   2)putback_lru_page() places page into root_mem_cgroup
> list, since its memcg is NULL. Page may be evicted
> on global reclaim (and this will be easily, as
> page is not mapped, so shrinker will shrink it
> with 100% probability of success);
> 
>   3)pagecache_get_page() charges page into memcg of
> a task, which takes it first.
> 
> Below is small test, which shows profit of the patchset.
> 
> Create memcg with limit 20M (exact value does not matter much):
>   $ mkdir /sys/fs/cgroup/memory/ct
>   $ echo 20M > /sys/fs/cgroup/memory/ct/memory.limit_in_bytes
>   $ echo $$ > /sys/fs/cgroup/memory/ct/tasks
> 
> Then twice read 1GB file:
>   $ time cat file_1gb > /dev/null
> 
> Before (2 iterations):
>   1)0.01user 0.82system 0:11.16elapsed 7%CPU
>   2)0.01user 0.91system 0:11.16elapsed 8%CPU
> 
> After (2 iterations):
>   1)0.01user 0.57system 0:11.31elapsed 5%CPU
>   2)0.00user 0.28system 0:00.28elapsed 100%CPU
> 
> With the patch set applied, file pages are cached
> during the second read, so the result is 39 times faster.
> 
> This may be useful for slow disks, NFS, nodes without
> overcommit by memory, in case of two memcg access the same
> files, etc.
> 

This isn't going to work for us (Facebook).  The whole reason the hard limit
exists is to keep different groups from messing up other groups.  Page cache
reclaim is not free, most of our pain and most of the reason we use cgroups
is to limit the effect of flooding the machine with pagecache from different
groups.  Memory leaks happen few and far between, but chef doing a yum
update in the system container happens regularly.  If you talk about suddenly
orphaning these pages to the root container it still creates pressure on the
main workload, pressure that results in it having to take time from what it's
doing and free up memory instead.  Thanks,

Josef

