Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads

2021-04-14 Thread Stefan Metzmacher
_ptr=0xffc4fe38, 
data=0xffc4fdb0) at queue.c:122
#3  0x56595e9b in __io_uring_get_cqe (ring=0xffc4feb0, cqe_ptr=0xffc4fe38, 
submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#4  0x565945d8 in io_uring_wait_cqe_nr (ring=0xffc4feb0, cqe_ptr=0xffc4fe38, 
wait_nr=1) at ../src/include/liburing.h:635
#5  0x56594634 in io_uring_wait_cqe (ring=0xffc4feb0, cqe_ptr=0xffc4fe38) at 
../src/include/liburing.h:655
#6  0x56594dc5 in copy_file (ring=0xffc4feb0, _insize=24) at io_uring-cp.c:232
#7  0x56594ff1 in main (argc=3, argv=0xffc50024) at io_uring-cp.c:278
[Inferior 1 (process 8821) detached]

With -m32:

gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 8831
[New LWP 8832]
[New LWP 8833]
0xf7f16549 in __kernel_vsyscall ()

Thread 3 (LWP 8833):
#0  0x in ?? ()

Thread 2 (LWP 8832):
#0  0x in ?? ()

Thread 1 (LWP 8831):
#0  0xf7f16549 in __kernel_vsyscall ()
#1  0xf7e14efb in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0x5657d297 in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, 
flags=1, sig=0x0, sz=8) at syscall.c:54
#3  0x5657cb32 in _io_uring_get_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c, 
data=0xff955698) at queue.c:122
#4  0x5657cbf5 in __io_uring_get_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c, 
submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#5  0x5657b5f2 in io_uring_wait_cqe_nr (ring=0xff9557b0, cqe_ptr=0xff95573c, 
wait_nr=1) at ../src/include/liburing.h:635
#6  0x5657b63f in io_uring_wait_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c) at 
../src/include/liburing.h:655
#7  0x5657bcbe in copy_file (ring=0xff9557b0, _insize=24) at io_uring-cp.c:232
#8  0x5657becd in main (argc=3, argv=0xff9558f4) at io_uring-cp.c:278
[Inferior 1 (process 8831) detached]

So the gdb autodetects 'i386/-m32' and warns in all other cases (including the 
default of -m64)

As a debugged this deeply now, I know (at least printing a backtrace works 
anyway),
but for any random advanced admin or userspace developer warnings like this:

  warning: Selected architecture i386:x86-64 is not compatible with reported 
target architecture i386

  warning: Architecture rejected target-supplied description

typically indicate that something in the system is deeply broken.

With the version proposed by Linus:

+   childregs->cs = __USER_CS;
+   childregs->ss = __USER_DS;
+#ifdef CONFIG_X86_32
+   childregs->ds = __USER_DS;
+   childregs->es = __USER_DS;
+#endif

-m64 and -mx32 are fine, but i386/-m32 generates this:

# gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 984
[New LWP 985]
[New LWP 986]
[New LWP 987]

warning: Selected architecture i386 is not compatible with reported target 
architecture i386:x64-32

warning: Architecture rejected target-supplied description
0xf7f58549 in __kernel_vsyscall ()

Thread 4 (LWP 987):
#0  0x in ?? ()

Thread 3 (LWP 986):
#0  0x in ?? ()

Thread 2 (LWP 985):
#0  0x in ?? ()

Thread 1 (LWP 984):
#0  0xf7f58549 in __kernel_vsyscall ()
#1  0xf7e56efb in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0x5656c297 in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, 
flags=1, sig=0x0, sz=8) at syscall.c:54
#3  0x5656bb32 in _io_uring_get_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, 
data=0xffc42bc8) at queue.c:122
#4  0x5656bbf5 in __io_uring_get_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, 
submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#5  0x5656a5f2 in io_uring_wait_cqe_nr (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, 
wait_nr=1) at ../src/include/liburing.h:635
#6  0x5656a63f in io_uring_wait_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c) at 
../src/include/liburing.h:655
#7  0x5656acbe in copy_file (ring=0xffc42ce0, _insize=24) at io_uring-cp.c:232
#8  0x5656aecd in main (argc=3, argv=0xffc42e24) at io_uring-cp.c:278
[Inferior 1 (process 984) detached]


With my patch all 3 are fine.

It also feels more natural to me to just keep the values of
the copy_thread() caller.

Do you plan to do something about this before 5.12 final?

metze

Am 11.04.21 um 17:27 schrieb Stefan Metzmacher:
> This allows gdb attach to userspace processes using io-uring,
> which means that they have io_threads (PF_IO_WORKER), which appear
> just like normal as userspace threads.
> 
> See the code comment for more details.
> 
> Fixes: 4727dc20e04 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
> Signed-off-by: Stefan Metzmacher 
> cc: Linus Torvalds 
> cc: Jens Axboe 
> cc: linux-kernel@vger.kernel.org
> cc: io-ur...@vger.kernel.org
> ---
>  arch/x86/kernel/process.c | 49 +++
>  1 file changed, 49 insertions(+)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9c214d7085a4..72120c4b7618 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,55 @@ int

[PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads

2021-04-11 Thread Stefan Metzmacher
This allows gdb attach to userspace processes using io-uring,
which means that they have io_threads (PF_IO_WORKER), which appear
just like normal as userspace threads.

See the code comment for more details.

Fixes: 4727dc20e04 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
Signed-off-by: Stefan Metzmacher 
cc: Linus Torvalds 
cc: Jens Axboe 
cc: linux-kernel@vger.kernel.org
cc: io-ur...@vger.kernel.org
---
 arch/x86/kernel/process.c | 49 +++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c214d7085a4..72120c4b7618 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,55 @@ int copy_thread(unsigned long clone_flags, unsigned long 
sp, unsigned long arg,
/* Kernel thread ? */
if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
+   /*
+* gdb sees all userspace threads,
+* including io threads (PF_IO_WORKER)!
+*
+* gdb uses:
+* PTRACE_PEEKUSR, offsetof (struct user_regs_struct, cs)
+*  returning with 0x33 (51) to detect 64 bit
+* and:
+* PTRACE_PEEKUSR, offsetof (struct user_regs_struct, ds)
+*  returning 0x2b (43) to detect 32 bit.
+*
+* GDB relies on that the kernel returns the
+* same values for all threads, which means
+* we don't zero these out.
+*
+* Note that CONFIG_X86_64 handles 'es' and 'ds'
+* differently, see the following above:
+*   savesegment(es, p->thread.es);
+*   savesegment(ds, p->thread.ds);
+* and the CONFIG_X86_64 version of get_segment_reg().
+*
+* Linus proposed something like this:
+* 
(https://lore.kernel.org/io-uring/CAHk-=whEObPkZBe4766DmR46-=5qtuiatwbsoad468etgyc...@mail.gmail.com/)
+*
+*   childregs->cs = __USER_CS;
+*   childregs->ss = __USER_DS;
+*   childregs->ds = __USER_DS;
+*   childregs->es = __USER_DS;
+*
+* might make sense (just do it unconditionally, rather than 
making it
+* special to PF_IO_WORKER).
+*
+* But that doesn't make gdb happy in all cases.
+*
+* While 32bit userspace on a 64bit kernel is legacy,
+* it's still useful to allow 32bit libraries or nss modules
+* use the same code as the 64bit version of that library, which
+* can use io-uring just fine.
+*
+* So we better just inherit the values from
+* the originating process instead of hardcoding
+* values, which would imply 64bit userspace.
+*/
+   childregs->cs = current_pt_regs()->cs;
+   childregs->ss = current_pt_regs()->ss;
+#ifdef CONFIG_X86_32
+   childregs->ds = current_pt_regs()->ds;
+   childregs->es = current_pt_regs()->es;
+#endif
kthread_frame_init(frame, sp, arg);
return 0;
}
-- 
2.25.1



Re: [PATCH 0/6] Allow signals for IO threads

2021-04-02 Thread Stefan Metzmacher
Am 01.04.21 um 18:24 schrieb Linus Torvalds:
> On Thu, Apr 1, 2021 at 9:00 AM Stefan Metzmacher  wrote:
>>
>> I haven't tried it, but it seems gdb tries to use PTRACE_PEEKUSR
>> against the last thread tid listed under /proc//tasks/ in order to
>> get the architecture for the userspace application
> 
> Christ, what an odd hack. Why wouldn't it just do it on the initial
> thread you actually attached to?
> 
> Are you sure it's not simply because your test-case was to attach to
> the io_uring thread? Because the io_uring thread might as well be
> considered to be 64-bit.

  │   └─io_uring-cp,1396 Makefile file
  │   ├─{iou-mgr-1396},1397
  │   ├─{iou-wrk-1396},1398
  │   └─{iou-wrk-1396},1399

strace -ttT -o strace-uring-fail.txt gdb --pid 1396
(note strace -f would deadlock gdb with SIGSTOP)

The full file can be found here:
https://www.samba.org/~metze/strace-uring-fail.txt
(I guess there was a race and the workers 1398 and 1399 exited in between,
that's it using 1397):

18:46:35.429498 ptrace(PTRACE_PEEKUSER, 1397, 8*CS, [NULL]) = 0 <0.22>

>> so my naive assumption
>> would be that it wouldn't allow the detection of a 32-bit application
>> using a 64-bit kernel.
> 
> I'm not entirely convinced we want to care about a confused gdb
> implementation and somebody debugging a case that I don't believe
> happens in practice.
> 
> 32-bit user space is legacy. And legacy isn't io_uring. If somebody
> insists on doing odd things, they can live with the odd results.

Ok, I'd agree for 32-bit applications, but what about libraries?
E.g. distributions ship libraries like libsmbclient or nss modules
as 64-bit and 32-bit version, in order to support legacy applications
to run. Why shouldn't the 32-bit library builds not support io_uring?
(Note libsmbclient don't use io_uring yet, but I guess it will be in future).

Any ideas regarding similar problems for other architectures?

metze




Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-04-01 Thread Stefan Metzmacher
Hi Jens,

>> I know you brought this one up as part of your series, not sure I get
>> why you want it owned by root and read-only? cmdline and exe, yeah those
>> could be hidden, but is there really any point?
>>
>> Maybe I'm missing something here, if so, do clue me in!
> 
> I looked through /proc and I think it's mostly similar to
> the unshare() case, if userspace wants to do stupid things
> like changing "comm" of iothreads, it gets what was asked for.
> 
> But the "cmdline" hiding would be very useful.
> 
> While most tools use "comm", by default.
> 
> ps -eLf or 'iotop' use "cmdline".
> 
> Some processes use setproctitle to change "cmdline" in order
> to identify the process better, without the 15 chars comm restriction,
> that's why I very often press 'c' in 'top' to see the cmdline,
> in that case it would be very helpful to see '[iou-wrk-1234]'
> instead of the seeing the cmdline.
> 
> So I'd very much prefer if this could be applied:
> https://lore.kernel.org/io-uring/d4487f959c778d0b1d4c5738b75bcff17d21df5b.1616197787.git.me...@samba.org/T/#u
> 
> If you want I can add a comment and a more verbose commit message...

I noticed that 'iotop' actually appends ' [iou-wrk-1234]' to the cmdline value,
so that leaves us with 'ps -eLf' and 'top' (with 'c').

pstree -a -t -p is also fine:
  │   └─io_uring-cp,1315 
/root/kernel/linux-image-5.12.0-rc2+-dbg_5.12.0-rc2+-5_amd64.deb file
  │   ├─{iou-mgr-1315},1316
  │   ├─{iou-wrk-1315},1317
  │   ├─{iou-wrk-1315},1318
  │   ├─{iou-wrk-1315},1319
  │   ├─{iou-wrk-1315},1320


In the spirit of "avoid special PF_IO_WORKER checks" I guess it's ok
to leave of as is...

metze


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-04-01 Thread Stefan Metzmacher
Hi Jens,

>>> I don't assume signals wanted by userspace should potentially handled in an 
>>> io_thread...
>>> e.g. things set with fcntl(fd, F_SETSIG,) used together with F_SETLEASE?
>>
>> I guess we do actually need it, if we're not fiddling with
>> wants_signal() for them. To quell Oleg's concerns, we can just move it
>> to post dup_task_struct(), that should eliminate any race concerns
>> there.
> 
> If that one is racy, don' we better also want this one?
> https://lore.kernel.org/io-uring/438b738c1e4827a7fdfe43087da88bbe17eedc72.1616197787.git.me...@samba.org/T/#u
> 
> And clear tsk->pf_io_worker ?

As the workers don't clone other workers I guess it's fine to defer this to 
5.13.

metze



Re: [PATCH] Document that PF_KTHREAD _is_ ABI

2021-04-01 Thread Stefan Metzmacher


Am 31.03.21 um 21:23 schrieb Alexey Dobriyan:
> On Mon, Mar 22, 2021 at 07:53:10AM +, Christoph Hellwig wrote:
>> On Sat, Mar 20, 2021 at 10:23:12AM -0700, Andy Lutomirski wrote:
 https://github.com/systemd/systemd/blob/main/src/basic/process-util.c#L354
 src/basic/process-util.c:is_kernel_thread()
>>>
>>> Eww.
>>>
>>> Could we fix it differently and more permanently by modifying the proc
>>> code to display the values systemd expects?
>>
>> Yes, do_task_stat needs a mapping from kernel flags to UABI flags.  And
>> we should already discard everything we think we can from the UABI
>> now, and only add the ones back that are required to not break
>> userspace.
> 
> Sure we do. Who is going to find all the flags? I found PF_KTHREAD. :^)
> 
> More seriously,
> 
> /proc/$pid/stat was expanded to include tsk->flags in 0.99.1 (!!!)
> 
> Developers kept adding and shuffling flags probably not even realising
> what's going on. The last incident happened at 5.10 when PF_IO_WORKER
> was exchanged with PF_VCPU for smaller codegen.

With the create_io_thread(), the impact of PF_IO_WORKER becomes more broadly
visible and userspace might start to look at it in order to find the difference
between userspace and kernel io threads. (I also think it should actually be 
renamed to
PF_IO_THREAD...)

Jens, what do you think about that?

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Stefan Metzmacher


Am 01.04.21 um 17:39 schrieb Linus Torvalds:
> On Thu, Apr 1, 2021 at 7:58 AM Stefan Metzmacher  wrote:
>>
>>>
>>> Ok, the following makes gdb happy again:
>>>
>>> --- a/arch/x86/kernel/process.c
>>> +++ b/arch/x86/kernel/process.c
>>> @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned 
>>> long sp, unsigned long arg,
>>> /* Kernel thread ? */
>>> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>> memset(childregs, 0, sizeof(struct pt_regs));
>>> +   if (p->flags & PF_IO_WORKER)
>>> +   childregs->cs = current_pt_regs()->cs;
>>> kthread_frame_init(frame, sp, arg);
>>> return 0;
>>> }
>>
>> Would it be possible to fix this remaining problem before 5.12 final?
> 
> Please not that way.
> 
> But doing something like
> 
> childregs->cs = __USER_CS;
> childregs->ss = __USER_DS;
> childregs->ds = __USER_DS;
> childregs->es = __USER_DS;
> 
> might make sense (just do it unconditionally, rather than making it
> special to PF_IO_WORKER).
> 
> Does that make gdb happy too?

I haven't tried it, but it seems gdb tries to use PTRACE_PEEKUSR
against the last thread tid listed under /proc//tasks/ in order to
get the architecture for the userspace application, so my naive assumption
would be that it wouldn't allow the detection of a 32-bit application
using a 64-bit kernel.

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Stefan Metzmacher
Hi Jens,

>> For help, type "help".
>> Type "apropos word" to search for commands related to "word".
>> Attaching to process 1320
>> [New LWP 1321]
>> [New LWP 1322]
>>
>> warning: Selected architecture i386:x86-64 is not compatible with reported 
>> target architecture i386
>>
>> warning: Architecture rejected target-supplied description
>> syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> 38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or 
>> directory.
>> (gdb)
> 
> Ok, the following makes gdb happy again:
> 
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long 
> sp, unsigned long arg,
> /* Kernel thread ? */
> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> memset(childregs, 0, sizeof(struct pt_regs));
> +   if (p->flags & PF_IO_WORKER)
> +   childregs->cs = current_pt_regs()->cs;
> kthread_frame_init(frame, sp, arg);
> return 0;
> }
> 
> I'm wondering if we should decouple the PF_KTHREAD and PF_IO_WORKER cases 
> even more
> and keep as much of a userspace-like copy_thread as possible.

Would it be possible to fix this remaining problem before 5.12 final?
(I don't think my change would be the correct fix actually
and other architectures may have similar problems).

Thanks!
metze





Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher


Hi Jens,

> root@ub1704-166:~# LANG=C gdb --pid 1320
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word".
> Attaching to process 1320
> [New LWP 1321]
> [New LWP 1322]
> 
> warning: Selected architecture i386:x86-64 is not compatible with reported 
> target architecture i386
> 
> warning: Architecture rejected target-supplied description
> syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or 
> directory.
> (gdb)

Ok, the following makes gdb happy again:

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long 
sp, unsigned long arg,
/* Kernel thread ? */
if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
+   if (p->flags & PF_IO_WORKER)
+   childregs->cs = current_pt_regs()->cs;
kthread_frame_init(frame, sp, arg);
return 0;
}

I'm wondering if we should decouple the PF_KTHREAD and PF_IO_WORKER cases even 
more
and keep as much of a userspace-like copy_thread as possible.

metze


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:29 schrieb Jens Axboe:
> On 3/26/21 9:23 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 16:01 schrieb Jens Axboe:
>>> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>>>> Jens, sorry, I got lost :/
>>>
>>> Let's bring you back in :-)
>>>
>>>> On 03/25, Jens Axboe wrote:
>>>>>
>>>>> With IO threads accepting signals, including SIGSTOP,
>>>>
>>>> where can I find this change? Looks like I wasn't cc'ed...
>>>
>>> It's this very series.
>>>
>>>>> unmask the
>>>>> SIGSTOP signal from the default blocked mask.
>>>>>
>>>>> Signed-off-by: Jens Axboe 
>>>>> ---
>>>>>  kernel/fork.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index d3171e8e88e5..d5a40552910f 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void 
>>>>> *), void *arg, int node)
>>>>>   tsk = copy_process(NULL, 0, node, );
>>>>>   if (!IS_ERR(tsk)) {
>>>>>   sigfillset(>blocked);
>>>>> - sigdelsetmask(>blocked, sigmask(SIGKILL));
>>>>> + sigdelsetmask(>blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>>
>>>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is 
>>>> minor.
>>>
>>> Ah thanks.
>>>
>>>> To remind, either way this is racy and can't really help.
>>>>
>>>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>>>> I must have missed something.
>>>
>>> I do think the above is a no-op at this point, and we can probably just
>>> kill it. Let me double check, hopefully we can just remove this blocked
>>> part.
>>
>> Is this really correct to drop in your "kernel: stop masking signals in 
>> create_io_thread()"
>> commit?
>>
>> I don't assume signals wanted by userspace should potentially handled in an 
>> io_thread...
>> e.g. things set with fcntl(fd, F_SETSIG,) used together with F_SETLEASE?
> 
> I guess we do actually need it, if we're not fiddling with
> wants_signal() for them. To quell Oleg's concerns, we can just move it
> to post dup_task_struct(), that should eliminate any race concerns
> there.

If that one is racy, don' we better also want this one?
https://lore.kernel.org/io-uring/438b738c1e4827a7fdfe43087da88bbe17eedc72.1616197787.git.me...@samba.org/T/#u

And clear tsk->pf_io_worker ?

metze


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:01 schrieb Jens Axboe:
> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>> Jens, sorry, I got lost :/
> 
> Let's bring you back in :-)
> 
>> On 03/25, Jens Axboe wrote:
>>>
>>> With IO threads accepting signals, including SIGSTOP,
>>
>> where can I find this change? Looks like I wasn't cc'ed...
> 
> It's this very series.
> 
>>> unmask the
>>> SIGSTOP signal from the default blocked mask.
>>>
>>> Signed-off-by: Jens Axboe 
>>> ---
>>>  kernel/fork.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index d3171e8e88e5..d5a40552910f 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void 
>>> *), void *arg, int node)
>>> tsk = copy_process(NULL, 0, node, );
>>> if (!IS_ERR(tsk)) {
>>> sigfillset(>blocked);
>>> -   sigdelsetmask(>blocked, sigmask(SIGKILL));
>>> +   sigdelsetmask(>blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>
>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.
> 
> Ah thanks.
> 
>> To remind, either way this is racy and can't really help.
>>
>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>> I must have missed something.
> 
> I do think the above is a no-op at this point, and we can probably just
> kill it. Let me double check, hopefully we can just remove this blocked
> part.

Is this really correct to drop in your "kernel: stop masking signals in 
create_io_thread()"
commit?

I don't assume signals wanted by userspace should potentially handled in an 
io_thread...
e.g. things set with fcntl(fd, F_SETSIG,) used together with F_SETLEASE?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:10 schrieb Jens Axboe:
> On 3/26/21 9:08 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 15:55 schrieb Jens Axboe:
>>> On 3/26/21 8:53 AM, Jens Axboe wrote:
>>>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>>>
>>>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>>>
>>>>>>>>>> Does io_wq_manager() exits without cleaning up on SIGKILL?
>>>>>>>>>
>>>>>>>>> No, it should kill up in all cases. I'll try your stop + kill, I just
>>>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>>>> pushed), fwiw.
>>>>>>>>
>>>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>>>
>>>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>>>> and never return. So we might need a special case in there to deal with
>>>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>>>> correctly for IO threads.
>>>>>>
>>>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>>>> being called?
>>>>>
>>>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>>>
>>>> Yes exactly, we're waiting in there being stopped. So we either need to
>>>> check to something ala:
>>>>
>>>> relock:
>>>> +  if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>>>> +  return false;
>>>>
>>>> to catch it upfront and from the relock case, or add:
>>>>
>>>>fatal:
>>>> +  if (current->flags & PF_IO_WORKER)
>>>> +  return false;
>>>>
>>>> to catch it in the fatal section.
>>>
>>> Can you try this? Not crazy about adding a special case, but I don't
>>> think there's any way around this one. And should be pretty cheap, as
>>> we're already pulling in ->flags right above anyway.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 5ad8566534e7..5b75fbe3d2d6 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>>>  */
>>> current->flags |= PF_SIGNALED;
>>>  
>>> +   /*
>>> +* PF_IO_WORKER threads will catch and exit on fatal signals
>>> +* themselves. They have cleanup that must be performed, so
>>> +* we cannot call do_exit() on their behalf. coredumps also
>>> +* do not apply to them.
>>> +*/
>>> +   if (current->flags & PF_IO_WORKER)
>>> +   return false;
>>> +
>>> if (sig_kernel_coredump(signr)) {
>>> if (print_fatal_signals)
>>> print_fatal_signal(ksig->info.si_signo);
>>>
>>
>> I guess not before next week, but if it resolves the problem for you,
>> I guess it would be good to get this into rc5.
> 
> It does, I pushed out a new branch. I'll send out a v2 series in a bit.

Great, thanks!

Any chance to get the "cmdline" hiding included?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:55 schrieb Jens Axboe:
> On 3/26/21 8:53 AM, Jens Axboe wrote:
>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>
>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>
>>>>>>>> Does io_wq_manager() exits without cleaning up on SIGKILL?
>>>>>>>
>>>>>>> No, it should kill up in all cases. I'll try your stop + kill, I just
>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>> pushed), fwiw.
>>>>>>
>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>
>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>> and never return. So we might need a special case in there to deal with
>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>> correctly for IO threads.
>>>>
>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>> being called?
>>>
>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>
>> Yes exactly, we're waiting in there being stopped. So we either need to
>> check to something ala:
>>
>> relock:
>> +if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>> +return false;
>>
>> to catch it upfront and from the relock case, or add:
>>
>>  fatal:
>> +if (current->flags & PF_IO_WORKER)
>> +return false;
>>
>> to catch it in the fatal section.
> 
> Can you try this? Not crazy about adding a special case, but I don't
> think there's any way around this one. And should be pretty cheap, as
> we're already pulling in ->flags right above anyway.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5ad8566534e7..5b75fbe3d2d6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf. coredumps also
> +  * do not apply to them.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + return false;
> +
>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);
> 

I guess not before next week, but if it resolves the problem for you,
I guess it would be good to get this into rc5.

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher


Am 26.03.21 um 15:53 schrieb Jens Axboe:
> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>
>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>> incremental you sent, which wasn't complete.
>>>>>>
>>>>>>> Does io_wq_manager() exits without cleaning up on SIGKILL?
>>>>>>
>>>>>> No, it should kill up in all cases. I'll try your stop + kill, I just
>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>> pushed), fwiw.
>>>>>
>>>>> I can reproduce this one! I'll take a closer look.
>>>>
>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>> and never return. So we might need a special case in there to deal with
>>>> that, or some other way of ensuring that fatal signal gets processed
>>>> correctly for IO threads.
>>>
>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>> being called?
>>
>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
> 
> Yes exactly, we're waiting in there being stopped. So we either need to
> check to something ala:
> 
> relock:
> + if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
> + return false;
> 
> to catch it upfront and from the relock case, or add:
> 
>   fatal:
> + if (current->flags & PF_IO_WORKER)
> + return false;
> 
> to catch it in the fatal section.
> 

Or something like io_uring_files_cancel()

Maybe change current->pf_io_worker with a generic current->io_thread
structure which, has exit hooks, as well as
io_wq_worker_sleeping() and io_wq_worker_running().

Maybe create_io_thread would take such an structure
as argument instead of a single function pointer.

struct io_thread_description {
const char *name;
int (*thread_fn)(struct io_thread_description *);
void (*sleeping_fn)((struct io_thread_description *);
void (*running_fn)((struct io_thread_description *);
void (*exit_fn)((struct io_thread_description *);
};

And then
struct io_wq_manager {
struct io_thread_description description;
... manager specific stuff...
};

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>> The KILL after STOP deadlock still exists.
>>>>
>>>> In which tree? Sounds like you're still on the old one with that
>>>> incremental you sent, which wasn't complete.
>>>>
>>>>> Does io_wq_manager() exits without cleaning up on SIGKILL?
>>>>
>>>> No, it should kill up in all cases. I'll try your stop + kill, I just
>>>> tested both of them separately and didn't observe anything. I also ran
>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>> pushed), fwiw.
>>>
>>> I can reproduce this one! I'll take a closer look.
>>
>> OK, that one is actually pretty straight forward - we rely on cleaning
>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>> and never return. So we might need a special case in there to deal with
>> that, or some other way of ensuring that fatal signal gets processed
>> correctly for IO threads.
> 
> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
> being called?

Ah, we're still in the first get_signal() from SIGSTOP, correct?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:38 schrieb Jens Axboe:
> On 3/26/21 7:59 AM, Jens Axboe wrote:
>> On 3/26/21 7:54 AM, Jens Axboe wrote:
 The KILL after STOP deadlock still exists.
>>>
>>> In which tree? Sounds like you're still on the old one with that
>>> incremental you sent, which wasn't complete.
>>>
 Does io_wq_manager() exits without cleaning up on SIGKILL?
>>>
>>> No, it should kill up in all cases. I'll try your stop + kill, I just
>>> tested both of them separately and didn't observe anything. I also ran
>>> your io_uring-cp example (and found a bug in the example, fixed and
>>> pushed), fwiw.
>>
>> I can reproduce this one! I'll take a closer look.
> 
> OK, that one is actually pretty straight forward - we rely on cleaning
> up on exit, but for fatal cases, get_signal() will call do_exit() for us
> and never return. So we might need a special case in there to deal with
> that, or some other way of ensuring that fatal signal gets processed
> correctly for IO threads.

And if (fatal_signal_pending(current)) doesn't prevent get_signal() from being 
called?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 13:56 schrieb Jens Axboe:
> On 3/26/21 5:48 AM, Stefan Metzmacher wrote:
>>
>> Am 26.03.21 um 01:39 schrieb Jens Axboe:
>>> Hi,
>>>
>>> As discussed in a previous thread today, the seemingly much saner approach
>>> is just to allow signals (including SIGSTOP) for the PF_IO_WORKER IO
>>> threads. If we just have the threads call get_signal() for
>>> signal_pending(), then everything just falls out naturally with how
>>> we receive and handle signals.
>>>
>>> Patch 1 adds support for checking and calling get_signal() from the
>>> regular IO workers, the manager, and the SQPOLL thread. Patch 2 unblocks
>>> SIGSTOP from the default IO thread blocked mask, and the rest just revert
>>> special cases that were put in place for PF_IO_WORKER threads.
>>>
>>> With this done, only two special cases remain for PF_IO_WORKER, and they
>>> aren't related to signals so not part of this patchset. But both of them
>>> can go away as well now that we have "real" threads as IO workers, and
>>> then we'll have zero special cases for PF_IO_WORKER.
>>>
>>> This passes the usual regression testing, my other usual 24h run has been
>>> kicked off. But I wanted to send this out early.
>>>
>>> Thanks to Linus for the suggestion. As with most other good ideas, it's
>>> obvious once you hear it. The fact that we end up with _zero_ special
>>> cases with this is a clear sign that this is the right way to do it
>>> indeed. The fact that this series is 2/3rds revert further drives that
>>> point home. Also thanks to Eric for diligent review on the signal side
>>> of things for the past changes (and hopefully ditto on this series :-))
>>
>> Ok, I'm testing a8ff6a3b20bd16d071ef66824ae4428529d114f9 from
>> your io_uring-5.12 branch.
>>
>> And using this patch:
>> diff --git a/examples/io_uring-cp.c b/examples/io_uring-cp.c
>> index cc7a227a5ec7..6e26a4214015 100644
>> --- a/examples/io_uring-cp.c
>> +++ b/examples/io_uring-cp.c
>> @@ -116,13 +116,16 @@ static void queue_write(struct io_uring *ring, struct 
>> io_data *data)
>> io_uring_submit(ring);
>>  }
>>
>> -static int copy_file(struct io_uring *ring, off_t insize)
>> +static int copy_file(struct io_uring *ring, off_t _insize)
>>  {
>> +   off_t insize = _insize;
>> unsigned long reads, writes;
>> struct io_uring_cqe *cqe;
>> off_t write_left, offset;
>> int ret;
>>
>> +again:
>> +   insize = _insize;
>> write_left = insize;
>> writes = reads = offset = 0;
>>
>> @@ -221,6 +224,12 @@ static int copy_file(struct io_uring *ring, off_t 
>> insize)
>> }
>> }
>>
>> +   {
>> +   struct timespec ts = { .tv_nsec = 99, };
>> +   nanosleep(, NULL);
>> +   goto again;
>> +   }
>> +
>> return 0;
>>  }
>>
>> Running ./io_uring-cp ~/linux-image-5.12.0-rc2+-dbg_5.12.0-rc2+-5_amd64.deb 
>> file
>> What I see is this:
>>
>> kill -SIGSTOP to any thread I used a worker with pid 2061 here, results in
>>
>> root@ub1704-166:~# head /proc/2061/status
>> Name:   iou-wrk-2041
>> Umask:  0022
>> State:  R (running)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2061
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>> root@ub1704-166:~# head /proc/2041/status
>> Name:   io_uring-cp
>> Umask:  0022
>> State:  T (stopped)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2041
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>> root@ub1704-166:~# head /proc/2042/status
>> Name:   iou-mgr-2041
>> Umask:  0022
>> State:  T (stopped)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2042
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>>
>> So userspace and iou-mgr-2041 stop, but the workers don't.
>> 49 workers burn cpu as much as possible.
>>
>> kill -KILL 2061
>> results in this:
>> - all workers are gone
>> - iou-mgr-2041 is gone
>> - io_uring-cp waits in status D forever
>>
>> root@ub1704-166:~# head /proc/2041/status
>> Name:   io_uring-cp
>> Umask:  0022
>> State:  D (disk slee

Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-26 Thread Stefan Metzmacher


Hi Jens,

>> And /proc/$iothread/ should be read only and owned by root with
>> "cmdline" and "exe" being empty.
> 
> I know you brought this one up as part of your series, not sure I get
> why you want it owned by root and read-only? cmdline and exe, yeah those
> could be hidden, but is there really any point?
> 
> Maybe I'm missing something here, if so, do clue me in!

I looked through /proc and I think it's mostly similar to
the unshare() case, if userspace wants to do stupid things
like changing "comm" of iothreads, it gets what was asked for.

But the "cmdline" hiding would be very useful.

While most tools use "comm", by default.

ps -eLf or 'iotop' use "cmdline".

Some processes use setproctitle to change "cmdline" in order
to identify the process better, without the 15 chars comm restriction,
that's why I very often press 'c' in 'top' to see the cmdline,
in that case it would be very helpful to see '[iou-wrk-1234]'
instead of the seeing the cmdline.

So I'd very much prefer if this could be applied:
https://lore.kernel.org/io-uring/d4487f959c778d0b1d4c5738b75bcff17d21df5b.1616197787.git.me...@samba.org/T/#u

If you want I can add a comment and a more verbose commit message...

metze


Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Stefan Metzmacher


Am 25.03.21 um 22:44 schrieb Jens Axboe:
> On 3/25/21 2:40 PM, Jens Axboe wrote:
>> On 3/25/21 2:12 PM, Linus Torvalds wrote:
>>> On Thu, Mar 25, 2021 at 12:42 PM Linus Torvalds
>>>  wrote:

 On Thu, Mar 25, 2021 at 12:38 PM Linus Torvalds
  wrote:
>
> I don't know what the gdb logic is, but maybe there's some other
> option that makes gdb not react to them?

 .. maybe we could have a different name for them under the task/
 subdirectory, for example (not  just the pid)? Although that probably
 messes up 'ps' too..
>>>
>>> Actually, maybe the right model is to simply make all the io threads
>>> take signals, and get rid of all the special cases.
>>>
>>> Sure, the signals will never be delivered to user space, but if we
>>>
>>>  - just made the thread loop do "get_signal()" when there are pending 
>>> signals
>>>
>>>  - allowed ptrace_attach on them
>>>
>>> they'd look pretty much like regular threads that just never do the
>>> user-space part of signal handling.
>>>
>>> The whole "signals are very special for IO threads" thing has caused
>>> so many problems, that maybe the solution is simply to _not_ make them
>>> special?
>>
>> Just to wrap up the previous one, yes it broke all sorts of things to
>> make the 'tid' directory different. They just end up being hidden anyway
>> through that, for both ps and top.
>>
>> Yes, I do think that maybe it's better to just embrace maybe just
>> embrace the signals, and have everything just work by default. It's
>> better than continually trying to make the threads special. I'll see
>> if there are some demons lurking down that path.
> 
> In the spirit of "let's just try it", I ran with the below patch. With
> that, I can gdb attach just fine to a test case that creates an io_uring
> and a regular thread with pthread_create(). The regular thread uses
> the ring, so you end up with two iou-mgr threads. Attach:
> 
> [root@archlinux ~]# gdb -p 360
> [snip gdb noise]
> Attaching to process 360
> [New LWP 361]
> [New LWP 362]
> [New LWP 363]
> 
> warning: Selected architecture i386:x86-64 is not compatible with reported 
> target architecture i386
> 
> warning: Architecture rejected target-supplied description
> Error while reading shared library symbols for /usr/lib/libpthread.so.0:
> Cannot find user-level thread for LWP 363: generic error
> 0x7f7aa526e125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
> (gdb) info threads
>   Id   Target Id Frame 
> * 1LWP 360 "io_uring"0x7f7aa526e125 in 
> clock_nanosleep@GLIBC_2.2.5 ()
>from /usr/lib/libc.so.6
>   2LWP 361 "iou-mgr-360" 0x in ?? ()
>   3LWP 362 "io_uring"0x7f7aa52a0a9d in syscall () from 
> /usr/lib/libc.so.6
>   4LWP 363 "iou-mgr-362" 0x in ?? ()
> (gdb) thread 2
> [Switching to thread 2 (LWP 361)]
> #0  0x in ?? ()
> (gdb) bt
> #0  0x in ?? ()
> Backtrace stopped: Cannot access memory at address 0x0
> (gdb) cont
> Continuing.
> ^C
> Thread 1 "io_uring" received signal SIGINT, Interrupt.
> [Switching to LWP 360]
> 0x7f7aa526e125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
> (gdb) q
> A debugging session is active.
> 
>   Inferior 1 [process 360] will be detached.
> 
> Quit anyway? (y or n) y
> Detaching from program: /root/git/fio/t/io_uring, process 360
> [Inferior 1 (process 360) detached]
> 
> The iou-mgr-x threads are stopped just fine, gdb obviously can't get any
> real info out of them. But it works... Regular test cases work fine too,
> just a sanity check. Didn't expect them not to.

I guess that's basically what I tried to describe when I said they should
look like a userspace process that is blocked in a syscall forever.

> Only thing that I dislike a bit, but I guess that's just a Linuxism, is
> that if can now kill an io_uring owning task by sending a signal to one
> of its IO thread workers.

Can't we just only allow SIGSTOP, which will be only delivered to
the iothread itself? And also SIGKILL should not be allowed from userspace.

And /proc/$iothread/ should be read only and owned by root with
"cmdline" and "exe" being empty.

Thanks!
metze


Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Stefan Metzmacher


Am 25.03.21 um 22:20 schrieb Stefan Metzmacher:
> 
> Am 25.03.21 um 21:55 schrieb Eric W. Biederman:
>> Oleg Nesterov  writes:
>>
>>> On 03/25, Linus Torvalds wrote:
>>>>
>>>> The whole "signals are very special for IO threads" thing has caused
>>>> so many problems, that maybe the solution is simply to _not_ make them
>>>> special?
>>>
>>> Or may be IO threads should not abuse CLONE_THREAD?
>>>
>>> Why does create_io_thread() abuse CLONE_THREAD ?
>>>
>>> One reason (I think) is that this implies SIGKILL when the process 
>>> exits/execs,
>>> anything else?
>>
>> A lot.
>>
>> The io workers perform work on behave of the ordinary userspace threads.
>> Some of that work is opening files.  For things like rlimits to work
>> properly you need to share the signal_struct.  But odds are if you find
>> anything in signal_struct (not counting signals) there will be an
>> io_uring code path that can exercise it as io_uring can traverse the
>> filesystem, open files and read/write files.  So io_uring can exercise
>> all of proc.
>>
>> Using create_io_thread with CLONE_THREAD is the least problematic way
>> (including all of the signal and ptrace problems we are looking at right
>> now) to implement the io worker threads.
>>
>> They _really_ are threads of the process that just never execute any
>> code in userspace.
> 
> So they should look like a userspace thread sitting in something like
> epoll_pwait() with all signals blocked, which will never return to userspace 
> again?

Would gdb work with that?
The question is what backtrace gdb would show for that thread.

Is it possible to block SIGSTOP/SIGCONT?

I also think that all signals to an iothread should not be delivered to
other threads and it may only react on a direct SIGSTOP/SIGCONT.
I guess even SIGKILL should be ignored as the shutdown should happen
via the exit path of the iothread parent only.

> I think that would be useful, but I also think that userspace should see:
> - /proc/$tidofiothread/cmdline as empty (in order to let ps and top use 
> [iou-wrk-$tidofuserspacethread])
> - /proc/$tidofiothread/exe as symlink to that not exists
> - all of /proc/$tidofiothread/ shows root.root as owner and group
>   and things which still allow write access to /proc/$tidofiothread/comm 
> similar things
>   with rw permissions should still disallow modifications:
> 
> For the other kernel threads e.g. "[cryptd]" I see the following:
> 
> LANG=C ls -l /proc/653 | grep rw
> ls: cannot read symbolic link '/proc/653/exe': No such file or directory
> -rw-r--r--  1 root root 0 Mar 25 22:09 autogroup
> -rw-r--r--  1 root root 0 Mar 25 22:09 comm
> -rw-r--r--  1 root root 0 Mar 25 22:09 coredump_filter
> lrwxrwxrwx  1 root root 0 Mar 25 22:09 cwd -> /
> lrwxrwxrwx  1 root root 0 Mar 25 22:09 exe
> -rw-r--r--  1 root root 0 Mar 25 22:09 gid_map
> -rw-r--r--  1 root root 0 Mar 25 22:09 loginuid
> -rw---  1 root root 0 Mar 25 22:09 mem
> -rw-r--r--  1 root root 0 Mar 25 22:09 oom_adj
> -rw-r--r--  1 root root 0 Mar 25 22:09 oom_score_adj
> -rw-r--r--  1 root root 0 Mar 25 22:09 projid_map
> lrwxrwxrwx  1 root root 0 Mar 25 22:09 root -> /
> -rw-r--r--  1 root root 0 Mar 25 22:09 sched
> -rw-r--r--  1 root root 0 Mar 25 22:09 setgroups
> -rw-r--r--  1 root root 0 Mar 25 22:09 timens_offsets
> -rw-rw-rw-  1 root root 0 Mar 25 22:09 timerslack_ns
> -rw-r--r--  1 root root 0 Mar 25 22:09 uid_map
> 
> And this:
> 
> LANG=C echo "bla" > /proc/653/comm
> -bash: echo: write error: Invalid argument
> 
> LANG=C echo "bla" > /proc/653/gid_map
> -bash: echo: write error: Operation not permitted
> 
> Can't we do the same for iothreads regarding /proc?
> Just make things read only there and empty "cmdline"/"exe"?
> 
> Maybe I'm too naive, but that what I'd assume as a userspace developer/admin.
> 
> Does at least parts of it make any sense?

I think the strange glibc setuid() behavior should also be tests here,
I guess we don't want that to reset the credentials of an iothread!

Another idea would be to have the iothreads as a child process with it's 
threads,
but again I'm only looking as an admin to what I'd except to see under /proc
via ps and top.

metze


Re: [PATCH 0/2] Don't show PF_IO_WORKER in /proc//task/

2021-03-25 Thread Stefan Metzmacher


Am 25.03.21 um 21:55 schrieb Eric W. Biederman:
> Oleg Nesterov  writes:
> 
>> On 03/25, Linus Torvalds wrote:
>>>
>>> The whole "signals are very special for IO threads" thing has caused
>>> so many problems, that maybe the solution is simply to _not_ make them
>>> special?
>>
>> Or may be IO threads should not abuse CLONE_THREAD?
>>
>> Why does create_io_thread() abuse CLONE_THREAD ?
>>
>> One reason (I think) is that this implies SIGKILL when the process 
>> exits/execs,
>> anything else?
> 
> A lot.
> 
> The io workers perform work on behave of the ordinary userspace threads.
> Some of that work is opening files.  For things like rlimits to work
> properly you need to share the signal_struct.  But odds are if you find
> anything in signal_struct (not counting signals) there will be an
> io_uring code path that can exercise it as io_uring can traverse the
> filesystem, open files and read/write files.  So io_uring can exercise
> all of proc.
> 
> Using create_io_thread with CLONE_THREAD is the least problematic way
> (including all of the signal and ptrace problems we are looking at right
> now) to implement the io worker threads.
> 
> They _really_ are threads of the process that just never execute any
> code in userspace.

So they should look like a userspace thread sitting in something like
epoll_pwait() with all signals blocked, which will never return to userspace 
again?

I think that would be useful, but I also think that userspace should see:
- /proc/$tidofiothread/cmdline as empty (in order to let ps and top use 
[iou-wrk-$tidofuserspacethread])
- /proc/$tidofiothread/exe as symlink to that not exists
- all of /proc/$tidofiothread/ shows root.root as owner and group
  and things which still allow write access to /proc/$tidofiothread/comm 
similar things
  with rw permissions should still disallow modifications:

For the other kernel threads e.g. "[cryptd]" I see the following:

LANG=C ls -l /proc/653 | grep rw
ls: cannot read symbolic link '/proc/653/exe': No such file or directory
-rw-r--r--  1 root root 0 Mar 25 22:09 autogroup
-rw-r--r--  1 root root 0 Mar 25 22:09 comm
-rw-r--r--  1 root root 0 Mar 25 22:09 coredump_filter
lrwxrwxrwx  1 root root 0 Mar 25 22:09 cwd -> /
lrwxrwxrwx  1 root root 0 Mar 25 22:09 exe
-rw-r--r--  1 root root 0 Mar 25 22:09 gid_map
-rw-r--r--  1 root root 0 Mar 25 22:09 loginuid
-rw---  1 root root 0 Mar 25 22:09 mem
-rw-r--r--  1 root root 0 Mar 25 22:09 oom_adj
-rw-r--r--  1 root root 0 Mar 25 22:09 oom_score_adj
-rw-r--r--  1 root root 0 Mar 25 22:09 projid_map
lrwxrwxrwx  1 root root 0 Mar 25 22:09 root -> /
-rw-r--r--  1 root root 0 Mar 25 22:09 sched
-rw-r--r--  1 root root 0 Mar 25 22:09 setgroups
-rw-r--r--  1 root root 0 Mar 25 22:09 timens_offsets
-rw-rw-rw-  1 root root 0 Mar 25 22:09 timerslack_ns
-rw-r--r--  1 root root 0 Mar 25 22:09 uid_map

And this:

LANG=C echo "bla" > /proc/653/comm
-bash: echo: write error: Invalid argument

LANG=C echo "bla" > /proc/653/gid_map
-bash: echo: write error: Operation not permitted

Can't we do the same for iothreads regarding /proc?
Just make things read only there and empty "cmdline"/"exe"?

Maybe I'm too naive, but that what I'd assume as a userspace developer/admin.

Does at least parts of it make any sense?

metze


Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 14:38 schrieb Jens Axboe:
> On 3/25/21 6:11 AM, Stefan Metzmacher wrote:
>>
>> Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
>>> Stefan Metzmacher  writes:
>>>
>>>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>>>> From: "Eric W. Biederman" 
>>>>>
>>>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>>>
>>>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>>>> signals in general, and have no means of flushing out a stop either.
>>>>>
>>>>> Longer term, we may want to look into allowing stop of these threads,
>>>>> as it relates to eg process freezing. For now, this prevents a spin
>>>>> issue if a SIGSTOP is delivered to the parent task.
>>>>>
>>>>> Reported-by: Stefan Metzmacher 
>>>>> Signed-off-by: Jens Axboe 
>>>>> Signed-off-by: "Eric W. Biederman" 
>>>>> Signed-off-by: Sasha Levin 
>>>>> ---
>>>>>  kernel/signal.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>>> index 55526b941011..00a3840f6037 100644
>>>>> --- a/kernel/signal.c
>>>>> +++ b/kernel/signal.c
>>>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct 
>>>>> *task, unsigned long mask)
>>>>>   JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>>>>   BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>>>>  
>>>>> - if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>>>>> + if (unlikely(fatal_signal_pending(task) ||
>>>>> +  (task->flags & (PF_EXITING | PF_IO_WORKER
>>>>>   return false;
>>>>>  
>>>>>   if (mask & JOBCTL_STOP_SIGMASK)
>>>>>
>>>>
>>>> Again, why is this proposed for 5.11 and 5.10 already?
>>>
>>> Has the bit about the io worker kthreads been backported?
>>> If so this isn't horrible.  If not this is nonsense.
> 
> No not yet - my plan is to do that, but not until we're 100% satisfied
> with it.

Do you understand why the patches where autoselected for 5.11 and 5.10?

>> I don't know, I hope not...
>>
>> But I just tested v5.12-rc4 and attaching to
>> an application with iothreads with gdb is still not possible,
>> it still loops forever trying to attach to the iothreads.
> 
> I do see the looping, gdb apparently doesn't give up when it gets
> -EPERM trying to attach to the threads. Which isn't really a kernel
> thing, but:

Maybe we need to remove the iothreads from /proc/pid/tasks/

>> And I tested 'kill -9 $pidofiothread', and it feezed the whole
>> machine...
> 
> that sounds very strange, I haven't seen anything like that running
> the exact same scenario.
> 
>> So there's still work to do in order to get 5.12 stable.
>>
>> I'm short on time currently, but I hope to send more details soon.
> 
> Thanks! I'll play with it this morning and see if I can provoke
> something odd related to STOP/attach.

Thanks!

Somehow I have the impression that your same_thread_group_account patch
may fix a lot of things...

metze



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher


Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
> Stefan Metzmacher  writes:
> 
>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>> From: "Eric W. Biederman" 
>>>
>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>
>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>> signals in general, and have no means of flushing out a stop either.
>>>
>>> Longer term, we may want to look into allowing stop of these threads,
>>> as it relates to eg process freezing. For now, this prevents a spin
>>> issue if a SIGSTOP is delivered to the parent task.
>>>
>>> Reported-by: Stefan Metzmacher 
>>> Signed-off-by: Jens Axboe 
>>> Signed-off-by: "Eric W. Biederman" 
>>> Signed-off-by: Sasha Levin 
>>> ---
>>>  kernel/signal.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 55526b941011..00a3840f6037 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
>>> unsigned long mask)
>>> JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>> BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>>  
>>> -   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>>> +   if (unlikely(fatal_signal_pending(task) ||
>>> +(task->flags & (PF_EXITING | PF_IO_WORKER
>>> return false;
>>>  
>>> if (mask & JOBCTL_STOP_SIGMASK)
>>>
>>
>> Again, why is this proposed for 5.11 and 5.10 already?
> 
> Has the bit about the io worker kthreads been backported?
> If so this isn't horrible.  If not this is nonsense.

I don't know, I hope not...

But I just tested v5.12-rc4 and attaching to
an application with iothreads with gdb is still not possible,
it still loops forever trying to attach to the iothreads.

And I tested 'kill -9 $pidofiothread', and it feezed the whole
machine...

So there's still work to do in order to get 5.12 stable.

I'm short on time currently, but I hope to send more details soon.

metze


Re: [PATCH AUTOSEL 5.11 42/44] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 12:24 schrieb Sasha Levin:
> From: Jens Axboe 
> 
> [ Upstream commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5 ]
> 
> They don't take signals individually, and even if they share signals with
> the parent task, don't allow them to be delivered through the worker
> thread. Linux does allow this kind of behavior for regular threads, but
> it's really a compatability thing that we need not care about for the IO
> threads.
> 
> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 
> ---
>  kernel/signal.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5ad8566534e7..55526b941011 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct 
> kernel_siginfo *info,
>  
>   if (!valid_signal(sig))
>   return -EINVAL;
> + /* PF_IO_WORKER threads don't take any signals */
> + if (t->flags & PF_IO_WORKER)
> + return -ESRCH;

Why is that proposed for 5.11 and 5.10 now?

Are the create_io_thread() patches already backported?

I think we should hold on with the backports until
everything is stable in v5.12 final.

I'm still about to test on top of v5.12-rc4
and have a pending mail why I think this particular change is
wrong even in 5.12.

Jens, did you send these to stable?

metze



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 12:24 schrieb Sasha Levin:
> From: "Eric W. Biederman" 
> 
> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
> 
> Just like we don't allow normal signals to IO threads, don't deliver a
> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
> signals in general, and have no means of flushing out a stop either.
> 
> Longer term, we may want to look into allowing stop of these threads,
> as it relates to eg process freezing. For now, this prevents a spin
> issue if a SIGSTOP is delivered to the parent task.
> 
> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> Signed-off-by: "Eric W. Biederman" 
> Signed-off-by: Sasha Levin 
> ---
>  kernel/signal.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 55526b941011..00a3840f6037 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
> unsigned long mask)
>   JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>   BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>  
> - if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
> + if (unlikely(fatal_signal_pending(task) ||
> +  (task->flags & (PF_EXITING | PF_IO_WORKER
>   return false;
>  
>   if (mask & JOBCTL_STOP_SIGMASK)
> 

Again, why is this proposed for 5.11 and 5.10 already?

metze


Re: [Linux-cifsd-devel] [PATCH 2/5] cifsd: add server-side procedures for SMB3

2021-03-22 Thread Stefan Metzmacher

Am 22.03.21 um 07:50 schrieb Christoph Hellwig:
> On Mon, Mar 22, 2021 at 09:47:13AM +0300, Dan Carpenter wrote:
>> On Mon, Mar 22, 2021 at 02:13:41PM +0900, Namjae Jeon wrote:
>>> +static unsigned char
>>> +asn1_octet_decode(struct asn1_ctx *ctx, unsigned char *ch)
>>> +{
>>> +   if (ctx->pointer >= ctx->end) {
>>> +   ctx->error = ASN1_ERR_DEC_EMPTY;
>>> +   return 0;
>>> +   }
>>> +   *ch = *(ctx->pointer)++;
>>> +   return 1;
>>> +}
>>
>>
>> Make this bool.
>>
> 
> More importantly don't add another ANS1 parser, but use the generic
> one in lib/asn1_decoder.c instead.  CIFS should also really use it.

I think the best would be to avoid asn1 completely in the kernel
and do the whole authentication in userspace.

The kernel can only deal this blobs here, I don't there's need to
look inside the blobs.

1. ksmbd-mount would provide a fixed initial blob that's always
   the same and will be returned in the
   "2.2.4 SMB2 NEGOTIATE Response" PDU as SecurityBuffer

2. The kernel just blindly forwards the SecurityBuffer
   of "2.2.5 SMB2 SESSION_SETUP Request" to userspace
   together with the client provided SessionId (from
   2.2.1.2 SMB2 Packet Header - SYNC) as well as
   negotiated signing and encryption algorithm ids
   and the latest preauth hash.

3. Userspace passes a NTSTATUS together with SecurityBuffer blob for the
   2.2.6 SMB2 SESSION_SETUP Response back to the kernel:

   - NT_STATUS_MORE_PROCESSING_REQUIRED (more authentication legs are required)
 SecurityBuffer is most likely a non empty buffer

   - NT_STATUS_OK - The authentication is complete:
 SecurityBuffer might be empty or not
 It also pass a channel signing key, a decryption and encrytion key
 as well as the unix token ( I guess in the current form it's only uid/gid)
 down to the kernel

   - Any other status means the authentication failed, which is a hard error 
for the client

The PDU definitions are defined here:
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-smb2/6eaf6e75-9c23-4eda-be99-c9223c60b181

I think everything else belongs to userspace.

Such a "simple" design for the kernel part, would mean that ksmbd-mount would 
do what the
kernel part is currently doing, but it also means it will be trivial to plug 
the userspace
part to samba's winbindd in future order to get domain wide authentication.

metze



OpenPGP_signature
Description: OpenPGP digital signature


Re: linux-next: Signed-off-by missing for commit in the block tree

2021-03-19 Thread Stefan Metzmacher


Am 19.03.21 um 14:08 schrieb Jens Axboe:
> On 3/19/21 2:02 AM, Stefan Metzmacher wrote:
>>
>> Am 19.03.21 um 00:25 schrieb Jens Axboe:
>>> On 3/18/21 5:16 PM, Stephen Rothwell wrote:
>>>> Hi all,
>>>>
>>>> Commit
>>>>
>>>>   c2c6c067c050 ("io_uring: remove structures from 
>>>> include/linux/io_uring.h")
>>>>
>>>> is missing a Signed-off-by from its author.
>>>
>>> Stefan, let me know if you're OK with me adding that, not sure how I missed
>>> that.
>>
>> Yes, sure :-)
>> I guess you removed it while adding 'Link:'
> 
> That was b4, I don't add those manually. But maybe it stripped those too,
> annoying...
> 
>> You may want to remove cc: stable from 
>> 3aab52c9a708f7183460d368700181ef0c2a09e6
>> ("io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls")
>> for now.
>>
>> I'll want to do some more test with it on 5.12,
>> I guess we'd then have to backport it to stable as part of the
>> io_thread worker backport. I'll post some more details later
>> to the io-uring list.
> 
> Sure, let's do that. I also dropped the short link sever as well for now.
> I do like it on principle, but it does have a risk of breaking valid
> use cases.

Thanks, I'll resubmit that with the MSG_WAITALL logic.

metze


Re: linux-next: Signed-off-by missing for commit in the block tree

2021-03-19 Thread Stefan Metzmacher


Am 19.03.21 um 00:25 schrieb Jens Axboe:
> On 3/18/21 5:16 PM, Stephen Rothwell wrote:
>> Hi all,
>>
>> Commit
>>
>>   c2c6c067c050 ("io_uring: remove structures from include/linux/io_uring.h")
>>
>> is missing a Signed-off-by from its author.
> 
> Stefan, let me know if you're OK with me adding that, not sure how I missed
> that.

Yes, sure :-)
I guess you removed it while adding 'Link:'

You may want to remove cc: stable from 3aab52c9a708f7183460d368700181ef0c2a09e6
("io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls")
for now.

I'll want to do some more test with it on 5.12,
I guess we'd then have to backport it to stable as part of the
io_thread worker backport. I'll post some more details later
to the io-uring list.

Thanks!
metze


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread Stefan Metzmacher
Am 08.03.21 um 12:11 schrieb David Laight:
> From: Stefan Metzmacher
>> Sent: 07 March 2021 11:35
>>
>> Hi André,
>>>  ** The wait on multiple problem
>>>
>>>  The use case lies in the Wine implementation of the Windows NT interface
>>>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>>>  waiting on the first of a set of event sources (mutexes, timers, signal,
>>>  console input, etc) to signal.
> 
> They are all events.
> You can only wait on either events or sockets (using select).
> There is a socket api to signal an event when data arrives (etc).
> There is also the insane (these days) restriction of 64 events.

Ok.

>> With that in mind would it be good to have some interaction with epoll (and 
>> similar calls)?
> 
> Or hook something up so that pollwakeup can kick a futex as well
> as waking up poll() and adding an event to epoll().

I guess as FUTEX_FD was already there and was removed we can stop this 
discussion.

If there will every be the need to an async call, I guess a io_uring based one 
would
be the best...

metze



Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-08 Thread Stefan Metzmacher


Am 07.03.21 um 12:56 schrieb Daurnimator:
> On Sun, 7 Mar 2021 at 22:35, Stefan Metzmacher  wrote:
>> Instead of having a blocked futex_waitv() waiting on an fd (maybe a generic 
>> eventfd() or a new futex2fd())
>> would be a better interface?
> 
> Like bring back FUTEX_FD? (which was removed back in 2.6.25)

Ah, ok, yes something like that.

But as that was removed because of races, but might not be a good idea to bring 
it back.

metze


Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-07 Thread Stefan Metzmacher


Hi André,
>  ** The wait on multiple problem
> 
>  The use case lies in the Wine implementation of the Windows NT interface
>  WaitMultipleObjects. This Windows API function allows a thread to sleep
>  waiting on the first of a set of event sources (mutexes, timers, signal,
>  console input, etc) to signal.  

With that in mind would it be good to have some interaction with epoll (and 
similar calls)?

Instead of having a blocked futex_waitv() waiting on an fd (maybe a generic 
eventfd() or a new futex2fd())
would be a better interface?

Or instead introduce an IORING_OP_FUTEX2_WAITV? Then the futex_waitv logic wait
in an io-wq kernel thread...

I guess the io_uring way would mean we could have that in mind as future 
addition, which can be implemented
later...

metze


Re: [GIT PULL] cifs fixes

2021-02-19 Thread Stefan Metzmacher
Hi Linus,

>> Do you know about the Zen3 status, I was thinking to replace the system
>> by this one with AMD Ryzen 9 5950X:
> 
> I have heard nothing but good things about Zen3 so far (apart from
> apparently people complaining about availability), but it's only been
> out a few months, so obviously coverage is somewhat limited.
> 
> I wish AMD hadn't decimated their Linux team (several years ago), and
> they definitely had some embarrassing issues early on with Zen (apart
> from the Zen 1 stability issues, they've screwed up rdrand at least
> three times, iirc). But I've yet to hear of any Zen 3 issues, and I
> suspect I'll upgrade when Threadripper comes out (I've become quite
> spoiled by the build speeds of my Threadripper 3970X - the only thing
> I miss is the better 'perf' support from Intel PEBS).
> 
> Note that I'm not necessarily the person who would hear about any
> issues first, though, so take the above with a pinch of salt.

Thanks for the hints! While we're waiting for the Ryzen 9 5950X machine
to get ready, I upgraded the Ryzen Threadripper 2950X to a 5.10 kernel
and we didn't had a freeze yet again.

Do you think 5.10 would be good for the Ryzen 9 5950X too?

Thanks!
metze





signature.asc
Description: OpenPGP digital signature


Re: [GIT PULL] cifs fixes

2021-02-12 Thread Stefan Metzmacher
Hi Linus,

>> The machine is running a 'AMD Ryzen Threadripper 2950X 16-Core Processor'
>> and is freezing without any trace every view days.
> 
> I don't think the first-gen Zen issues ever really got solved. There
> were multiple ones, with random segfaults for the early ones (but
> afaik those were fixed by an RMA process with AMD), but the "it
> randomly locks up" ones never had a satisfactory resolution afaik.
> 
> There were lots of random workarounds, but judging by your email:
> 
>> We played with various boot parameters (currently we're using
>> 'mem_encrypt=off rcu_nocbs=0-31 processor.max_cstate=1 idle=nomwait 
>> nomodeset consoleblank=0',
> 
> I suspect you've seen all the bugzilla threads on this issue (kernel
> bugzilla 196683 is probably the main one, but it was discussed
> elsewhere too).

I just found that one, I'll have a closer look at the details in the next days.

> I assume you've updated to latest BIOS and looked at various BIOS
> power management settings too?

No, but I'll have a look at that.

> Zen 2 seems to have fixed things (knock wood - it's certainly working
> for me), But many people obviously never saw any issues with Zen 1
> either.

Do you know about the Zen3 status, I was thinking to replace the system
by this one with AMD Ryzen 9 5950X:
https://www.hetzner.com/dedicated-rootserver/ax101

Thanks!
metze




signature.asc
Description: OpenPGP digital signature


Re: [GIT PULL] cifs fixes

2021-02-12 Thread Stefan Metzmacher
Am 12.02.21 um 21:39 schrieb Steve French:
> Metze/Bjorn,
> Linus is right - samba.org is down for me (I also verified with JRA).
> Any ETA on when it gets back up?
> 
> On Fri, Feb 12, 2021 at 2:05 PM Linus Torvalds
>  wrote:
>>
>> On Fri, Feb 12, 2021 at 10:16 AM Steve French  wrote:
>>>
>>>   git://git.samba.org/sfrench/cifs-2.6.git tags/5.11-rc7-smb3
>>
>> It looks like git.samba.org is feeling very sick and is not answering.
>> Not git, not ping (but maybe icmp ping is blocked).
>>
>> Please give it a kick, or provide some other hosting mirror?


It's online again.

The machine is running a 'AMD Ryzen Threadripper 2950X 16-Core Processor'
and is freezing without any trace every view days.

We played with various boot parameters (currently we're using
'mem_encrypt=off rcu_nocbs=0-31 processor.max_cstate=1 idle=nomwait nomodeset 
consoleblank=0',
with the ubuntu 20.04 5.8 kernel, we also tried 5.4 before), but nothing seems 
to help.

metze




signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-23 Thread Stefan Metzmacher
> They're basically the same concept, it's a subtle difference.
> 
> FRMR = Fast Register Memory Region
> FRWR = Fast Register Work Request
> 
> The memory region is the mr itself, this is created early on.
> 
> The work request is built when actually binding the physical
> pages to the region, and setting the offset, length, etc, which
> is what's happening in the routine that I made the comment on.
> 
> So, for this discussion I chose to say FRWR. Sorry for any
> confusion!

Ah, thanks! Confusion resolved:-)

metze




signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-23 Thread Stefan Metzmacher
> They're basically the same concept, it's a subtle difference.
> 
> FRMR = Fast Register Memory Region
> FRWR = Fast Register Work Request
> 
> The memory region is the mr itself, this is created early on.
> 
> The work request is built when actually binding the physical
> pages to the region, and setting the offset, length, etc, which
> is what's happening in the routine that I made the comment on.
> 
> So, for this discussion I chose to say FRWR. Sorry for any
> confusion!

Ah, thanks! Confusion resolved:-)

metze




signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-23 Thread Stefan Metzmacher
Hi Tom,

>> I just tested that setting:
>>
>> mr->iova &= (PAGE_SIZE - 1);
>> mr->iova |= 0x;
>>
>> after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to
>> work.
> 
> Good! As you know, we were concerned about it after seeing that
> the ib_dma_map_sg() code was unconditionally setting it to the
> dma_mapped address. By salting those 's with varying data,
> this should give your FRWR regions stronger integrity in addition
> to not leaking kernel "addresses" to the wire.

Just wondering... Isn't the thing we use called FRMR?

metze



signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-23 Thread Stefan Metzmacher
Hi Tom,

>> I just tested that setting:
>>
>> mr->iova &= (PAGE_SIZE - 1);
>> mr->iova |= 0x;
>>
>> after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to
>> work.
> 
> Good! As you know, we were concerned about it after seeing that
> the ib_dma_map_sg() code was unconditionally setting it to the
> dma_mapped address. By salting those 's with varying data,
> this should give your FRWR regions stronger integrity in addition
> to not leaking kernel "addresses" to the wire.

Just wondering... Isn't the thing we use called FRMR?

metze



signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-21 Thread Stefan Metzmacher
Hi,

>> +    req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
>> +    if (need_invalidate)
>> +    req->Channel = SMB2_CHANNEL_RDMA_V1;
>> +    req->ReadChannelInfoOffset =
>> +    offsetof(struct smb2_read_plain_req, Buffer);
>> +    req->ReadChannelInfoLength =
>> +    sizeof(struct smbd_buffer_descriptor_v1);
>> +    v1 = (struct smbd_buffer_descriptor_v1 *) >Buffer[0];
>> +    v1->offset = rdata->mr->mr->iova;
> 
> It's unnecessary, and possibly leaking kernel information, to use
> the IOVA as the offset of a memory region which is registered using
> an FRWR. Because such regions are based on the exact bytes targeted
> by the memory handle, the offset can be set to any value, typically
> zero, but nearly arbitrary. As long as the (offset + length) does
> not wrap or otherwise overflow, offset can be set to anything
> convenient.
> 
> Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
> least significant 23 bits, which should guarantee it. The other 41
> bits, party on. You could randomize them, pass some clever identifier
> such as MID sequence, whatever.

I just tested that setting:

mr->iova &= (PAGE_SIZE - 1);
mr->iova |= 0x;

after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to work.

metze



signature.asc
Description: OpenPGP digital signature


Re: [Patch v7 21/22] CIFS: SMBD: Upper layer performs SMB read via RDMA write through memory registration

2018-09-21 Thread Stefan Metzmacher
Hi,

>> +    req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
>> +    if (need_invalidate)
>> +    req->Channel = SMB2_CHANNEL_RDMA_V1;
>> +    req->ReadChannelInfoOffset =
>> +    offsetof(struct smb2_read_plain_req, Buffer);
>> +    req->ReadChannelInfoLength =
>> +    sizeof(struct smbd_buffer_descriptor_v1);
>> +    v1 = (struct smbd_buffer_descriptor_v1 *) >Buffer[0];
>> +    v1->offset = rdata->mr->mr->iova;
> 
> It's unnecessary, and possibly leaking kernel information, to use
> the IOVA as the offset of a memory region which is registered using
> an FRWR. Because such regions are based on the exact bytes targeted
> by the memory handle, the offset can be set to any value, typically
> zero, but nearly arbitrary. As long as the (offset + length) does
> not wrap or otherwise overflow, offset can be set to anything
> convenient.
> 
> Since SMB reads and writes range up to 8MB, I'd suggest zeroing the
> least significant 23 bits, which should guarantee it. The other 41
> bits, party on. You could randomize them, pass some clever identifier
> such as MID sequence, whatever.

I just tested that setting:

mr->iova &= (PAGE_SIZE - 1);
mr->iova |= 0x;

after the ib_map_mr_sg() and before doing the IB_WR_REG_MR, seems to work.

metze



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Am 19.12.2017 um 22:21 schrieb Long Li via samba-technical:
>>  depends on CIFS && INFINIBAND
>> +depends on CIFS=m || INFINIBAND=y
> 
> How about we change them to
> 
> depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
> 
> This makes it easy to read.

I like it :-)

metze




signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Am 19.12.2017 um 22:21 schrieb Long Li via samba-technical:
>>  depends on CIFS && INFINIBAND
>> +depends on CIFS=m || INFINIBAND=y
> 
> How about we change them to
> 
> depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
> 
> This makes it easy to read.

I like it :-)

metze




signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Am 19.12.2017 um 11:56 schrieb Arnd Bergmann via samba-technical:
> On Tue, Dec 19, 2017 at 11:33 AM, Stefan Metzmacher <me...@samba.org> wrote:
>> Hi Arnd,
>>
>>> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
>>> index 500fd69fb58b..3bfc55c08bef 100644
>>> --- a/fs/cifs/Kconfig
>>> +++ b/fs/cifs/Kconfig
>>> @@ -199,6 +199,7 @@ config CIFS_SMB311
>>>  config CIFS_SMB_DIRECT
>>>   bool "SMB Direct support (Experimental)"
>>>   depends on CIFS && INFINIBAND
>>> + depends on CIFS=m || INFINIBAND=y
>>>   help
>>> Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
>>> SMB Direct allows transferring SMB packets over RDMA. If unsure,
>>
>> Is this really correct? Should CIFS_SMB_DIRECT be allowed with:
>>
>> CIFS=n and INFINIBAND=y ???
>> or
>> CIFS=m and INFINIBAND=n ???
>>
>> I guess a more complex logic should be used here
>> or am I missing something?
> 
> The two ones you listed are prohibited by the existing
> 'depends on CIFS && INFINIBAND' dependency.
> 
> We could rephrase the dependency as
> 
> depends on (CIFS=y && INFINIBAND=y) || \
> (CIFS=m && INFINIBAND=y) || \
> (CIFS=m && INFINIBAND=m)
> 
> which has the same effect as
> 
>   depends on CIFS && INFINIBAND
>   depends on CIFS=m || INFINIBAND=y
> 
> but I don't think that adds any clarity.

Thanks for the clarification!

I somehow assumed the patch has been:


-  depends on CIFS && INFINIBAND
+  depends on CIFS=m || INFINIBAND=y

instead of:
   depends on CIFS && INFINIBAND
+  depends on CIFS=m || INFINIBAND=y

metze



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Am 19.12.2017 um 11:56 schrieb Arnd Bergmann via samba-technical:
> On Tue, Dec 19, 2017 at 11:33 AM, Stefan Metzmacher  wrote:
>> Hi Arnd,
>>
>>> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
>>> index 500fd69fb58b..3bfc55c08bef 100644
>>> --- a/fs/cifs/Kconfig
>>> +++ b/fs/cifs/Kconfig
>>> @@ -199,6 +199,7 @@ config CIFS_SMB311
>>>  config CIFS_SMB_DIRECT
>>>   bool "SMB Direct support (Experimental)"
>>>   depends on CIFS && INFINIBAND
>>> + depends on CIFS=m || INFINIBAND=y
>>>   help
>>> Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
>>> SMB Direct allows transferring SMB packets over RDMA. If unsure,
>>
>> Is this really correct? Should CIFS_SMB_DIRECT be allowed with:
>>
>> CIFS=n and INFINIBAND=y ???
>> or
>> CIFS=m and INFINIBAND=n ???
>>
>> I guess a more complex logic should be used here
>> or am I missing something?
> 
> The two ones you listed are prohibited by the existing
> 'depends on CIFS && INFINIBAND' dependency.
> 
> We could rephrase the dependency as
> 
> depends on (CIFS=y && INFINIBAND=y) || \
> (CIFS=m && INFINIBAND=y) || \
> (CIFS=m && INFINIBAND=m)
> 
> which has the same effect as
> 
>   depends on CIFS && INFINIBAND
>   depends on CIFS=m || INFINIBAND=y
> 
> but I don't think that adds any clarity.

Thanks for the clarification!

I somehow assumed the patch has been:


-  depends on CIFS && INFINIBAND
+  depends on CIFS=m || INFINIBAND=y

instead of:
   depends on CIFS && INFINIBAND
+  depends on CIFS=m || INFINIBAND=y

metze



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Hi Arnd,

> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
> index 500fd69fb58b..3bfc55c08bef 100644
> --- a/fs/cifs/Kconfig
> +++ b/fs/cifs/Kconfig
> @@ -199,6 +199,7 @@ config CIFS_SMB311
>  config CIFS_SMB_DIRECT
>   bool "SMB Direct support (Experimental)"
>   depends on CIFS && INFINIBAND
> + depends on CIFS=m || INFINIBAND=y
>   help
> Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
> SMB Direct allows transferring SMB packets over RDMA. If unsure,

Is this really correct? Should CIFS_SMB_DIRECT be allowed with:

CIFS=n and INFINIBAND=y ???
or
CIFS=m and INFINIBAND=n ???

I guess a more complex logic should be used here
or am I missing something?

metze




signature.asc
Description: OpenPGP digital signature


Re: [PATCH] CIFS: SMBD: fix configurations with INFINIBAND=m

2017-12-19 Thread Stefan Metzmacher
Hi Arnd,

> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
> index 500fd69fb58b..3bfc55c08bef 100644
> --- a/fs/cifs/Kconfig
> +++ b/fs/cifs/Kconfig
> @@ -199,6 +199,7 @@ config CIFS_SMB311
>  config CIFS_SMB_DIRECT
>   bool "SMB Direct support (Experimental)"
>   depends on CIFS && INFINIBAND
> + depends on CIFS=m || INFINIBAND=y
>   help
> Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
> SMB Direct allows transferring SMB packets over RDMA. If unsure,

Is this really correct? Should CIFS_SMB_DIRECT be allowed with:

CIFS=n and INFINIBAND=y ???
or
CIFS=m and INFINIBAND=n ???

I guess a more complex logic should be used here
or am I missing something?

metze




signature.asc
Description: OpenPGP digital signature


Re: [Patch v2 00/19] CIFS: Implement SMBDirect

2017-08-29 Thread Stefan Metzmacher
Hi,

>> This is great to see.  Is there a Linux implementation of the server side (in
>> Samba?) so that the client can be tested without needing a Windows server?
> 
> I'm not aware of a Linux implementation on server side.

Here's a very early work in progress branch:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
It only explores how the protocol works, as it uses a userspace
smb-direct proxy (which works around the missing fork support of the
userspace libibverbs), which makes it really slow, but is required
in order to have the code tested in Samba's autobuild.

I think once this code lands in the kernel tree, we'll be able to
arrange a userspace api to it, in order to make a useful implementation,
so we can skip the userspace smb-dorect proxy.

metze



signature.asc
Description: OpenPGP digital signature


Re: [Patch v2 00/19] CIFS: Implement SMBDirect

2017-08-29 Thread Stefan Metzmacher
Hi,

>> This is great to see.  Is there a Linux implementation of the server side (in
>> Samba?) so that the client can be tested without needing a Windows server?
> 
> I'm not aware of a Linux implementation on server side.

Here's a very early work in progress branch:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
It only explores how the protocol works, as it uses a userspace
smb-direct proxy (which works around the missing fork support of the
userspace libibverbs), which makes it really slow, but is required
in order to have the code tested in Samba's autobuild.

I think once this code lands in the kernel tree, we'll be able to
arrange a userspace api to it, in order to make a useful implementation,
so we can skip the userspace smb-dorect proxy.

metze



signature.asc
Description: OpenPGP digital signature


Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

2017-08-14 Thread Stefan Metzmacher
Hi Long,

>> It seems that the new transport is tied to it's caller regarding structures 
>> and
>> naming conventions.
>>
>> I think it would be better to strictly separate them, as I'd like to use the
>> SMBDirect transport also from the userspace for the client side e.g. in
>> Samba's '[lib]smbclient', but also in Samba's server side code 'smbd'.
> 
> Thank you for reviewing the patch set.
> 
> I think it is possible to separate the common code that implements the 
> SMBDirect transport. There are some challenges to reuse the same code for 
> both kernel and user spaces.
> 1. Kernel mode RDMA verbs are similar but different to user-mode ones.
> 2. Some RDMA features (e.g Fast Registration Work Request) are not available 
> in user-mode.
> 3. Locking and synchronization mechanism is different
> 4. Memory management is different.
> 5. Process creation/scheduling and data sharing between processes are 
> different, and there is no user-mode code running in interrupt/softirq.
> 
> Those needs to be abstracted through a layer, the rest of the code can be 
> shared. I can work on this after patch set is reviewed.

I guess this is a misunderstanding...

I don't want to use that code and run it in userspace,
I have a userspace prototype more or less working here, see
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
and
https://git.samba.org/?p=metze/samba/wip.git;a=blob;f=libcli/smb/smb_direct.c;h=9cc0d861ccfcbb4df9ef6ad85a7fe3d262e549c0;hb=85d46de6fdbba041d3e8004af46865a72d2b8405

I goal is that we'll have an api that allows userspace
code to use the kernel code SMBDirect code. This
userspace code would get a file descriptor from the kernel
and would be able to use it similar to a tcp socket.
If the kernel would simulate the 4 byte length header,
it's trivial to get to a stage were smbclient and smbd
are able to support SMBDirect without much changes.
We only need to replace connect(), listen(), accept() and a few more
by SMBDirect versions.

For the real data transfer we might be able to use memfd_create()
or something similar to share the buffers between userspace and kernel.

I guess this is a long way, but having the basic SMBDirect code in
dependently in the kernel would help a lot.

>> Would it be possible to isolate this in
>> smb_direct.c and smb_direct.h while using
>> smb_direct_* prefixes for structures and functions? Also avoiding the usage
>> of other headers from fs/cifs/*.h, expect for something generic like nterr.h.
> 
> Sure I will make naming changes and clean up the header files.

Thanks!

>> I guess 'struct cifs_rdma_info' would then be 'struct smb_direct_connection'.
>> And it won't have a reference to struct TCP_Server_Info.
> 
> I will look for ways to remove reference to struct TCP_Server_Info . The 
> reason why it has a reference to TCP_Server_Info is that: TCP_Server_Info 
> represents a transport connection, although it also has many other TCP 
> related code. SMBD needs to get to this connection TCP_Server_Info and set 
> the transport status on shutdown (and maybe other situations).
> 

Wouldn't it be better to provide a way to ask for the connection state
and let the caller ask for the state instead of changing the callers
structure?

metze



signature.asc
Description: OpenPGP digital signature


Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

2017-08-14 Thread Stefan Metzmacher
Hi Long,

>> It seems that the new transport is tied to it's caller regarding structures 
>> and
>> naming conventions.
>>
>> I think it would be better to strictly separate them, as I'd like to use the
>> SMBDirect transport also from the userspace for the client side e.g. in
>> Samba's '[lib]smbclient', but also in Samba's server side code 'smbd'.
> 
> Thank you for reviewing the patch set.
> 
> I think it is possible to separate the common code that implements the 
> SMBDirect transport. There are some challenges to reuse the same code for 
> both kernel and user spaces.
> 1. Kernel mode RDMA verbs are similar but different to user-mode ones.
> 2. Some RDMA features (e.g Fast Registration Work Request) are not available 
> in user-mode.
> 3. Locking and synchronization mechanism is different
> 4. Memory management is different.
> 5. Process creation/scheduling and data sharing between processes are 
> different, and there is no user-mode code running in interrupt/softirq.
> 
> Those needs to be abstracted through a layer, the rest of the code can be 
> shared. I can work on this after patch set is reviewed.

I guess this is a misunderstanding...

I don't want to use that code and run it in userspace,
I have a userspace prototype more or less working here, see
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
and
https://git.samba.org/?p=metze/samba/wip.git;a=blob;f=libcli/smb/smb_direct.c;h=9cc0d861ccfcbb4df9ef6ad85a7fe3d262e549c0;hb=85d46de6fdbba041d3e8004af46865a72d2b8405

I goal is that we'll have an api that allows userspace
code to use the kernel code SMBDirect code. This
userspace code would get a file descriptor from the kernel
and would be able to use it similar to a tcp socket.
If the kernel would simulate the 4 byte length header,
it's trivial to get to a stage were smbclient and smbd
are able to support SMBDirect without much changes.
We only need to replace connect(), listen(), accept() and a few more
by SMBDirect versions.

For the real data transfer we might be able to use memfd_create()
or something similar to share the buffers between userspace and kernel.

I guess this is a long way, but having the basic SMBDirect code in
dependently in the kernel would help a lot.

>> Would it be possible to isolate this in
>> smb_direct.c and smb_direct.h while using
>> smb_direct_* prefixes for structures and functions? Also avoiding the usage
>> of other headers from fs/cifs/*.h, expect for something generic like nterr.h.
> 
> Sure I will make naming changes and clean up the header files.

Thanks!

>> I guess 'struct cifs_rdma_info' would then be 'struct smb_direct_connection'.
>> And it won't have a reference to struct TCP_Server_Info.
> 
> I will look for ways to remove reference to struct TCP_Server_Info . The 
> reason why it has a reference to TCP_Server_Info is that: TCP_Server_Info 
> represents a transport connection, although it also has many other TCP 
> related code. SMBD needs to get to this connection TCP_Server_Info and set 
> the transport status on shutdown (and maybe other situations).
> 

Wouldn't it be better to provide a way to ask for the connection state
and let the caller ask for the state instead of changing the callers
structure?

metze



signature.asc
Description: OpenPGP digital signature


Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

2017-08-08 Thread Stefan Metzmacher
Hi Li,

thanks for providing this patchset, I guess it will be a huge win to
have SMBDirect support for the kernel client!

> Define a new structure for SMBD transport. This stucture will have all the
> information on the transport, and it will be stored in the current SMB 
> session.
...
> +/*
> + * The context for the SMBDirect transport
> + * Everything related to the transport is here. It has several
logical parts
> + * 1. RDMA related structures
> + * 2. SMBDirect connection parameters
> + * 3. Reassembly queue for data receive path
> + * 4. mempools for allocating packets
> + */
> +struct cifs_rdma_info {
> + struct TCP_Server_Info *server_info;
> +
> + // for debug purposes
> + unsigned int count_receive_buffer;
> + unsigned int count_get_receive_buffer;
> + unsigned int count_put_receive_buffer;
> + unsigned int count_send_empty;
> +};
> +#endif
>

It seems that the new transport is tied to it's caller
regarding structures and naming conventions.

I think it would be better to strictly separate them,
as I'd like to use the SMBDirect transport also from the
userspace for the client side e.g. in Samba's '[lib]smbclient',
but also in Samba's server side code 'smbd'.

Would it be possible to isolate this in
smb_direct.c and smb_direct.h while using
smb_direct_* prefixes for structures and
functions? Also avoiding the usage of other headers
from fs/cifs/*.h, expect for something generic like
nterr.h.

I guess 'struct cifs_rdma_info' would then be
'struct smb_direct_connection'. And it won't
have a reference to struct TCP_Server_Info.

It the strict layering is too much change,
I'd at least like to have the name changes.

This should relatively easy to do by using somthing like

git format-patch --stdout -37 > before

cat before | sed \
-e 's!struct cifs_rdma_info!struct smb_direct_connection!g' \
-e 's!cifsrdma\.h!smb_direct.h!g' \
-e 's!cifsrdma\.c!smb_direct.c!g' \
> after

git reset --hard HEAD~37
git am after

metze


Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

2017-08-08 Thread Stefan Metzmacher
Hi Li,

thanks for providing this patchset, I guess it will be a huge win to
have SMBDirect support for the kernel client!

> Define a new structure for SMBD transport. This stucture will have all the
> information on the transport, and it will be stored in the current SMB 
> session.
...
> +/*
> + * The context for the SMBDirect transport
> + * Everything related to the transport is here. It has several
logical parts
> + * 1. RDMA related structures
> + * 2. SMBDirect connection parameters
> + * 3. Reassembly queue for data receive path
> + * 4. mempools for allocating packets
> + */
> +struct cifs_rdma_info {
> + struct TCP_Server_Info *server_info;
> +
> + // for debug purposes
> + unsigned int count_receive_buffer;
> + unsigned int count_get_receive_buffer;
> + unsigned int count_put_receive_buffer;
> + unsigned int count_send_empty;
> +};
> +#endif
>

It seems that the new transport is tied to it's caller
regarding structures and naming conventions.

I think it would be better to strictly separate them,
as I'd like to use the SMBDirect transport also from the
userspace for the client side e.g. in Samba's '[lib]smbclient',
but also in Samba's server side code 'smbd'.

Would it be possible to isolate this in
smb_direct.c and smb_direct.h while using
smb_direct_* prefixes for structures and
functions? Also avoiding the usage of other headers
from fs/cifs/*.h, expect for something generic like
nterr.h.

I guess 'struct cifs_rdma_info' would then be
'struct smb_direct_connection'. And it won't
have a reference to struct TCP_Server_Info.

It the strict layering is too much change,
I'd at least like to have the name changes.

This should relatively easy to do by using somthing like

git format-patch --stdout -37 > before

cat before | sed \
-e 's!struct cifs_rdma_info!struct smb_direct_connection!g' \
-e 's!cifsrdma\.h!smb_direct.h!g' \
-e 's!cifsrdma\.c!smb_direct.c!g' \
> after

git reset --hard HEAD~37
git am after

metze