On 04/02, David Laight wrote: > On Wed, 2 Apr 2025 14:21:35 -0700 > Stanislav Fomichev <stfomic...@gmail.com> wrote: > > > On 04/02, David Laight wrote: > > > On Wed, 2 Apr 2025 07:19:46 -0700 > > > Stanislav Fomichev <stfomic...@gmail.com> wrote: > > > > > > > On 04/02, David Laight wrote: > > > > > On Wed, 2 Apr 2025 00:53:58 +0200 > > > > > Stefan Metzmacher <me...@samba.org> wrote: > > > > > > > > > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev: > > > > > > > On 04/01, Stefan Metzmacher wrote: > > > > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev: > > > > > > >>> On 04/01, Breno Leitao wrote: > > > > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher > > > > > > >>>> wrote: > > > > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > > > > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > > > > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: > > > > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote: > > > > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET > > > > > > >>>>>>>>> limitation > > > > > > >>>>>>>>> from io_uring_cmd_getsockopt(). > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> The reason for this limitation is that > > > > > > >>>>>>>>> io_uring_cmd_getsockopt() > > > > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt() > > > > > > >>>>>>>>> and can't reach the ops->getsockopt() path. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> The first idea would be to change the optval and optlen > > > > > > >>>>>>>>> arguments > > > > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that > > > > > > >>>>>>>>> is already used for setsockopt() and also by > > > > > > >>>>>>>>> do_sock_getsockopt() > > > > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different > > > > > > >>>>>>>>> approach. > > > > > > >>>>>>>>> > > > > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you? > > > > > > >>>>>>>>> > > > > > > >>>>>>>> > > > > > > >>>>>>>> [..] > > > > > > >>>>>>>> > > > > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer, > > > > > > >>>>>>>>> we only ever pass a kernel pointer and do the > > > > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt(). > > > > > > >>>>>>>>> > > > > > > >>>>>>>> > > > > > > >>>>>>>> At this point why not just fully embrace iov_iter? You > > > > > > >>>>>>>> have the size > > > > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do > > > > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion? > > > > > > >>>>>>> > > > > > > >>>>>>> I think that would only be possible if we introduce > > > > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the > > > > > > >>>>>>> implementations > > > > > > >>>>>>> step by step. Doing it all in one go has a lot of potential > > > > > > >>>>>>> to break > > > > > > >>>>>>> the uapi. I could try to convert things like socket, ip and > > > > > > >>>>>>> tcp myself, but > > > > > > >>>>>>> the rest needs to be converted by the maintainer of the > > > > > > >>>>>>> specific protocol, > > > > > > >>>>>>> as it needs to be tested. As there are crazy things > > > > > > >>>>>>> happening in the existing > > > > > > >>>>>>> implementations, e.g. some getsockopt() implementations use > > > > > > >>>>>>> optval as in and out > > > > > > >>>>>>> buffer. > > > > > > >>>>>>> > > > > > > >>>>>>> I first tried to convert both optval and optlen of > > > > > > >>>>>>> getsockopt to sockptr_t, > > > > > > >>>>>>> and that showed that touching the optval part starts to get > > > > > > >>>>>>> complex very soon, > > > > > > >>>>>>> see > > > > > > >>>>>>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > > > > >>>>>>> (note it didn't converted everything, I gave up after > > > > > > >>>>>>> hitting > > > > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, > > > > > > >>>>>>> sctp_getsockopt_associnfo and maybe > > > > > > >>>>>>> more are the ones also doing both copy_from_user and > > > > > > >>>>>>> copy_to_user on optval) > > > > > > >>>>>>> > > > > > > >>>>>>> I come also across one implementation that returned -ERANGE > > > > > > >>>>>>> because *optlen was > > > > > > >>>>>>> too short and put the required length into *optlen, which > > > > > > >>>>>>> means the returned > > > > > > >>>>>>> *optlen is larger than the optval buffer given from > > > > > > >>>>>>> userspace. > > > > > > >>>>>>> > > > > > > >>>>>>> Because of all these strange things I tried to do a minimal > > > > > > >>>>>>> change > > > > > > >>>>>>> in order to get rid of the io_uring limitation and only > > > > > > >>>>>>> converted > > > > > > >>>>>>> optlen and leave optval as is. > > > > > > >>>>>>> > > > > > > >>>>>>> In order to have a patchset that has a low risk to cause > > > > > > >>>>>>> regressions. > > > > > > >>>>>>> > > > > > > >>>>>>> But as alternative introducing a prototype like this: > > > > > > >>>>>>> > > > > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int > > > > > > >>>>>>> level, int optname, > > > > > > >>>>>>> struct iov_iter > > > > > > >>>>>>> *optval_iter); > > > > > > >>>>>>> > > > > > > >>>>>>> That returns a non-negative value which can be placed into > > > > > > >>>>>>> *optlen > > > > > > >>>>>>> or negative value as error and *optlen will not be changed > > > > > > >>>>>>> on error. > > > > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be > > > > > > >>>>>>> written to. > > > > > > >>>>>>> > > > > > > >>>>>>> Implementations could then opt in for the new interface and > > > > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case, > > > > > > >>>>>>> while all others would still get -EOPNOTSUPP. > > > > > > >>>>>>> > > > > > > >>>>>>> So what should be the way to go? > > > > > > >>>>>> > > > > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see > > > > > > >>>>>> below, > > > > > > >>>>>> but the first part I wanted to convert was > > > > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from > > > > > > >>>>>> userspace before > > > > > > >>>>>> writing. > > > > > > >>>>>> > > > > > > >>>>>> So we could go with the optlen_t approach, or we need > > > > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC > > > > > > >>>>>> and one > > > > > > >>>>>> with ITER_DEST... > > > > > > >>>>>> > > > > > > >>>>>> So who wants to decide? > > > > > > >>>>> > > > > > > >>>>> I just noticed that it's even possible in same cases > > > > > > >>>>> to pass in a short buffer to optval, but have a longer value > > > > > > >>>>> in optlen, > > > > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores > > > > > > >>>>> optlen. > > > > > > >>>>> > > > > > > >>>>> This makes it really hard to believe that trying to use > > > > > > >>>>> iov_iter for this > > > > > > >>>>> is a good idea :-( > > > > > > >>>> > > > > > > >>>> That was my finding as well a while ago, when I was planning > > > > > > >>>> to get the > > > > > > >>>> __user pointers converted to iov_iter. There are some weird > > > > > > >>>> ways of > > > > > > >>>> using optlen and optval, which makes them non-trivial to > > > > > > >>>> covert to > > > > > > >>>> iov_iter. > > > > > > >>> > > > > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should > > > > > > >>> cover +90% > > > > > > >>> of useful socket opts. See if there are any obvious problems > > > > > > >>> with them > > > > > > >>> and if not, try converting. The rest we can cover separately > > > > > > >>> when/if > > > > > > >>> needed. > > > > > > >> > > > > > > >> That's what I tried, but it fails with > > > > > > >> tcp_getsockopt -> > > > > > > >> do_tcp_getsockopt -> > > > > > > >> tcp_ao_get_mkts -> > > > > > > >> tcp_ao_copy_mkts_to_user -> > > > > > > >> copy_struct_from_sockptr > > > > > > >> tcp_ao_get_sock_info -> > > > > > > >> copy_struct_from_sockptr > > > > > > >> > > > > > > >> That's not possible with a ITER_DEST iov_iter. > > > > > > >> > > > > > > >> metze > > > > > > > > > > > > > > Can we create two iterators over the same memory? One for > > > > > > > ITER_SOURCE and > > > > > > > another for ITER_DEST. And then make getsockopt_iter accept > > > > > > > optval_in and > > > > > > > optval_out. We can also use optval_out position (iov_offset) as > > > > > > > optlen output > > > > > > > value. Don't see why it won't work, but I agree that's gonna be a > > > > > > > messy > > > > > > > conversion so let's see if someone else has better suggestions. > > > > > > > > > > > > > > > > > > > Yes, that might work, but it would be good to get some feedback > > > > > > if this would be the way to go: > > > > > > > > > > > > int (*getsockopt_iter)(struct socket *sock, > > > > > > int level, int optname, > > > > > > struct iov_iter *optval_in, > > > > > > struct iov_iter *optval_out); > > > > > > > > > > > > And *optlen = optval_out->iov_offset; > > > > > > > > > > > > Any objection or better ideas? Linus would that be what you had in > > > > > > mind? > > > > > > > > > > I'd worry about performance - yes I know 'iter' are used elsewhere > > > > > but... > > > > > Also look at the SCTP code. > > > > > > > > Performance usually does not matter for set/getsockopts, there > > > > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) > > > > > > That might be the one that is really horrid and completely abuses > > > the 'length' parameter. > > > > It is reading and writing, yes, but it's not a huge problem. And it > > does enforce the optlen (to copy back the same amount of bytes). It's > > not that bad, it's just an example of where we need to be extra > > careful. > > > > > > and maybe recent > > > > devmem sockopts; we can special-case these if needed, or keep sockptr_t, > > > > idk. I'm skeptical we can convert everything though, that's why the > > > > suggestion to start with sk/ip/tcp/udp. > > > > > > > > > How do you handle code that wants to return an updated length (often > > > > > longer > > > > > than the one provided) and an error code (eg ERRSIZE or similar). > > > > > > > > > > There is also a very strange use (I think it is a sockopt rather than > > > > > an ioctl) > > > > > where the buffer length the application provides is only that of the > > > > > header. > > > > > The actual buffer length is contained in the header. > > > > > The return length is the amount written into the full buffer. > > > > > > > > Let's discuss these special cases as they come up? Worst case these > > > > places can always re-init iov_iter with a comment on why it is ok. > > > > But I do agree in general that there are a few places that do wild > > > > stuff. > > > > > > The problem is that the generic code has to deal with all the 'wild > > > stuff'. > > > > getsockopt_iter will have optval_in for the minority of socket options > > (like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well > > as optval_out. The latter is what the majority of socket options > > will use to write their value. That doesn't seem too complicated to > > handle? > > > > > It is also common to do non-sequential accesses - so iov_iter doesn't > > > match > > > at all. > > > > I disagree that it's 'common'. Searching for copy_from_sockptr_offset > > returns a few cases and they are mostly using read-with-offset because > > there is no sequential read (iterator) semantics with sockptr_t. > > > > > There also isn't a requirement for scatter-gather. > > > > > > For 'normal' getsockopt (and setsockopt) with short lengths it actually > > > makes > > > sense for the syscall wrapper to do the user copies. > > > But it would need to pass the user ptr+len as well as the kernel ptr+len > > > to give the required flexibilty. > > > Then you have to work out whether the final copy to user is needed or not. > > > (not that hard, but it all adds complication). > > > > Not sure I understand what's the problem. The user vs kernel part will > > be abstracted by iov_iter. The callers will have to write the optlen > > back. And there are two call sites we care about: io_uring and regular > > system call. What's your suggestion? Maybe I'm missing something. Do you > > prefer get_optlen/put_optlen? > > I think the final aim should be to pass the user supplied length to the > per-protocol code and have it return the length/error to be passed back to the > user.
Like what Stefan's patch 3 is doing? Or you're suggesting to change getsockopt handlers to handle length more explicitly? If we were to proceed with sockptr to iov_iter conversion we'll have to do it anyway (or pass the length as the size of iov_iter). > But in a lot of cases the syscall wrapper can do the buffer copies (as well > as the length copies). > That would be restricted to short length (on stack). > So code that needed a long buffer (like some of the sctp options) > would need to directly access the user buffer (or a long buffer provided > by an in-kernel user). This sounds similar to what we did with bpf hooks - copy (head of) the buffer and run bpf program on top of it. I remember iptables setsockopt begin problematic because of its huge size.. It is an option, yes (to convert protocol handler to kernel memory mostly). > But you'll find code that reads/writes well beyond the apparent size of > the user buffer. > (And not just code that accesses 4 bytes without checking the length). With can start with getsockopt_iter + sk_getsockopt to see if there are any issues with that approach. If not, adding ip/tcp/udp to the mix should be doable. We can explain and comment on special cases if needed. When other protocols are needed from io_uring, we can convert more. But at least the new code will use the correct abstractions.