Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>>> To the extent that it's impossible, it's impossible only because >>> the APIs provided by the kernel to userland don't have any way to >>> represent such a thing. This would border on trivial to fix, >>> except that it would be difficult to get much userland code to use >>> the resulting APIs because of their lack of portability to other >>> UNIX variants. > Since write(2) is one of the oldest interfaces in Unix the chances of > any change taking hold are vanishingly small... Oh, I'm not suggesting that write(2) change. What I'm suggesting is the creation of some alternative interface, write_extended(2) let's call it for the sake of concreteness[%], which is just like write(2) except that it _does_ provide some way to unambiguously return "wrote N, then error E". (Exactly how is pretty much irrelevant; I'm sure practically everyone here can imagine plenty of possible alternatives. If it comes to arguing choices for that, I'd paint the bikeshed a dark forest green.) But write(2) would continue to exist, with more or less its traditional semantics. Only if - and only when - write_extended becomes so popular that nobody uses plain write(2) any longer would I propose removing it. [%] If anything like this happens I certainly hope someone will invent a better name. But userland uptake for write_extended will be minimal, especially initially; that's the portability issue I was talking about. > All of this is not _independent_ of fixing uiomove callers, [...], > but it's largely orthogonal to the original problem of incorrectly > rolling back partial uiomoves. :-( Agreed. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
On Wed, May 10, 2023 at 10:11:31AM +, Taylor R Campbell wrote: > > > (In general, erroring in I/O is a whole additional can of worms; it's > > > wrong to not report back however much work happened before the error > > > when it can't be undone, but also impossible to both report work and > > > raise an error. [...]) > > > > To the extent that it's impossible, it's impossible only because the > > APIs provided by the kernel to userland don't have any way to represent > > such a thing. This would border on trivial to fix, except that it > > would be difficult to get much userland code to use the resulting APIs > > because of their lack of portability to other UNIX variants. Since write(2) is one of the oldest interfaces in Unix the chances of any change taking hold are vanishingly small... > write/writev/pwrite(2) could return the number of bytes it actually > did transfer before the fault, unless it is zero in which case it > could fail with EFAULT. Yes; this is really the only correct option. The same goes for EIO, and also other conditions that can arise in the middle of an I/O, like EDQUOT, or given nfs, ECONNRESET. Ultimately if you actually do some I/O the count reflecting what you did needs to be returned. Otherwise any application response other than "give up entirely" can result in duplicating or losing data. Since there are at least some errors that aren't sticky (that is, might not repeat if the application restarts the rest of the I/O, and not repeating doesn't automatically mean that the condition cleared and no longer matters), you really also want to not lose them. As I recall the conclusion from past discussions is that the only really suitable approach, messy though it is, is to stash the error in the struct file (or somewhere) and then return it immediately on the next operation. This applies at least to EIO and maybe others. (For EFAULT, if it doesn't repeat it means the condition cleared, and this is expected in the case of a generational G/C's write barriers and similar usages. So just dropping it in the case of partial success is ok.) Note though that returning short counts without an error can also violate expectations about short counts, interruption, and fast vs. slow devices, but none of that is very well defined so it is almost certainly a better compromise than returning an error after successfully doing I/O. The question that comes to my mind is whether it's better as an interface for writes (and reads) to return errors along with a uio_resid change, or whether we should distinguish failure of the operation (stuff like ENXIO/ENODEV or possibly ESTALE) from failure of the I/O, and return the latter separately in struct uio. The latter seems kind of preferable to me (because it maintains the expectation that if you get a failure return nothing happened) but it's a big architectural change. All of this is not _independent_ of fixing uiomove callers, because they're where EIO gets generated and some of them have some flexibility, and so we need a model for the error handling to update them correctly, but it's largely orthogonal to the original problem of incorrectly rolling back partial uiomoves. :-( (Examples of flexibility: if you copy 32 bytes out of a uio into a scratch buffer and then get EFAULT, you haven't actually done anything yet, so you can leave uio_resid alone and let the enclosing write call ultimately fail with EFAULT. But if you copied the same 32 bytes over a buffer-cache buffer, for example, you can't undo that so you have to report it. Conversely, if you read 32 bytes out of a tty you can't put it back and have to report it, but if you read 32 bytes out of a buffer cache buffer it'll still be there later and you haven't actually done anything yet.) > Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not > sure it's right to fail after partial progress with _any_ error code, > except possibly EIO if metadata updates failed. It's not. (and I don't think EIO is an exception) > I don't see clear guidance in POSIX on the subject. shocking :-) > The attached test program successfully writes 65536 bytes to a pipe in > a single writev(2) call that fails and reports no progress, in a > NetBSD 10ish kernel: That's definitely a bug... -- David A. Holland dholl...@netbsd.org
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
If we think too hard about making this precise we'll wind up with names like uiosimulateviolatingnocloningtheorem and uiowavecollapse. uioskip serves either to discard data without consuming it, or to advance the pointer after using the data with uiopeek. peek is in contrast to get, like bufq_get/peek -- it's nondestructive of the queue/transfer. Probably won't be used much in the UIO_READ direction (i.e., writing data into a user's buffer) but who knows. None of these opposites are perfect: bufq_poke would be bonkers, memcpy vs memmove is hopeless, and nothing here actually destroys the source object like C++ move semantics vs copy semantics which more accurately reflect the physical meanings of the English words. I'm more interested in how the nuclear reactor is going to work!
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>> - uiopeek leaves uio itself untouched ([...]). > Hmâ?? Iâ??m having second thoughts about uiopeek(), as well. It implies a d$ That is a good point. But would it be a problem to have uiopeek (works only to move from uio to buffer) and uiopoke (the other way)? I've never liked the way uiomove can move data one direction or the other depending on how the uio is set up. (I'd rather have uioread and uiowrite except that I'd be constantly trying to remember whether uioread reads from the uio or moves data in the direction a read() operation does.) Maybe uioget and uioput? Is there any significant amount of code that calls uiomove without knowing which direction the bits are moving? As for uiocopy versus uiomove, those are similar enough to memcpy and memmove that the implication feels to me more like "buffers can overlap" (for all that that is a highly unlikely use case) or some such. Finding good names is a mess. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
On 2023-05-10 14:00, Jason Thorpe wrote: On May 9, 2023, at 3:09 PM, Taylor R Campbell wrote: - uiopeek leaves uio itself untouched (it may modify the source/target buffers but it's idempotent). Hm… I’m having second thoughts about uiopeek(), as well. It implies a direction (“peek” feels like “read”, and “write” would feel more like a “poke”). I think uiocopy() is a better name, and I think it is sufficiently different from uiomove() (“move” implies a sortof destructive-ness that “copy” does not). I would sortof agree. But for me, "peek" more suggests that you are looking at the content, but intentionally leaving the data in there for something else to later pick up/process. copy seems to more appropriately describe what the intent is. Also, skip for also implies that you are skipping over the data, intentionally not interested in it. After a copy, I would feel it would be more describing to say advance rather than skip (or if someone else have another good verb for it, I'm all ears...) Johnny -- Johnny Billquist || "I'm on a bus || on a psychedelic trip email: b...@softjar.se || Reading murder books pdp is alive! || tryin' to stay hip" - B. Idol
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> On May 9, 2023, at 3:09 PM, Taylor R Campbell > wrote: > > - uiopeek leaves uio itself untouched (it may modify the source/target > buffers but it's idempotent). Hm… I’m having second thoughts about uiopeek(), as well. It implies a direction (“peek” feels like “read”, and “write” would feel more like a “poke”). I think uiocopy() is a better name, and I think it is sufficiently different from uiomove() (“move” implies a sort of destructive-ness that “copy” does not). -- thorpej
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> Date: Tue, 9 May 2023 23:03:27 -0400 (EDT) > From: Mouse > > > (In general, erroring in I/O is a whole additional can of worms; it's > > wrong to not report back however much work happened before the error > > when it can't be undone, but also impossible to both report work and > > raise an error. [...]) > > To the extent that it's impossible, it's impossible only because the > APIs provided by the kernel to userland don't have any way to represent > such a thing. This would border on trivial to fix, except that it > would be difficult to get much userland code to use the resulting APIs > because of their lack of portability to other UNIX variants. write/writev/pwrite(2) could return the number of bytes it actually did transfer before the fault, unless it is zero in which case it could fail with EFAULT. It already behaves this way for EINTR and EAGAIN/EWOULDBLOCK -- but not for EFAULT: https://nxr.netbsd.org/xref/src/sys/kern/sys_generic.c?r=1.134#478 Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not sure it's right to fail after partial progress with _any_ error code, except possibly EIO if metadata updates failed. I don't see clear guidance in POSIX on the subject. - The EINTR and EAGAIN/EWOULDBLOCK behaviour is mandated: `If write() is interrupted by a signal before it writes any data, it shall return -1 with errno set to [EINTR].' `Write requests to a pipe or FIFO ... If the O_NONBLOCK flag is set A write request for more than {PIPE_BUF} bytes ... When at least one byte can be written, transfer what it can and return the number of bytes written.' - Without O_NONBLOCK, the DESCRIPTION section says `on normal completion it shall return nbyte', but a fault is not normal completion. - The RATIONALE section says `Partial and deferred writes are only possible with O_NONBLOCK set' and `There is no exception regarding partial writes when O_NONBLOCK is set', but that's inconsistent with the behaviour mandated for interruption by a signal. https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html The attached test program successfully writes 65536 bytes to a pipe in a single writev(2) call that fails and reports no progress, in a NetBSD 10ish kernel: $ ./writefault | cat >foo space=65536 nwrit=-1 errno=14 $ stat -f %z foo 65536 A similar program on macOS 13.3 (using hard-coded space instead of ioctl(FIONSPACE) which seems to be missing from macOS) exhibits the same behaviour. But a similar program on Linux 4.15 (using fpathconf(_PC_PIPE_BUF)) returns the partial number of bytes written (and anything less than fpathconf(_PC_PIPE_BUF) fails without writing anything to foo). _If_ write/writev/pwrite(2) were modified to handle EFAULT like it currently handles EINTR and EAGAIN/EWOULDBLOCK and return partial progress, then the current uiomove(9) semantics would give the wrong indication of progress in the event of fault -- too much progress, rather than too little. In contrast, uiopeek/uioskip would allow pipe_write to report exactly the number of bytes that it actually made available to the reader. #include #include #include #include #include #include #include #include #include #include #define HOWMANY(X, N) (((X) + ((N) - 1))/(N)) int main(void) { const int PAGE_SIZE = sysconf(_SC_PAGESIZE); int space; struct iovec iov[2]; size_t allocsize; void *pg; ssize_t nwrit; if (ioctl(STDOUT_FILENO, FIONSPACE, &space) == -1) err(1, "space"); space *= 4; /* BIG_PIPE_SIZE */ fprintf(stderr, "space=%d\n", space); allocsize = (HOWMANY(space, PAGE_SIZE) + 1) * PAGE_SIZE; pg = mmap(NULL, allocsize, PROT_READ|PROT_WRITE, MAP_ANON, -1, 0); if (pg == MAP_FAILED) err(1, "mmap"); memset(pg, 0x53, allocsize); if (mprotect(pg + allocsize - PAGE_SIZE, PAGE_SIZE, PROT_NONE) == -1) err(1, "mprotect"); iov[0].iov_base = pg; iov[0].iov_len = space; iov[1].iov_base = pg + allocsize - PAGE_SIZE; iov[1].iov_len = 1; errno = 0; nwrit = writev(STDOUT_FILENO, iov, __arraycount(iov)); fprintf(stderr, "nwrit=%zd errno=%d\n", nwrit, errno); return 0; }
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> (In general, erroring in I/O is a whole additional can of worms; it's > wrong to not report back however much work happened before the error > when it can't be undone, but also impossible to both report work and > raise an error. [...]) To the extent that it's impossible, it's impossible only because the APIs provided by the kernel to userland don't have any way to represent such a thing. This would border on trivial to fix, except that it would be difficult to get much userland code to use the resulting APIs because of their lack of portability to other UNIX variants. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
On Tue, May 09, 2023 at 09:21:38PM +, Taylor R Campbell wrote: > I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which > together are equivalent to successful uiomove(buf, n, uio). > > This allows a driver to separately: > 1. transfer data into a buffer (uiopeek) first, and then > 2. advance the uio (uioskip) only after determining how much of the >data the driver was able to actually process. > > Thoughts? I also prefer the name "uioadvance", because that seems like it characterizes the operation better. Otherwise, seems fine. > (a) Is ttwrite violating a contract by advancing the uio's iovs past > the number of bytes it consumed of uio_resid? Yes, IMO. That's supposed to be an invariant of the data structure, and it shouldn't get broken, even on error. > P.S. I suspect this issue will also apply in the EFAULT case. If you > try to uiomove (say) 2 bytes in separate iov entries, and the > first iov doesn't fault but the second iov does, uiomove will > advance the uio by one byte but return error. That also seems totally wrong. (In general, erroring in I/O is a whole additional can of worms; it's wrong to not report back however much work happened before the error when it can't be undone, but also impossible to both report work and raise an error. And unfortunately in the presence of things like generational garbage collectors, it's unsafe to assume that it doesn't matter because EFAULT is fatal/unrecoverable.) -- David A. Holland dholl...@netbsd.org
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>> I'm not a fan of uioskip() as a name - [...] > I agree. "skip" seem to have wrong connotations (cf. dd(1)). I'm not sure I agree. (The dd analogy is weak; dd has no peek operation, and uioskip - under whatever name - would border on useless without uiopeek.) For uioskip, it _is_ skip semantics. The bytes skipped are not copied anywhere, not by uioskip. (They may have been copied earlier with uiopeek, but that doesn't affect what uioskip does.) If you want skip-*with*-copy, well, uiomove is still there. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
On Tue, May 09, 2023 at 14:33:26 -0700, Jason Thorpe wrote: > I'm not a fan of uioskip() as a name - I think uioadvance() would be > better. Skip implies, to my brain, that the data is being thrown > away (even if you're already consumed it). I agree. "skip" seem to have wrong connotations (cf. dd(1)). -uwe
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> On May 9, 2023, at 2:21 PM, Taylor R Campbell > wrote: > > tl;dr > > I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which I’m not a fan of uioskip() as a name … I think uioadvance() would be better. Skip implies, to my brain, that the data is being thrown away (even if you’re already consumed it). -- thorpej
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which > together are equivalent to successful uiomove(buf, n, uio). For what it may be worth, I like this. I don't _think_ I've ever run into issues caused by this issue before, but I have trouble seeing it as anything other than a bug waiting to happen. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
PROPOSAL: Split uiomove into uiopeek, uioskip
tl;dr I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which together are equivalent to successful uiomove(buf, n, uio). This allows a driver to separately: 1. transfer data into a buffer (uiopeek) first, and then 2. advance the uio (uioskip) only after determining how much of the data the driver was able to actually process. Thoughts? DETAILS A common pattern when transferring data between a driver and userland is unlocked uiomove(9) into a temporary buffer, and then some I/O operation under a driver lock. The uiomove(9) itself normally can't be done under a driver lock because it may fault and sleep indefinitely for I/O. However, sometimes the I/O operation doesn't consume everything. This can happen in ttwrite (sys/kern/tty.c): in the non-blocking case (flags & IO_NDELAY), it may bite off a chunk of the uio bigger than the tty can chew without blocking. In that case, ttwrite hacks uio->uio_resid so that the write(2) system call reports the correct number of bytes writen so the user program can pick up where it left off (a hack that has existed since 1990 or earlier) -- but the uio itself becomes inconsistent, because it doesn't roll back uio->uio_iov or uio->uio_iov[0].iov_base or anything else. And ktrwrite (sys/kern/kern_ktrace.c) tries to reuse the uio in the event that fp->f_ops->fo_write returns EWOULDBLOCK, after pausing for a moment. This requires the uio to be consistent, which falls apart with ttwrite, as in a litany of syzbot reports like this one: https://syzkaller.appspot.com/bug?id=976210ed438d48ac275d77d7ebf4a086e43b5fcb Now, ktracing to a tty is pretty weird, but the salient point is: (a) Is ttwrite violating a contract by advancing the uio's iovs past the number of bytes it consumed of uio_resid? or (b) Is ktrwrite abusing struct fileops::fo_write by reusing uio even after an error? If (a), the attached patch breaks ttwrite's uiomove into two parts: uiopeek, to get the data into a temporary buffer, which may be then fed byte-by-byte into ttyoutput; and uioskip, to advance the uio by as many bytes as were actually consumed by the tty -- all of them in the success case, but not all if we stop with EWOULDBLOCK. If (b), we have to convince ktrwrite to reconstruct a uio that picks up where fo_write left off in the same buffer. Maybe we could reconstruct it but skip the first n bytes, or take a snapshot of the uio, but either way it would be annoying to deal with -- more fiddly than writing, using, and testing uiopeek/uioskip which I already did. The problem appears not to manifest in OpenBSD or FreeBSD because they have nothing that calls fo_write in a loop with the same uio -- their ktrace is limited to vnodes and doesn't handle EWOULDBLOCK by looping. But their ttwrite routines (or equivalent) do leave the uio inconsistent on error, with the same comment Mike Karels left back in 1990 in the CSRG tty.c 7.21.[*] P.S. I suspect this issue will also apply in the EFAULT case. If you try to uiomove (say) 2 bytes in separate iov entries, and the first iov doesn't fault but the second iov does, uiomove will advance the uio by one byte but return error. The caller probably won't process the first byte even though uio_resid has been decremented. So it will appear as though one byte has been transferred even though that byte was actually just copied to a buffer and then discarded. That might be a more common issue throughout the tree which could also be addressed by separating it into uiopeek and uioskip, but I haven't put further thought to whether it's a real issue or further effort into auditing drivers for it. [*] https://github.com/robohack/ucb-csrg-bsd/commit/b6660865c9f41cce98f8728f7f27b34b4a231e81#diff-37ae898972ce315c35bb81ff0e209efb141a9967efef821c7e31fda8d67d7ff2R1479-R1486 diff --git a/share/man/man9/uiomove.9 b/share/man/man9/uiomove.9 index 6417ea14b684..d2f579253386 100644 --- a/share/man/man9/uiomove.9 +++ b/share/man/man9/uiomove.9 @@ -24,7 +24,7 @@ .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE .\" POSSIBILITY OF SUCH DAMAGE. .\" -.Dd April 26, 2010 +.Dd May 9, 2023 .Dt UIOMOVE 9 .Os .Sh NAME @@ -34,6 +34,10 @@ .In sys/systm.h .Ft int .Fn uiomove "void *buf" "size_t n" "struct uio *uio" +.Ft int +.Fn uiopeek "void *buf" "size_t n" "struct uio *uio" +.Ft void +.Fn uioskip "void *buf" "size_t n" "struct uio *uio" .Sh DESCRIPTION The .Fn uiomove @@ -140,10 +144,35 @@ to point that much farther into the region described. This allows multiple calls to .Fn uiomove to easily be used to fill or drain the region of data. +.Pp +The +.Fn uiopeek +function copies up to +.Fa n +bytes of data without updating +.Fa uio ; +the +.Fn uioskip +function updates +.Fa uio +without copying any data, and is guaranteed never to sleep or fault +even if the buffers are in userspace and memory access via +.Fn uiomove +or +.Fn uiopeek +would trigger pag