Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Mouse
>>> To the extent that it's impossible, it's impossible only because
>>> the APIs provided by the kernel to userland don't have any way to
>>> represent such a thing.  This would border on trivial to fix,
>>> except that it would be difficult to get much userland code to use
>>> the resulting APIs because of their lack of portability to other
>>> UNIX variants.
> Since write(2) is one of the oldest interfaces in Unix the chances of
> any change taking hold are vanishingly small...

Oh, I'm not suggesting that write(2) change.  What I'm suggesting is
the creation of some alternative interface, write_extended(2) let's
call it for the sake of concreteness[%], which is just like write(2)
except that it _does_ provide some way to unambiguously return "wrote
N, then error E".  (Exactly how is pretty much irrelevant; I'm sure
practically everyone here can imagine plenty of possible alternatives.
If it comes to arguing choices for that, I'd paint the bikeshed a dark
forest green.)  But write(2) would continue to exist, with more or less
its traditional semantics.  Only if - and only when - write_extended
becomes so popular that nobody uses plain write(2) any longer would I
propose removing it.

[%] If anything like this happens I certainly hope someone will invent
a better name.

But userland uptake for write_extended will be minimal, especially
initially; that's the portability issue I was talking about.

> All of this is not _independent_ of fixing uiomove callers, [...],
> but it's largely orthogonal to the original problem of incorrectly
> rolling back partial uiomoves. :-(

Agreed.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread David Holland
On Wed, May 10, 2023 at 10:11:31AM +, Taylor R Campbell wrote:
 > > > (In general, erroring in I/O is a whole additional can of worms; it's
 > > > wrong to not report back however much work happened before the error
 > > > when it can't be undone, but also impossible to both report work and
 > > > raise an error.  [...])
 > > 
 > > To the extent that it's impossible, it's impossible only because the
 > > APIs provided by the kernel to userland don't have any way to represent
 > > such a thing.  This would border on trivial to fix, except that it
 > > would be difficult to get much userland code to use the resulting APIs
 > > because of their lack of portability to other UNIX variants.

Since write(2) is one of the oldest interfaces in Unix the chances of
any change taking hold are vanishingly small...

 > write/writev/pwrite(2) could return the number of bytes it actually
 > did transfer before the fault, unless it is zero in which case it
 > could fail with EFAULT.

Yes; this is really the only correct option. The same goes for EIO,
and also other conditions that can arise in the middle of an I/O, like
EDQUOT, or given nfs, ECONNRESET. Ultimately if you actually do some
I/O the count reflecting what you did needs to be returned. Otherwise
any application response other than "give up entirely" can result in
duplicating or losing data.

Since there are at least some errors that aren't sticky (that is,
might not repeat if the application restarts the rest of the I/O, and
not repeating doesn't automatically mean that the condition cleared
and no longer matters), you really also want to not lose them. As I
recall the conclusion from past discussions is that the only really
suitable approach, messy though it is, is to stash the error in the
struct file (or somewhere) and then return it immediately on the next
operation. This applies at least to EIO and maybe others.

(For EFAULT, if it doesn't repeat it means the condition cleared, and
this is expected in the case of a generational G/C's write barriers
and similar usages. So just dropping it in the case of partial success
is ok.)

Note though that returning short counts without an error can also
violate expectations about short counts, interruption, and fast vs.
slow devices, but none of that is very well defined so it is almost
certainly a better compromise than returning an error after
successfully doing I/O.

The question that comes to my mind is whether it's better as an
interface for writes (and reads) to return errors along with a
uio_resid change, or whether we should distinguish failure of the
operation (stuff like ENXIO/ENODEV or possibly ESTALE) from failure of
the I/O, and return the latter separately in struct uio.

The latter seems kind of preferable to me (because it maintains the
expectation that if you get a failure return nothing happened) but
it's a big architectural change.

All of this is not _independent_ of fixing uiomove callers, because
they're where EIO gets generated and some of them have some
flexibility, and so we need a model for the error handling to update
them correctly, but it's largely orthogonal to the original problem of
incorrectly rolling back partial uiomoves. :-(

(Examples of flexibility: if you copy 32 bytes out of a uio into a
scratch buffer and then get EFAULT, you haven't actually done anything
yet, so you can leave uio_resid alone and let the enclosing write call
ultimately fail with EFAULT. But if you copied the same 32 bytes over
a buffer-cache buffer, for example, you can't undo that so you have to
report it. Conversely, if you read 32 bytes out of a tty you can't put
it back and have to report it, but if you read 32 bytes out of a
buffer cache buffer it'll still be there later and you haven't
actually done anything yet.)

 > Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not
 > sure it's right to fail after partial progress with _any_ error code,
 > except possibly EIO if metadata updates failed.

It's not. (and I don't think EIO is an exception)

 > I don't see clear guidance in POSIX on the subject.

shocking :-)

 > The attached test program successfully writes 65536 bytes to a pipe in
 > a single writev(2) call that fails and reports no progress, in a
 > NetBSD 10ish kernel:

That's definitely a bug...

-- 
David A. Holland
dholl...@netbsd.org


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Taylor R Campbell
If we think too hard about making this precise we'll wind up with
names like uiosimulateviolatingnocloningtheorem and uiowavecollapse.

uioskip serves either to discard data without consuming it, or to
advance the pointer after using the data with uiopeek.

peek is in contrast to get, like bufq_get/peek -- it's nondestructive
of the queue/transfer.  Probably won't be used much in the UIO_READ
direction (i.e., writing data into a user's buffer) but who knows.
None of these opposites are perfect: bufq_poke would be bonkers,
memcpy vs memmove is hopeless, and nothing here actually destroys the
source object like C++ move semantics vs copy semantics which more
accurately reflect the physical meanings of the English words.

I'm more interested in how the nuclear reactor is going to work!


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Mouse
>> - uiopeek leaves uio itself untouched ([...]).
> Hmâ?? Iâ??m having second thoughts about uiopeek(), as well.  It implies a d$

That is a good point.  But would it be a problem to have uiopeek
(works only to move from uio to buffer) and uiopoke (the other way)?

I've never liked the way uiomove can move data one direction or the
other depending on how the uio is set up.  (I'd rather have uioread and
uiowrite except that I'd be constantly trying to remember whether
uioread reads from the uio or moves data in the direction a read()
operation does.)  Maybe uioget and uioput?

Is there any significant amount of code that calls uiomove without
knowing which direction the bits are moving?

As for uiocopy versus uiomove, those are similar enough to memcpy and
memmove that the implication feels to me more like "buffers can
overlap" (for all that that is a highly unlikely use case) or some
such.

Finding good names is a mess.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Johnny Billquist

On 2023-05-10 14:00, Jason Thorpe wrote:



On May 9, 2023, at 3:09 PM, Taylor R Campbell 
 wrote:

- uiopeek leaves uio itself untouched (it may modify the source/target
  buffers but it's idempotent).


Hm… I’m having second thoughts about uiopeek(), as well.  It implies a 
direction (“peek” feels like “read”, and “write” would feel more like a 
“poke”).  I think uiocopy() is a better name, and I think it is sufficiently 
different from uiomove() (“move” implies a sortof destructive-ness that “copy” 
does not).


I would sortof agree. But for me, "peek" more suggests that you are 
looking at the content, but intentionally leaving the data in there for 
something else to later pick up/process. copy seems to more 
appropriately describe what the intent is.


Also, skip for also implies that you are skipping over the data, 
intentionally not interested in it. After a copy, I would feel it would 
be more describing to say advance rather than skip (or if someone else 
have another good verb for it, I'm all ears...)


  Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Jason Thorpe


> On May 9, 2023, at 3:09 PM, Taylor R Campbell 
>  wrote:
> 
> - uiopeek leaves uio itself untouched (it may modify the source/target
>  buffers but it's idempotent).

Hm… I’m having second thoughts about uiopeek(), as well.  It implies a 
direction (“peek” feels like “read”, and “write” would feel more like a 
“poke”).  I think uiocopy() is a better name, and I think it is sufficiently 
different from uiomove() (“move” implies a sort of destructive-ness that “copy” 
does not).

-- thorpej



Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread Taylor R Campbell
> Date: Tue, 9 May 2023 23:03:27 -0400 (EDT)
> From: Mouse 
> 
> > (In general, erroring in I/O is a whole additional can of worms; it's
> > wrong to not report back however much work happened before the error
> > when it can't be undone, but also impossible to both report work and
> > raise an error.  [...])
> 
> To the extent that it's impossible, it's impossible only because the
> APIs provided by the kernel to userland don't have any way to represent
> such a thing.  This would border on trivial to fix, except that it
> would be difficult to get much userland code to use the resulting APIs
> because of their lack of portability to other UNIX variants.

write/writev/pwrite(2) could return the number of bytes it actually
did transfer before the fault, unless it is zero in which case it
could fail with EFAULT.

It already behaves this way for EINTR and EAGAIN/EWOULDBLOCK -- but
not for EFAULT:

https://nxr.netbsd.org/xref/src/sys/kern/sys_generic.c?r=1.134#478

Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not
sure it's right to fail after partial progress with _any_ error code,
except possibly EIO if metadata updates failed.

I don't see clear guidance in POSIX on the subject.

- The EINTR and EAGAIN/EWOULDBLOCK behaviour is mandated:

  `If write() is interrupted by a signal before it writes any data, it
   shall return -1 with errno set to [EINTR].'

  `Write requests to a pipe or FIFO ... If the O_NONBLOCK flag is
   set A write request for more than {PIPE_BUF} bytes ... When at
   least one byte can be written, transfer what it can and return the
   number of bytes written.'

- Without O_NONBLOCK, the DESCRIPTION section says `on normal
  completion it shall return nbyte', but a fault is not normal
  completion.

- The RATIONALE section says `Partial and deferred writes are only
  possible with O_NONBLOCK set' and `There is no exception regarding
  partial writes when O_NONBLOCK is set', but that's inconsistent with
  the behaviour mandated for interruption by a signal.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

The attached test program successfully writes 65536 bytes to a pipe in
a single writev(2) call that fails and reports no progress, in a
NetBSD 10ish kernel:

$ ./writefault | cat >foo
space=65536
nwrit=-1 errno=14
$ stat -f %z foo
65536

A similar program on macOS 13.3 (using hard-coded space instead of
ioctl(FIONSPACE) which seems to be missing from macOS) exhibits the
same behaviour.  But a similar program on Linux 4.15 (using
fpathconf(_PC_PIPE_BUF)) returns the partial number of bytes written
(and anything less than fpathconf(_PC_PIPE_BUF) fails without writing
anything to foo).

_If_ write/writev/pwrite(2) were modified to handle EFAULT like it
currently handles EINTR and EAGAIN/EWOULDBLOCK and return partial
progress, then the current uiomove(9) semantics would give the wrong
indication of progress in the event of fault -- too much progress,
rather than too little.  In contrast, uiopeek/uioskip would allow
pipe_write to report exactly the number of bytes that it actually made
available to the reader.
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 

#define HOWMANY(X, N)   (((X) + ((N) - 1))/(N))

int
main(void)
{
const int PAGE_SIZE = sysconf(_SC_PAGESIZE);
int space;
struct iovec iov[2];
size_t allocsize;
void *pg;
ssize_t nwrit;

if (ioctl(STDOUT_FILENO, FIONSPACE, &space) == -1)
err(1, "space");
space *= 4; /* BIG_PIPE_SIZE */
fprintf(stderr, "space=%d\n", space);

allocsize = (HOWMANY(space, PAGE_SIZE) + 1) * PAGE_SIZE;
pg = mmap(NULL, allocsize, PROT_READ|PROT_WRITE, MAP_ANON, -1, 0);
if (pg == MAP_FAILED)
err(1, "mmap");
memset(pg, 0x53, allocsize);
if (mprotect(pg + allocsize - PAGE_SIZE, PAGE_SIZE, PROT_NONE) == -1)
err(1, "mprotect");

iov[0].iov_base = pg;
iov[0].iov_len = space;
iov[1].iov_base = pg + allocsize - PAGE_SIZE;
iov[1].iov_len = 1;
errno = 0;
nwrit = writev(STDOUT_FILENO, iov, __arraycount(iov));
fprintf(stderr, "nwrit=%zd errno=%d\n", nwrit, errno);

return 0;
}


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Mouse
> (In general, erroring in I/O is a whole additional can of worms; it's
> wrong to not report back however much work happened before the error
> when it can't be undone, but also impossible to both report work and
> raise an error.  [...])

To the extent that it's impossible, it's impossible only because the
APIs provided by the kernel to userland don't have any way to represent
such a thing.  This would border on trivial to fix, except that it
would be difficult to get much userland code to use the resulting APIs
because of their lack of portability to other UNIX variants.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread David Holland
On Tue, May 09, 2023 at 09:21:38PM +, Taylor R Campbell wrote:
 > I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which
 > together are equivalent to successful uiomove(buf, n, uio).
 > 
 > This allows a driver to separately:
 > 1. transfer data into a buffer (uiopeek) first, and then
 > 2. advance the uio (uioskip) only after determining how much of the
 >data the driver was able to actually process.
 > 
 > Thoughts?

I also prefer the name "uioadvance", because that seems like it
characterizes the operation better.

Otherwise, seems fine.

 > (a) Is ttwrite violating a contract by advancing the uio's iovs past
 > the number of bytes it consumed of uio_resid?

Yes, IMO. That's supposed to be an invariant of the data structure,
and it shouldn't get broken, even on error.

 > P.S.  I suspect this issue will also apply in the EFAULT case.  If you
 >   try to uiomove (say) 2 bytes in separate iov entries, and the
 >   first iov doesn't fault but the second iov does, uiomove will
 >   advance the uio by one byte but return error.

That also seems totally wrong.

(In general, erroring in I/O is a whole additional can of worms; it's
wrong to not report back however much work happened before the error
when it can't be undone, but also impossible to both report work and
raise an error. And unfortunately in the presence of things like
generational garbage collectors, it's unsafe to assume that it doesn't
matter because EFAULT is fatal/unrecoverable.)

-- 
David A. Holland
dholl...@netbsd.org


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Mouse
>> I'm not a fan of uioskip() as a name - [...]
> I agree.  "skip" seem to have wrong connotations (cf. dd(1)).

I'm not sure I agree.  (The dd analogy is weak; dd has no peek
operation, and uioskip - under whatever name - would border on useless
without uiopeek.)

For uioskip, it _is_ skip semantics.  The bytes skipped are not copied
anywhere, not by uioskip.  (They may have been copied earlier with
uiopeek, but that doesn't affect what uioskip does.)  If you want
skip-*with*-copy, well, uiomove is still there.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Valery Ushakov
On Tue, May 09, 2023 at 14:33:26 -0700, Jason Thorpe wrote:

> I'm not a fan of uioskip() as a name - I think uioadvance() would be
> better.  Skip implies, to my brain, that the data is being thrown
> away (even if you're already consumed it).

I agree.  "skip" seem to have wrong connotations (cf. dd(1)).

-uwe


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Jason Thorpe



> On May 9, 2023, at 2:21 PM, Taylor R Campbell 
>  wrote:
> 
> tl;dr
> 
> I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which

I’m not a fan of uioskip() as a name … I think uioadvance() would be better.  
Skip implies, to my brain, that the data is being thrown away (even if you’re 
already consumed it).

-- thorpej



Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Mouse
> I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which
> together are equivalent to successful uiomove(buf, n, uio).

For what it may be worth, I like this.  I don't _think_ I've ever run
into issues caused by this issue before, but I have trouble seeing it
as anything other than a bug waiting to happen.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread Taylor R Campbell
tl;dr

I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which
together are equivalent to successful uiomove(buf, n, uio).

This allows a driver to separately:
1. transfer data into a buffer (uiopeek) first, and then
2. advance the uio (uioskip) only after determining how much of the
   data the driver was able to actually process.

Thoughts?


DETAILS

A common pattern when transferring data between a driver and userland
is unlocked uiomove(9) into a temporary buffer, and then some I/O
operation under a driver lock.  The uiomove(9) itself normally can't
be done under a driver lock because it may fault and sleep
indefinitely for I/O.

However, sometimes the I/O operation doesn't consume everything.  This
can happen in ttwrite (sys/kern/tty.c): in the non-blocking case
(flags & IO_NDELAY), it may bite off a chunk of the uio bigger than
the tty can chew without blocking.

In that case, ttwrite hacks uio->uio_resid so that the write(2) system
call reports the correct number of bytes writen so the user program
can pick up where it left off (a hack that has existed since 1990 or
earlier) -- but the uio itself becomes inconsistent, because it
doesn't roll back uio->uio_iov or uio->uio_iov[0].iov_base or anything
else.

And ktrwrite (sys/kern/kern_ktrace.c) tries to reuse the uio in the
event that fp->f_ops->fo_write returns EWOULDBLOCK, after pausing for
a moment.  This requires the uio to be consistent, which falls apart
with ttwrite, as in a litany of syzbot reports like this one:

https://syzkaller.appspot.com/bug?id=976210ed438d48ac275d77d7ebf4a086e43b5fcb

Now, ktracing to a tty is pretty weird, but the salient point is:

(a) Is ttwrite violating a contract by advancing the uio's iovs past
the number of bytes it consumed of uio_resid?

or

(b) Is ktrwrite abusing struct fileops::fo_write by reusing uio even
after an error?

If (a), the attached patch breaks ttwrite's uiomove into two parts:
uiopeek, to get the data into a temporary buffer, which may be then
fed byte-by-byte into ttyoutput; and uioskip, to advance the uio by as
many bytes as were actually consumed by the tty -- all of them in the
success case, but not all if we stop with EWOULDBLOCK.

If (b), we have to convince ktrwrite to reconstruct a uio that picks
up where fo_write left off in the same buffer.  Maybe we could
reconstruct it but skip the first n bytes, or take a snapshot of the
uio, but either way it would be annoying to deal with -- more fiddly
than writing, using, and testing uiopeek/uioskip which I already did.

The problem appears not to manifest in OpenBSD or FreeBSD because they
have nothing that calls fo_write in a loop with the same uio -- their
ktrace is limited to vnodes and doesn't handle EWOULDBLOCK by looping.
But their ttwrite routines (or equivalent) do leave the uio
inconsistent on error, with the same comment Mike Karels left back in
1990 in the CSRG tty.c 7.21.[*]

P.S.  I suspect this issue will also apply in the EFAULT case.  If you
  try to uiomove (say) 2 bytes in separate iov entries, and the
  first iov doesn't fault but the second iov does, uiomove will
  advance the uio by one byte but return error.

  The caller probably won't process the first byte even though
  uio_resid has been decremented.  So it will appear as though one
  byte has been transferred even though that byte was actually
  just copied to a buffer and then discarded.

  That might be a more common issue throughout the tree which
  could also be addressed by separating it into uiopeek and
  uioskip, but I haven't put further thought to whether it's a
  real issue or further effort into auditing drivers for it.

[*] 
https://github.com/robohack/ucb-csrg-bsd/commit/b6660865c9f41cce98f8728f7f27b34b4a231e81#diff-37ae898972ce315c35bb81ff0e209efb141a9967efef821c7e31fda8d67d7ff2R1479-R1486
diff --git a/share/man/man9/uiomove.9 b/share/man/man9/uiomove.9
index 6417ea14b684..d2f579253386 100644
--- a/share/man/man9/uiomove.9
+++ b/share/man/man9/uiomove.9
@@ -24,7 +24,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd April 26, 2010
+.Dd May 9, 2023
 .Dt UIOMOVE 9
 .Os
 .Sh NAME
@@ -34,6 +34,10 @@
 .In sys/systm.h
 .Ft int
 .Fn uiomove "void *buf" "size_t n" "struct uio *uio"
+.Ft int
+.Fn uiopeek "void *buf" "size_t n" "struct uio *uio"
+.Ft void
+.Fn uioskip "void *buf" "size_t n" "struct uio *uio"
 .Sh DESCRIPTION
 The
 .Fn uiomove
@@ -140,10 +144,35 @@ to point that much farther into the region described.
 This allows multiple calls to
 .Fn uiomove
 to easily be used to fill or drain the region of data.
+.Pp
+The
+.Fn uiopeek
+function copies up to
+.Fa n
+bytes of data without updating
+.Fa uio ;
+the
+.Fn uioskip
+function updates
+.Fa uio
+without copying any data, and is guaranteed never to sleep or fault
+even if the buffers are in userspace and memory access via
+.Fn uiomove
+or
+.Fn uiopeek
+would trigger pag