Re: partial failures in write(2) (and read(2))

2021-02-18 Thread David Holland
On Tue, Feb 16, 2021 at 05:29:00PM +0700, Robert Elz wrote:
 > We could, of course, invent new interfaces (a write variant with an
 > extra pointer to length written arg perhaps, or where the length arg
 > is a pointer to a size_t and that is read and then written with either
 > the amount written, or the amount not written).
 > 
 > But I don't believe that any of this is needed, or desirable.

Right, I think succeeding with a short count is preferable in all
cases where anyone actually cares what happened.

 > We should first make sure that we do what POSIX requires, and simply
 > return a short write count (and no error) in the cases where that
 > should happen (out of space, over quota, exceeding file size limit,
 > and writing any more would block and O_NONBLOCK is set, more?).

As far as I can tell these errors are not currently handled in this
way, except maybe the EWOULDBLOCK case.

(And there's one other: signal delivery after writing some data to a
slow device. But that already works correctly.)

 > In the other error cases we should simply leave things alone and
 > accept it - it is the way unix always has been, and we have survived.
 > If we have a drive returning I/O errors (on writes), do we really
 > expect that earlier data written will have been written correctly?

Since writes to regular files will always go into the cache and not (I
think ever, absent O_DIRECT) be written to disk directly, I don't
think that case actually arises. Instead it will be filtering through
the completely broken fsync error reporting chain. (More on that
elsewhere.)

However, for reads... if you read part of a file and then get EIO
because the disk is going bad, it's reasonably likely that the part
you did get is ok, and moreover, if what you're trying to do is rescue
data from a dying disk, chances are you _do_ want it, even if there's
a moderate chance of it being corrupted. So I kind of think the EIO
case should succeed with a short count too.
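The rescue scenario described above is essentially what dd's conv=noerror,sync does: keep whatever read() returned before the error, and on EIO substitute zeros for the unreadable block and move on. A minimal userland sketch (the helper name rescue_copy is invented for illustration):

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLK 512

/* Copy from fd `in` to fd `out`; on EIO, zero-fill the bad block
 * and advance past it rather than aborting. Returns the number of
 * bytes produced, or -1 on a non-media error. */
static ssize_t rescue_copy(int in, int out)
{
    char buf[BLK];
    off_t off = 0;
    ssize_t total = 0;

    for (;;) {
        ssize_t n = pread(in, buf, BLK, off);
        if (n == 0)
            break;                  /* EOF */
        if (n == -1) {
            if (errno != EIO)
                return -1;          /* give up on non-media errors */
            memset(buf, 0, BLK);    /* unreadable block: substitute zeros */
            n = BLK;
        }
        if (write(out, buf, (size_t)n) != n)
            return -1;
        off += n;
        total += n;
    }
    return total;
}
```

Note that this pattern only works because pread() lets the caller skip past the bad region explicitly; if read() discarded the partial data and only returned EIO, the salvageable bytes before the bad block would be lost.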

As for EFAULT, I was testing with that because it's easy to test, but
I agree that it isn't particularly useful to continue; the one thing
I'm not sure about is possible interactions with generational garbage
collectors.

 > So, let's all forget fanciful interface redesigns, fix whatever we
 > need to fix to make things work the way they are supposed to work
 > (if there is anything) and leave the rest as "the world just broke"
 > type territory.

I'm pretty sure the only on-the-fly error that _does_ work in this
sense (in the sense of being converted to success with a short count)
is EINTR.

-- 
David A. Holland
dholl...@netbsd.org


Re: partial failures in write(2) (and read(2))

2021-02-16 Thread Robert Elz
Date: Mon, 15 Feb 2021 23:18:33 +0100
From: Rhialto
Message-ID:

  | A system call with error can return with the carry set and the error and
  | short count returned in separate registers. The carry bit is how
  | errors have been indicated since at least V7 (even V6?) anyway.

Earlier than v6, this dates back to when much of the system was
written in assembly code (including many of the utilities).

The issue isn't how to return multiple values from the kernel; that's
easy, we even have standard system calls (like pipe()) which do that
routinely.

The problem is that the definition of write() (and most other system
calls) is that they don't affect errno unless there is an error, and
if there is an error, they return -1 (which leaves no place to return
a short count as well).   This all actually happens in the libc stub.
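The contract described here is observable from any program: once the stub maps the error to -1, the only channel left to the caller is errno. A tiny probe of that convention (the function name is invented for illustration):

```c
#include <errno.h>
#include <unistd.h>

/* Trigger a guaranteed write() failure and return the errno the
 * caller sees; the -1 return leaves no room for a byte count. */
static int errno_convention(void)
{
    errno = 0;
    ssize_t n = write(-1, "data", 4);   /* no such descriptor */
    return n == -1 ? errno : 0;         /* only errno survives */
}
```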

We could, of course, invent new interfaces (a write variant with an
extra pointer to length written arg perhaps, or where the length arg
is a pointer to a size_t and that is read and then written with either
the amount written, or the amount not written).
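For concreteness, the in/out-length variant sketched above could be emulated in userland on top of plain write(); the name write_ex and its exact shape are invented here, not a proposed or existing interface:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical variant: *lenp holds the requested length on entry
 * and the amount actually written on return; the return value is 0
 * on success or an errno value, so a partial count and an error can
 * both be reported. */
static int write_ex(int fd, const void *buf, size_t *lenp)
{
    size_t want = *lenp, done = 0;
    const char *p = buf;

    while (done < want) {
        ssize_t n = write(fd, p + done, want - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            *lenp = done;       /* report partial progress */
            return errno;       /* and the error, separately */
        }
        done += (size_t)n;
    }
    *lenp = done;
    return 0;
}
```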

But I don't believe that any of this is needed, or desirable.

We should first make sure that we do what POSIX requires, and simply
return a short write count (and no error) in the cases where that
should happen (out of space, over quota, exceeding file size limit,
and writing any more would block and O_NONBLOCK is set, more?).

In the other error cases we should simply leave things alone and
accept it - it is the way unix always has been, and we have survived.
If we have a drive returning I/O errors (on writes), do we really
expect that earlier data written will have been written correctly?
Do you want to rely upon that?   It might have been possible once,
when drives were stupid, and simply wrote sectors in the order
presented, but with modern drives, with internal caches, which
write the data in any order they like, when they like, and do block
remapping when a sector goes bad, I wouldn't trust anything on
the drive once it starts saying write failed.   Pretending that
the first 8K of a 16KB write worked, and there was an I/O error
after that is folly.   It may easily have been that the 2nd 8K
block was written, and the first one gave up in error, eventually.
Some of the data intended to be written may have been written, but
we have no sane way to work out what (again, entire new interfaces
could allow the info to be returned, but to what point?  Who would
ever write code to make use of that info?)

It's even worse for the remaining cases, where the error is caused
by broken software (either a broken kernel doing insane things, or
a broken application asking to write data from memory it does not
own, etc).   Nothing can be assumed reliable in cases like that.

So, let's all forget fanciful interface redesigns, fix whatever we
need to fix to make things work the way they are supposed to work
(if there is anything) and leave the rest as "the world just broke"
type territory.

kre



Re: partial failures in write(2) (and read(2))

2021-02-15 Thread Rhialto
On Thu 11 Feb 2021 at 09:41:50 -0500, John Franklin wrote:
> From the application's perspective of calling write() in libc, when
> there is such an error the libc function could return a short response
> and set errno.  That doesn't work so much for the kernel's system
> call.

A system call with error can return with the carry set and the error and
short count returned in separate registers. The carry bit is how
errors have been indicated since at least V7 (even V6?) anyway.

-Olaf.
-- 
Olaf 'Rhialto' Seibert -- rhialto at falu dot nl
___  Anyone who is capable of getting themselves made President should on
\X/  no account be allowed to do the job.   --Douglas Adams, "THGTTG"




Re: partial failures in write(2) (and read(2))

2021-02-11 Thread Edgar Fuß
> I suppose libc could set a default handler for the new signal, and do some 
> extra work to set errno.
Then the libc routine could better use a new syscall, no?


Re: partial failures in write(2) (and read(2))

2021-02-11 Thread John Franklin
On Feb 10, 2021, at 18:28, Thor Lancelot Simon  wrote:
> 
> On Fri, Feb 05, 2021 at 08:10:06PM -0500, Mouse wrote:
>>> It is possible for write() calls to fail partway through, after
>>> already having written some data.
>> 
>> It is.  As you note later, it's also possible for read().
>> 
>> The rightest thing to do, it seems to me, would be to return the error
>> indication along with how much was successfully written (or read).  But
>> that, of course, requires a completely new API, which I gather is more
>> intrusive than you want to get into here.
> 
> I think it could be done with a signal in combination with the existing
> API.


From the application’s perspective of calling write() in libc, when there is 
such an error the libc function could return a short response and set errno.  
That doesn’t work so much for the kernel’s system call.  I suppose libc could 
set a default handler for the new signal, and do some extra work to set errno.  
Apps that need more control can override the signal handler.

jf
-- 
John Franklin
frank...@elfie.org

Re: partial failures in write(2) (and read(2))

2021-02-10 Thread Thor Lancelot Simon
On Fri, Feb 05, 2021 at 08:10:06PM -0500, Mouse wrote:
> > It is possible for write() calls to fail partway through, after
> > already having written some data.
> 
> It is.  As you note later, it's also possible for read().
> 
> The rightest thing to do, it seems to me, would be to return the error
> indication along with how much was successfully written (or read).  But
> that, of course, requires a completely new API, which I gather is more
> intrusive than you want to get into here.

I think it could be done with a signal in combination with the existing
API.

Thor


Re: partial failures in write(2) (and read(2))

2021-02-10 Thread Mouse
>>> It is possible for write() calls to fail partway through, after
>>> already having written some data.
>> The rightest thing to do, it seems to me, would be to return the
>> error indication along with how much was successfully written (or
>> read).  But that, of course, requires a completely new API, [...]
> I think it could be done with a signal in combination with the
> existing API.

It could!  Thank you for thinking of that; I evidently wasn't thinking
creatively enough.  Yes.  A new signal, which is generated when an I/O
operation errors after partial completion (and returning the amount of
I/O done).  I'm spinning an off-the-cuff design here: maybe, if the
signal is set to SIG_DFL, the error sticks around and is returned next
call, but if the signal is SIG_IGN or caught, not (like SIGCHLD,
SIG_IGN and SIG_DFL have different semantics even though the signal is
ignored by default).

That could be retrofitted relatively easily.

Not sure I like the price of a signal delivery just to get the error
status; an alternative I/O API could also exist.  Programs for which
the additional cost is acceptable could use the signal; those whose
tradeoffs go the other way could use the API.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: partial failures in write(2) (and read(2))

2021-02-06 Thread Robert Elz
Date: Fri, 05 Feb 2021 20:43:30 -0500
From: Greg Troxel
Message-ID:

  | An obvious question is what POSIX requires, pause for `kill -HUP kred` :)

Hey!   wiz is the daemon, I'm an angel...

  | I think your case (a) is the only conforming behavior and obviously what
  | the spec says must happen.

For what I'd call detectable-in-advance errors (and signals), yes, I agree
that's required (that is, all the cases where you can tell simply from the
state of the world that the write cannot complete as asked).  For hardware
errors (and in that category I'd include the case of a buffer that
starts out with valid addresses and continues into invalid ones, where a
SIGSEGV would perhaps also be acceptable behaviour, but if not, EFAULT
is generated), I don't think anything is specified at all.

The standard recommends advancing the file offset to the point of the error,
but doesn't require it, and certainly doesn't require returning the number of
bytes written up to the point where the error occurs (nor does it preclude
that, I believe).  This is not surprising, as what it describes is what
systems actually do: most systems traditionally, upon detecting an I/O
error or copy{in/out} failure, simply return -1 with errno set, rather
than attempting to advise the application how much data was actually
transferred before the error.

kre



Re: partial failures in write(2) (and read(2))

2021-02-05 Thread Greg Troxel

David Holland  writes:

> Basically, it is not feasible to check for and report all possible
> errors ahead of time, nor in general is it possible or even desirable
> to unwind portions of a write that have already been completed, which
> means that if a failure occurs partway through a write there are two
> reasonable choices for proceeding:
>(a) return success with a short count reporting how much data has
>already been written;
>(b) return failure.
>
> In case (a) the error gets lost unless additional steps are taken
> (which as far as I know we currently have no support for); in case (b)
> the fact that some data was written gets lost, potentially leading to
> corrupted output. Neither of these outcomes is optimal, but optimal
> (detecting all errors beforehand, or rolling back the data already
> written) isn't on the table.
>
> It seems to me that for most errors (a) is preferable, since correctly
> written user software will detect the short count, retry with the rest
> of the data, and hit the error case directly, but it seems not
> everyone agrees with me.

It seems to me that (a) is obviously the correct approach.

An obvious question is what POSIX requires, pause for `kill -HUP kred` :)

I am only a junior POSIX lawyer, not a senior one, but as I read

  
https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685

I think your case (a) is the only conforming behavior and obviously what
the spec says must happen.   I do not even see a glimmer of support for (b).

There is the issue of PIPE_BUF, and requests <= PIPE_BUF being atomic,
but I don't think you are talking about that.

Note that write is obligated to return partial completion if interrupted
by a signal.

I think your notion that it's ok to not return the reason the full
amount wasn't written is entirely valid.

I am surprised this is contentious (really; not trying to be difficult).





Re: partial failures in write(2) (and read(2))

2021-02-05 Thread Mouse
> It is possible for write() calls to fail partway through, after
> already having written some data.

It is.  As you note later, it's also possible for read().

The rightest thing to do, it seems to me, would be to return the error
indication along with how much was successfully written (or read).  But
that, of course, requires a completely new API, which I gather is more
intrusive than you want to get into here.

> Basically, it is not feasible to check for and report all possible
> errors ahead of time,

In some cases - such as EIO - it is not possible even in theory.

> nor in general is it possible or even desirable to unwind portions of
> a write that have already been completed,

Agreed.  In some cases, by the time the error is detected, the bits may
not even exist on the local machine any longer.

> which means that if a failure occurs partway through a write there
> are two reasonable choices for proceeding:
>(a) return success with a short count reporting how much data has
>already been written;
>(b) return failure.

Right.

Personally, my own preference is for (a), with the error remembered and
returned on the next write (resp. read) even if there is nothing (else)
erroneous about that next operation.
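The "remembered error" semantics preferred here can be sketched in userland with a wrapper that defers a saved error to the following call; the struct and function names are invented for illustration, and only the deferral half can be mimicked outside the kernel:

```c
#include <errno.h>
#include <unistd.h>

struct errfd {
    int fd;
    int pending;    /* errno saved from a partial failure, or 0 */
};

/* Write through the wrapper: if an error was remembered from a
 * previous short write, surface it now; otherwise do a real write.
 * A kernel implementation would set `pending` itself when a write
 * fails partway through; userland can only mimic the shape. */
static ssize_t errfd_write(struct errfd *e, const void *buf, size_t len)
{
    if (e->pending) {            /* deferred error from last time */
        errno = e->pending;
        e->pending = 0;
        return -1;
    }
    return write(e->fd, buf, len);
}
```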

> It seems to me that for most errors (a) is preferable, since
> correctly written user software will detect the short count, retry
> with the rest of the data, and hit the error case directly, but it
> seems not everyone agrees with me.

Well, if it _will_ "hit the error case directly", maybe.  It is not
clear to me that it will.  Except for EPIPE (which will rarely be
returned; most writers will die on SIGPIPE instead), none of those is
guaranteed to repeat on the next write - though admittedly some are
more likely to than others, and some of them (eg, EFAULT) definitely
will recur unless something in the writing process intervenes.

> [test with deliberately mprotect()ed part of buffer]
>- for regular files on ffs and probably most things that use
>  uiomove_ubc, the data in the accessible part of the buffer is
>  written, the call fails with EFAULT, and the size of the file is
>  reverted to what it was at the start.

!!  That, I would say, strongly violates POLA.  It is not behaviour I
would have been likely to guess.

> Anyhow, if you've made it this far, the actual question is: is the
> current behavior really what we want?

It is not what _I_ would prefer.  If we _had_ a more elaborate API, one
that could return partial success followed by an error, then I'd say we
could ignore the question of what write() and read() do on the grounds
that code that really cares can always use the more detailed call.

If adding that is an option, great.  If not, well, I think returning a
short count and remembering the error for the next call is about the
best option available.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


partial failures in write(2) (and read(2))

2021-02-05 Thread David Holland
(This came up in chat, and since there was no agreement at all there
it seems it ought to be discussed here.)

It is possible for write() calls to fail partway through, after
already having written some data. We do not currently document the
behavior under these circumstances (though we should), and some
experimentation suggests that at least some of the behavior violates
the principle of least surprise.

Basically, it is not feasible to check for and report all possible
errors ahead of time, nor in general is it possible or even desirable
to unwind portions of a write that have already been completed, which
means that if a failure occurs partway through a write there are two
reasonable choices for proceeding:
   (a) return success with a short count reporting how much data has
   already been written;
   (b) return failure.

In case (a) the error gets lost unless additional steps are taken
(which as far as I know we currently have no support for); in case (b)
the fact that some data was written gets lost, potentially leading to
corrupted output. Neither of these outcomes is optimal, but optimal
(detecting all errors beforehand, or rolling back the data already
written) isn't on the table.

It seems to me that for most errors (a) is preferable, since correctly
written user software will detect the short count, retry with the rest
of the data, and hit the error case directly, but it seems not
everyone agrees with me.
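The retry loop assumed of correctly written user software looks like this; the helper name full_write is a common idiom rather than a standard function:

```c
#include <errno.h>
#include <unistd.h>

/* Write all of `len` bytes, retrying on short counts. If an error
 * occurs after partial progress, the short count is returned and
 * the caller's next attempt hits the error case directly. */
static ssize_t full_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted, no progress lost */
            return done ? (ssize_t)done : -1;
        }
        done += (size_t)n;      /* short count: retry the remainder */
    }
    return (ssize_t)done;
}
```

Under option (a), a persistent condition like ENOSPC surfaces on the follow-up write of the remainder, so no information is lost as long as callers check the count.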

The cases that exist (going by the errors documented in write(2)) are:
   ENOSPC/EDQUOT (disk fills during the I/O)
   EFBIG (file size exceeds a limit)
   EFAULT (invalid user memory)
   EIO (hardware error)
   EPIPE (pipe gets closed during the I/O)

In the first three cases it's notionally possible to check for the
error case beforehand, but it doesn't actually work because the
activities of other processes or threads while the I/O is in progress
can invalidate the results of any check. (Also, for EFAULT the check
is expensive.)

Some of the same cases (particularly EFAULT and EIO) exist for read.
(Note that while for ordinary files stopping a partial read,
discarding the results, and returning failure is harmless, this is not
the case for pipes, ttys, and sockets, so it also matters for read.)

We were experimenting with the EFAULT behavior by using mprotect() to
deny access to part of a buffer and then writing the whole buffer out.
The results so far (with sufficiently large buffers):

   - for pipes, ttys, and probably everything that uses ordinary
 uiomove, the data in the accessible part of the buffer is written
 out and the call fails with EFAULT.

   - for regular files on ffs and probably most things that use
 uiomove_ubc, the data in the accessible part of the buffer is
 written, the call fails with EFAULT, and the size of the file is
 reverted to what it was at the start.

   - nobody's tested sockets yet, I think.

   - in all cases the mtime is updated.

The size reversion does unwind the common case, but in other cases it
produces bizarre behavior; e.g. if you have a 1M file and you write 2M
to it and then fault, the existing 1M of the file is replaced with the
first 1M of what you wrote and the rest is discarded; plus, given that
the call failed, most users' first instinct would be to assume that
nothing was written.

The behavior is probably the same for the other errors, though I
haven't looked and it's definitely possible that ENOSPC/EDQUOT are
handled more carefully.

Anyhow, if you've made it this far, the actual question is: is the
current behavior really what we want? (Whether or not it's technically
correct, or happens to be consistent with the exact wording in the man
pages, various aspects of it seem undesirable.)

ISTM that for all these cases except EIO it's sufficient to return
success with a short count and let the user code retry with the rest
of its data. For EIO I think it's best to do that and also retain the
error somewhere for the next write attempt.

-- 
David A. Holland
dholl...@netbsd.org