Re: partial failures in write(2) (and read(2))
On Tue, Feb 16, 2021 at 05:29:00PM +0700, Robert Elz wrote:
 > We could, of course, invent new interfaces (a write variant with an
 > extra pointer to length written arg perhaps, or where the length arg
 > is a pointer to a size_t and that is read and then written with either
 > the amount written, or the amount not written).
 >
 > But I don't believe that any of this is needed, or desirable.

Right, I think succeeding with a short count is preferable in all cases
where anyone actually cares what happened.

 > We should first make sure that we do what POSIX requires, and simply
 > return a short write count (and no error) in the cases where that
 > should happen (out of space, over quota, exceeding file size limit,
 > and writing any more would block and O_NONBLOCK is set, more?).

As far as I can tell these errors are not currently handled in this way,
except maybe the EWOULDBLOCK case.  (And there's one other: signal
delivery after writing some data to a slow device.  But that already
works correctly.)

 > In the other error cases we should simply leave things alone and
 > accept it - it is the way unix always has been, and we have survived.
 > If we have a drive returning I/O errors (on writes), do we really
 > expect that earlier data written will have been written correctly?

Since writes to regular files will always go into the cache and not (I
think ever, absent O_DIRECT) be written to disk directly, I don't think
that case actually arises.  Instead it will be filtering through the
completely broken fsync error reporting chain.  (More on that
elsewhere.)

However, for reads... if you read part of a file and then get EIO
because the disk is going bad, it's reasonably likely that the part you
did get is ok, and moreover, if what you're trying to do is rescue data
from a dying disk, chances are you _do_ want it, even if there's a
moderate chance of it being corrupted.  So I kind of think the EIO case
should succeed with a short count too.
As for EFAULT, I was testing with that because it's easy to test, but I
agree that it isn't particularly useful to continue; the one thing I'm
not sure about is possible interactions with generational garbage
collectors.

 > So, let's all forget fanciful interface redesigns, fix whatever we
 > need to fix to make things work the way they are supposed to work
 > (if there is anything) and leave the rest as "the world just broke"
 > type territory.

I'm pretty sure the only on-the-fly error that _does_ work in this sense
(in the sense of being converted to success with a short count) is
EINTR.

-- 
David A. Holland
dholl...@netbsd.org
Re: partial failures in write(2) (and read(2))
    Date:        Mon, 15 Feb 2021 23:18:33 +0100
    From:        Rhialto
    Message-ID:

  | A system call with error can return with the carry set and the error
  | and short count returned in separate registers. The carry bit is how
  | errors used to be indicated since at least V7 (even V6?) anyway.

Earlier than v6; this dates back to when much of the system was written
in assembly code (including many of the utilities).

The issue isn't how to return multiple values from the kernel; that's
easy, and we even have standard sys calls (like pipe()) which do it
routinely.

The problem is that the definition of write() (and most other system
calls) is that they don't affect errno unless there is an error, and if
there is an error, they return -1 (which leaves no place to return a
short count as well).  This all actually happens in the libc stub.

We could, of course, invent new interfaces (a write variant with an
extra pointer to length written arg perhaps, or where the length arg
is a pointer to a size_t and that is read and then written with either
the amount written, or the amount not written).

But I don't believe that any of this is needed, or desirable.

We should first make sure that we do what POSIX requires, and simply
return a short write count (and no error) in the cases where that
should happen (out of space, over quota, exceeding file size limit,
and writing any more would block and O_NONBLOCK is set, more?).

In the other error cases we should simply leave things alone and
accept it - it is the way unix always has been, and we have survived.
If we have a drive returning I/O errors (on writes), do we really
expect that earlier data written will have been written correctly?
Do you want to rely upon that?  It might have been possible once, when
drives were stupid and simply wrote sectors in the order presented, but
with modern drives, with internal caches, which write the data in any
order they like, when they like, and do block remapping when a sector
goes bad, I wouldn't trust anything on the drive once it starts saying
write failed.

Pretending that the first 8KB of a 16KB write worked, and there was an
I/O error after that, is folly.  It may easily have been that the 2nd
8KB block was written, and the first one gave up in error, eventually.
Some of the data intended to be written may have been written, but we
have no sane way to work out what (again, entire new interfaces could
allow the info to be returned, but to what point?  Who would ever write
code to make use of that info?)

It's even worse for the remaining cases, where the error is caused by
broken software (either a broken kernel doing insane things, or a
broken application asking to write data from memory it does not own,
etc).  Nothing can be assumed reliable in cases like that.

So, let's all forget fanciful interface redesigns, fix whatever we
need to fix to make things work the way they are supposed to work
(if there is anything) and leave the rest as "the world just broke"
type territory.

kre
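[Of the "POSIX requires a short count" cases kre lists, the one that is
easy to observe from userland is the O_NONBLOCK case.  A sketch, under
the assumption that the requested size exceeds the pipe's capacity
(which is platform-dependent: 64 KiB on Linux, historically 16 KiB on
NetBSD, so 1 MiB is comfortably past either):]

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `len` bytes to a fresh non-blocking pipe in one call and
 * return what write(2) reports.  With len larger than the pipe's
 * capacity this is the "succeed with a short count, no error" case:
 * the pipe fills, and the call returns how much fit rather than -1.
 */
static ssize_t
nonblock_pipe_write(size_t len)
{
	int fds[2];
	char *buf;
	ssize_t n;

	if (pipe(fds) == -1)
		return -1;
	if (fcntl(fds[1], F_SETFL, O_NONBLOCK) == -1)
		return -1;
	if ((buf = malloc(len)) == NULL)
		return -1;
	memset(buf, 'x', len);

	n = write(fds[1], buf, len);	/* fills the pipe, then stops */

	free(buf);
	close(fds[0]);
	close(fds[1]);
	return n;
}
```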
Re: partial failures in write(2) (and read(2))
On Thu 11 Feb 2021 at 09:41:50 -0500, John Franklin wrote:
 > From the application's perspective of calling write() in libc, when
 > there is such an error the libc function could return a short response
 > and set errno.  That doesn't work so much for the kernel's system
 > call.

A system call with error can return with the carry set and the error
and short count returned in separate registers.  The carry bit is how
errors used to be indicated since at least V7 (even V6?) anyway.

-Olaf.
-- 
Olaf 'Rhialto' Seibert -- rhialto at falu dot nl
___  Anyone who is capable of getting themselves made President should
\X/  on no account be allowed to do the job.  --Douglas Adams, "THGTTG"
Re: partial failures in write(2) (and read(2))
 > I suppose libc could set a default handler for the new signal, and do
 > some extra work to set errno.

Then the libc routine could better use a new syscall, no?
Re: partial failures in write(2) (and read(2))
On Feb 10, 2021, at 18:28, Thor Lancelot Simon wrote:
 > On Fri, Feb 05, 2021 at 08:10:06PM -0500, Mouse wrote:
 >>> It is possible for write() calls to fail partway through, after
 >>> already having written some data.
 >>
 >> It is.  As you note later, it's also possible for read().
 >>
 >> The rightest thing to do, it seems to me, would be to return the
 >> error indication along with how much was successfully written (or
 >> read).  But that, of course, requires a completely new API, which I
 >> gather is more intrusive than you want to get into here.
 >
 > I think it could be done with a signal in combination with the
 > existing API.

From the application's perspective of calling write() in libc, when
there is such an error the libc function could return a short response
and set errno.  That doesn't work so much for the kernel's system call.

I suppose libc could set a default handler for the new signal, and do
some extra work to set errno.  Apps that need more control can override
the signal handler.

jf
-- 
John Franklin
frank...@elfie.org
Re: partial failures in write(2) (and read(2))
On Fri, Feb 05, 2021 at 08:10:06PM -0500, Mouse wrote:
 > > It is possible for write() calls to fail partway through, after
 > > already having written some data.
 >
 > It is.  As you note later, it's also possible for read().
 >
 > The rightest thing to do, it seems to me, would be to return the
 > error indication along with how much was successfully written (or
 > read).  But that, of course, requires a completely new API, which I
 > gather is more intrusive than you want to get into here.

I think it could be done with a signal in combination with the existing
API.

Thor
Re: partial failures in write(2) (and read(2))
>>> It is possible for write() calls to fail partway through, after
>>> already having written some data.

>> The rightest thing to do, it seems to me, would be to return the
>> error indication along with how much was successfully written (or
>> read).  But that, of course, requires a completely new API, [...]

> I think it could be done with a signal in combination with the
> existing API.

It could!  Thank you for thinking of that; I evidently wasn't thinking
creatively enough.

Yes.  A new signal, which is generated when an I/O operation errors
after partial completion (and returning the amount of I/O done).  I'm
spinning an off-the-cuff design here: maybe, if the signal is set to
SIG_DFL, the error sticks around and is returned next call, but if the
signal is SIG_IGN or caught, not (like SIGCHLD, SIG_IGN and SIG_DFL
have different semantics even though the signal is ignored by default).
That could be retrofitted relatively easily.

Not sure I like the price of a signal delivery just to get the error
status; an alternative I/O API could also exist.  Programs for which
the additional cost is acceptable could use the signal; those whose
tradeoffs go the other way could use the API.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: partial failures in write(2) (and read(2))
    Date:        Fri, 05 Feb 2021 20:43:30 -0500
    From:        Greg Troxel
    Message-ID:

  | An obvious question is what POSIX requires, pause for `kill -HUP kred` :)

Hey!  wiz is the daemon, I'm an angel...

  | I think your case (a) is the only conforming behavior and obviously what
  | the spec says must happen.

For what I'd call detectable-in-advance errors (and signals), yes, I
agree, that's required (that is, all the cases where you can tell
simply from the state of the world that the write cannot complete as
asked).

For hardware errors (and in that category I'd include the case of a
buffer that starts out with valid addresses and continues to invalid
ones, where a SIGSEGV would perhaps also be acceptable behaviour, but
if not, an EFAULT is generated), I don't think anything is specified at
all.  The standard recommends advancing the file offset to the point of
the error, but doesn't require it, and certainly doesn't require
returning the number of bytes written up to the point where the error
occurs (nor does it preclude that, I believe).

This is not surprising, as what it describes is what systems actually
do, and most systems traditionally, upon detecting an I/O error or
copy{in/out} failure, simply return -1 with errno set, rather than
attempting to advise the application how much data was actually
transferred before the error.

kre
Re: partial failures in write(2) (and read(2))
David Holland writes:
 > Basically, it is not feasible to check for and report all possible
 > errors ahead of time, nor in general is it possible or even desirable
 > to unwind portions of a write that have already been completed, which
 > means that if a failure occurs partway through a write there are two
 > reasonable choices for proceeding:
 >    (a) return success with a short count reporting how much data has
 >        already been written;
 >    (b) return failure.
 >
 > In case (a) the error gets lost unless additional steps are taken
 > (which as far as I know we currently have no support for); in case
 > (b) the fact that some data was written gets lost, potentially
 > leading to corrupted output.  Neither of these outcomes is optimal,
 > but optimal (detecting all errors beforehand, or rolling back the
 > data already written) isn't on the table.
 >
 > It seems to me that for most errors (a) is preferable, since
 > correctly written user software will detect the short count, retry
 > with the rest of the data, and hit the error case directly, but it
 > seems not everyone agrees with me.

It seems to me that (a) is obviously the correct approach.

An obvious question is what POSIX requires, pause for `kill -HUP kred` :)

I am only a junior POSIX lawyer, not a senior one, but as I read

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685

I think your case (a) is the only conforming behavior and obviously
what the spec says must happen.  I do not even see a glimmer of support
for (b).

There is the issue of PIPE_BUF, and requests <= PIPE_BUF being atomic,
but I don't think you are talking about that.

Note that write is obligated to return partial completion if
interrupted by a signal.

I think your notion that it's ok to not return the reason the full
amount wasn't written is entirely valid.

I am surprised this is contentious (really; not trying to be
difficult).
Re: partial failures in write(2) (and read(2))
> It is possible for write() calls to fail partway through, after
> already having written some data.

It is.  As you note later, it's also possible for read().

The rightest thing to do, it seems to me, would be to return the error
indication along with how much was successfully written (or read).  But
that, of course, requires a completely new API, which I gather is more
intrusive than you want to get into here.

> Basically, it is not feasible to check for and report all possible
> errors ahead of time,

In some cases - such as EIO - it is not possible even in theory.

> nor in general is it possible or even desirable to unwind portions of
> a write that have already been completed,

Agreed.  In some cases, by the time the error is detected, the bits may
not even exist on the local machine any longer.

> which means that if a failure occurs partway through a write there
> are two reasonable choices for proceeding:
>    (a) return success with a short count reporting how much data has
>        already been written;
>    (b) return failure.

Right.  Personally, my own preference is for (a), with the error
remembered and returned on the next write (resp. read) even if there is
nothing (else) erroneous about that next operation.

> It seems to me that for most errors (a) is preferable, since
> correctly written user software will detect the short count, retry
> with the rest of the data, and hit the error case directly, but it
> seems not everyone agrees with me.

Well, if it _will_ "hit the error case directly", maybe.  It is not
clear to me that it will.  Except for EPIPE (which will rarely be
returned; most writers will die on SIGPIPE instead), none of those is
guaranteed to repeat on the next write - though admittedly some are
more likely to than others, and some of them (eg, EFAULT) definitely
will recur unless something in the writing process intervenes.
> [test with deliberately mprotect()ed part of buffer]
> - for regular files on ffs and probably most things that use
>   uiomove_ubc, the data in the accessible part of the buffer is
>   written, the call fails with EFAULT, and the size of the file is
>   reverted to what it was at the start.

!!

That, I would say, strongly violates POLA.  It is not behaviour I would
have been likely to guess.

> Anyhow, if you've made it this far, the actual question is: is the
> current behavior really what we want?

It is not what _I_ would prefer.

If we _had_ a more elaborate API, one that could return partial success
followed by an error, then I'd say we could ignore the question of what
write() and read() do on the grounds that code that really cares can
always use the more detailed call.

If adding that is an option, great.  If not, well, I think returning a
short count and remembering the error for the next call is about the
best option available.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
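[The "short count now, remembered error on the next call" semantics
Mouse prefers can be modeled entirely in userland.  A sketch, with a
backend function pointer standing in for the device so the behavior can
be exercised without real hardware faults; struct and function names
are illustrative:]

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * A writer that converts a failure occurring after some bytes have
 * moved into a short count, and reports the saved errno on the
 * *next* call instead.
 */
struct remwrite {
	ssize_t (*backend)(const char *, size_t);  /* "device" write */
	int pending_errno;                         /* deferred error, 0 = none */
};

static ssize_t
remwrite(struct remwrite *rw, const char *buf, size_t len)
{
	size_t done = 0;

	if (rw->pending_errno != 0) {	/* report last call's failure now */
		errno = rw->pending_errno;
		rw->pending_errno = 0;
		return -1;
	}
	while (done < len) {
		ssize_t n = rw->backend(buf + done, len - done);

		if (n < 0) {
			if (done == 0)
				return -1;		/* nothing moved: fail now */
			rw->pending_errno = errno;	/* partial: defer the error */
			return (ssize_t)done;		/* ...and report progress */
		}
		if (n == 0)
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```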
partial failures in write(2) (and read(2))
(This came up in chat, and since there was no agreement at all there it
seems it ought to be discussed here.)

It is possible for write() calls to fail partway through, after already
having written some data.  We do not currently document the behavior
under these circumstances (though we should), and some experimentation
suggests that at least some of the behavior violates the principle of
least surprise.

Basically, it is not feasible to check for and report all possible
errors ahead of time, nor in general is it possible or even desirable
to unwind portions of a write that have already been completed, which
means that if a failure occurs partway through a write there are two
reasonable choices for proceeding:
   (a) return success with a short count reporting how much data has
       already been written;
   (b) return failure.

In case (a) the error gets lost unless additional steps are taken
(which as far as I know we currently have no support for); in case (b)
the fact that some data was written gets lost, potentially leading to
corrupted output.  Neither of these outcomes is optimal, but optimal
(detecting all errors beforehand, or rolling back the data already
written) isn't on the table.

It seems to me that for most errors (a) is preferable, since correctly
written user software will detect the short count, retry with the rest
of the data, and hit the error case directly, but it seems not everyone
agrees with me.

The cases that exist (going by the errors documented in write(2)) are:
   ENOSPC/EDQUOT (disk fills during the I/O)
   EFBIG (file size exceeds a limit)
   EFAULT (invalid user memory)
   EIO (hardware error)
   EPIPE (pipe gets closed during the I/O)

In the first three cases it's notionally possible to check for the
error case beforehand, but it doesn't actually work because the
activities of other processes or threads while the I/O is in progress
can invalidate the results of any check.  (Also, for EFAULT the check
is expensive.)
Some of the same cases (particularly EFAULT and EIO) exist for read.
(Note that while for ordinary files stopping a partial read, discarding
the results, and returning failure is harmless, this is not the case
for pipes, ttys, and sockets, so it also matters for read.)

We were experimenting with the EFAULT behavior by using mprotect() to
deny access to part of a buffer and then writing the whole buffer out.
The results so far (with sufficiently large buffers):

 - for pipes, ttys, and probably everything that uses ordinary uiomove,
   the data in the accessible part of the buffer is written out and the
   call fails with EFAULT.

 - for regular files on ffs and probably most things that use
   uiomove_ubc, the data in the accessible part of the buffer is
   written, the call fails with EFAULT, and the size of the file is
   reverted to what it was at the start.

 - nobody's tested sockets yet, I think.

 - in all cases the mtime is updated.

The size reversion does unwind the common case, but in other cases it
produces bizarre behavior; e.g. if you have a 1M file and you write 2M
to it and then fault, the 1M of the file is replaced with the first 1M
of what you wrote and the rest is discarded; plus, given that the call
failed, most users' first instinct would be to assume that nothing was
written.  The behavior is probably the same for the other errors,
though I haven't looked, and it's definitely possible that ENOSPC/EDQUOT
are handled more carefully.

Anyhow, if you've made it this far, the actual question is: is the
current behavior really what we want?  (Whether or not it's technically
correct, or happens to be consistent with the exact wording in the man
pages, various aspects of it seem undesirable.)

ISTM that for all these cases except EIO it's sufficient to return
success with a short count and let the user code retry with the rest of
its data.  For EIO I think it's best to do that and also retain the
error somewhere for the next write attempt.

-- 
David A. Holland
dholl...@netbsd.org