Date: Mon, 15 Feb 2021 23:18:33 +0100
From: Rhialto <rhia...@falu.nl>
Message-ID: <ycrzoz0yiiy9q...@falu.nl>

  | A system call with error can return with the carry set and the error and
  | short count returned in a separate registers. The carry bit is how
  | errors used to be indicated since at least V7 (even V6?) anyway.

Earlier than v6, this dates back to when much of the system was written
in assembly code (including many of the utilities).

The issue isn't how to return multiple values from the kernel, that's
easy, we even have standard sys calls (like pipe()) which do that
routinely. The problem is that the definition of write() (and most other
system calls) is that they don't affect errno unless there is an error,
and if there is an error, they return -1 (which leaves no place to return
a short count as well). This all actually happens in the libc stub.

We could, of course, invent new interfaces (a write variant with an extra
pointer to length written arg perhaps, or where the length arg is a
pointer to a size_t and that is read and then written with either the
amount written, or the amount not written).

But I don't believe that any of this is needed, or desirable. We should
first make sure that we do what POSIX requires, and simply return a short
write count (and no error) in the cases where that should happen (out of
space, over quota, exceeding file size limit, and writing any more would
block and O_NONBLOCK is set, more?).

In the other error cases we should simply leave things alone and accept
it - it is the way unix always has been, and we have survived.

If we have a drive returning I/O errors (on writes), do we really expect
that earlier data written will have been written correctly? Do you want
to rely upon that? It might have been possible once, when drives were
stupid, and simply wrote sectors in the order presented, but with modern
drives, with internal caches, which write the data in any order they
like, when they like, and do block remapping when a sector goes bad, I
wouldn't trust anything on the drive once it starts saying write failed.

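To make the short-count cases above concrete: with the existing
interface, a caller that cares just keeps calling write() on whatever is
left. If a call returns a short count, the condition that cut it short
(no space, quota, file size limit) would typically show up as -1 and
errno on the following call, once nothing more can be written. A rough
sketch only; write_all() is a made-up helper name for illustration, not
an existing or proposed interface, and retrying on EINTR is my own
assumption about what a caller would want:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a real libc function): call write() until
 * all len bytes have been accepted or a call fails.
 * Returns 0 on success, -1 on error with errno left as set by the
 * failing write().  *done receives the number of bytes accepted
 * before the loop stopped.
 */
static int
write_all(int fd, const void *buf, size_t len, size_t *done)
{
	const char *p = buf;
	size_t off = 0;

	while (off < len) {
		ssize_t n = write(fd, p + off, len - off);

		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted before writing: retry */
			*done = off;		/* error: report what got through */
			return -1;
		}
		off += (size_t)n;		/* short count: loop and try the rest */
	}
	*done = off;
	return 0;
}
```

Note that *done only says how much the kernel accepted, nothing about
what has reached stable storage.
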
Pretending that the first 8K of a 16KB write worked, and there was an I/O
error after that is folly. It may easily have been that the 2nd 8K block
was written, and the first one gave up in error, eventually. Some of the
data intended to be written may have been written, but we have no sane
way to work out what (again, entire new interfaces could allow the info
to be returned, but to what point? Who would ever write code to make use
of that info?)

It's even worse for the remaining cases, where the error is caused by
broken software (either a broken kernel doing insane things, or a broken
application asking to write data from memory it does not own, etc).
Nothing can be assumed reliable in cases like that.

So, let's all forget fanciful interface redesigns, fix whatever we need
to fix to make things work the way they are supposed to work (if there is
anything) and leave the rest as "the world just broke" type territory.

kre