Re: Using mmap(2) in sort(1) instead of temp files

2024-04-05 Thread David Holland
On Fri, Apr 05, 2024 at 06:48:05AM +, David Holland wrote:
 >- read or write errors on memory mapped files result in SIGSEGV,
 >  which is annoying to deal with and does actually turn up in the
 >  field sometimes (*);

Oops, looks like I forgot to populate the asterisk.

(*) https://gnats.netbsd.org/11904

-- 
David A. Holland
dholl...@netbsd.org


Re: Using mmap(2) in sort(1) instead of temp files

2024-04-05 Thread David Holland
On Wed, Apr 03, 2024 at 05:40:47PM +, Ice Cream wrote:
 > I'm trying to speed up sort(1) by using mmap(2) instead of temp
 > files.
 > 
 > ftmp() (see code below) is called in the sort functions to create and
 > return a temp file. mkstemp() is used to create the temp file, then
 > the file pointer (returned by fdopen) is returned to the sort functions
 > for use. I'm trying to understand where and how mmap should come
 > into the picture here, and how to implement this feature.

I expect the intent was to mmap the temporary files to avoid the
overhead incurred by stdio. (As opposed to just allocating memory,
which as Mouse points out carries problems.)

I'm not sure this is actually a good idea. Using raw file handles
instead of stdio FILEs might provide some speedup (depending on the
write patterns and how big the blocks written are) but it's never been
entirely clear that mmap is actually substantively faster than using
raw file handles. Meanwhile, there are several disadvantages:
   - as Mouse pointed out, you need to know the size in advance;
   - read or write errors on memory mapped files result in SIGSEGV,
 which is annoying to deal with and does actually turn up in the
 field sometimes (*);
   - even if you apply MADV_SEQUENTIAL with madvise(2) the mmap
 interface can't really do as good a job of prefetching;
   - on 32-bit platforms the size is limited.
 
FWIW.

 > PS: It was mentioned in the TODO file
 > > speed up sort(1) by using mmap(2) rather than temp files
 
I can't find this reference :-(

-- 
David A. Holland
dholl...@netbsd.org


Re: Perceivable time differences [was Re: PSA: Clock drift and pkgin]

2023-12-30 Thread David Holland
On Sun, Dec 31, 2023 at 02:54:50AM +0100, Johnny Billquist wrote:
 > Ok. I oversimplified.
 > 
 > If I remember right, the point was that something sub 200ms is perceived by
 > the brain as being "instananeous" response. It don't mean that one cannot
 > discern shorter times, just that from an action-reaction point of view,
 > anything below 200ms is "good enough".

The usual figure cited is 100 ms, not 200, but yeah.

it is instructive to look at the stopwatch function on a digital
watch; you can easily see the tenths counting but not the 100ths.

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal: Restore malloc(9) interface

2023-12-30 Thread David Holland
On Sat, Dec 30, 2023 at 10:44:52PM +, Taylor R Campbell wrote:
 > Note: I am NOT proposing any substantive changes to the implementation
 > of the allocator -- I'm just proposing that we go back to the old
 > _interface_, using the new pool-cache-based _implementation_, and to
 > add lightweight per-CPU, per-tag usage counting to the malloc and free
 > paths.

Can we just add tags to kmem(9)? At this point that seems like a path
of lesser resistance, and also it avoids having a standard function
name with a nonstandard interface.

(Also, let's make the tags be typed as pointers instead of an enum)

-- 
David A. Holland
dholl...@netbsd.org


Re: veriexec(4) maintenance

2023-08-04 Thread David Holland
On Thu, Aug 03, 2023 at 03:03:06PM +0930, Brett Lymn wrote:
 > > If anyone cares about veriexec(4) and would like to volunteer to take
 > > this on, we can discuss what needs to be done and how to proceed.
 > 
 > I will take this on since I was originally responsible for the mess.  It
 > does play fast and loose with synchronisation and should be fixed.
 > 
 > I have some fixes in my private tree that went some way to addressing
 > the issues but they have been languishing for many many years because I
 > had an issue of a deadlock that would occaisionally occur.  At the time
 > I couldn't work out where the deadlock was coming from but maybe with
 > some expert help and the improved tools I can nut it out.

I was looking at this last weekend and for veriexec I have one fix and
some comments/notes, which I should probably commit.

fileassoc I would recommend flushing and replacing with a proper
specificdata implementation for vnodes, which there should be no
particular barrier to. fileassoc is trying to be that but it's both
bust and doing it wrong. :-|

-- 
David A. Holland
dholl...@netbsd.org


Re: kcmp(2)

2023-07-20 Thread David Holland
On Tue, Jul 18, 2023 at 09:32:42PM -0700, Jason Thorpe wrote:
 > > Don't cloner instances differ in minor number? If not, shouldn't they?
 > 
 > Not that I?m aware of.  They result in a new file object with a new
 > private data pointer, but they don?t change the minor number and I
 > don?t see why forcing them to do so would be such a good idea.
 > What if you had a single driver (that consumes a major # slot) that
 > wanted to provide two cloning interfaces?  If each clone got its
 > own minor #, then you?d be artificially limiting how many could be
 > created.

Well, as noted in this thread, traditionally you can tell when two
files are the same by examining stat results. And the cloner mechanism
replaced an older scheme where you had to pick the number of instances
you wanted, and unless I'm misremembering badly in that world each had
to have its own minor number. So it's not an unreasonable proposition.

(And I don't see how two cloning interfaces per driver is different
from one; either way you have the same number of minor numbers
available.)

That said, it almost certainly isn't important...

-- 
David A. Holland
dholl...@netbsd.org


Re: kcmp(2)

2023-07-18 Thread David Holland
On Tue, Jul 18, 2023 at 03:25:02PM -0700, Jason Thorpe wrote:
 > That *might* work in this particular case, but it would not work
 > for e.g. a cloning device where you get additional descriptors via
 > dup() or whatever.

Don't cloner instances differ in minor number? If not, shouldn't they?

-- 
David A. Holland
dholl...@netbsd.org


Re: PROPOSAL: config_* with device_t references

2023-05-10 Thread David Holland
On Wed, May 10, 2023 at 01:08:27AM +, Taylor R Campbell wrote:
 > I propose to add new config_*_acquire/release functions:
 >  dev = config_found_acquire(...);
 > ...
 >  config_detach_release(dev);

Seems reasonable...

(note, haven't read the diff carefully yet)

-- 
David A. Holland
dholl...@netbsd.org


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-10 Thread David Holland
On Wed, May 10, 2023 at 10:11:31AM +, Taylor R Campbell wrote:
 > > > (In general, erroring in I/O is a whole additional can of worms; it's
 > > > wrong to not report back however much work happened before the error
 > > > when it can't be undone, but also impossible to both report work and
 > > > raise an error.  [...])
 > > 
 > > To the extent that it's impossible, it's impossible only because the
 > > APIs provided by the kernel to userland don't have any way to represent
 > > such a thing.  This would border on trivial to fix, except that it
 > > would be difficult to get much userland code to use the resulting APIs
 > > because of their lack of portability to other UNIX variants.

Since write(2) is one of the oldest interfaces in Unix the chances of
any change taking hold are vanishingly small...

 > write/writev/pwrite(2) could return the number of bytes it actually
 > did transfer before the fault, unless it is zero in which case it
 > could fail with EFAULT.

Yes; this is really the only correct option. The same goes for EIO,
and also other conditions that can arise in the middle of an I/O, like
EDQUOT, or given nfs, ECONNRESET. Ultimately if you actually do some
I/O the count reflecting what you did needs to be returned. Otherwise
any application response other than "give up entirely" can result in
duplicating or losing data.

Since there are at least some errors that aren't sticky (that is,
might not repeat if the application restarts the rest of the I/O, and
not repeating doesn't automatically mean that the condition cleared
and no longer matters), you really also want to not lose them. As I
recall the conclusion from past discussions is that the only really
suitable approach, messy though it is, is to stash the error in the
struct file (or somewhere) and then return it immediately on the next
operation. This applies at least to EIO and maybe others.

(For EFAULT, if it doesn't repeat it means the condition cleared, and
this is expected in the case of a generational G/C's write barriers
and similar usages. So just dropping it in the case of partial success
is ok.)

Note though that returning short counts without an error can also
violate expectations about short counts, interruption, and fast vs.
slow devices, but none of that is very well defined so it is almost
certainly a better compromise than returning an error after
successfully doing I/O.

The question that comes to my mind is whether it's better as an
interface for writes (and reads) to return errors along with a
uio_resid change, or whether we should distinguish failure of the
operation (stuff like ENXIO/ENODEV or possibly ESTALE) from failure of
the I/O, and return the latter separately in struct uio.

The latter seems kind of preferable to me (because it maintains the
expectation that if you get a failure return nothing happened) but
it's a big architectural change.

All of this is not _independent_ of fixing uiomove callers, because
they're where EIO gets generated and some of them have some
flexibility, and so we need a model for the error handling to update
them correctly, but it's largely orthogonal to the original problem of
incorrectly rolling back partial uiomoves. :-(

(Examples of flexibility: if you copy 32 bytes out of a uio into a
scratch buffer and then get EFAULT, you haven't actually done anything
yet, so you can leave uio_resid alone and let the enclosing write call
ultimately fail with EFAULT. But if you copied the same 32 bytes over
a buffer-cache buffer, for example, you can't undo that so you have to
report it. Conversely, if you read 32 bytes out of a tty you can't put
it back and have to report it, but if you read 32 bytes out of a
buffer cache buffer it'll still be there later and you haven't
actually done anything yet.)

 > Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not
 > sure it's right to fail after partial progress with _any_ error code,
 > except possibly EIO if metadata updates failed.

It's not. (and I don't think EIO is an exception)

 > I don't see clear guidance in POSIX on the subject.

shocking :-)

 > The attached test program successfully writes 65536 bytes to a pipe in
 > a single writev(2) call that fails and reports no progress, in a
 > NetBSD 10ish kernel:

That's definitely a bug...

-- 
David A. Holland
dholl...@netbsd.org


Re: PROPOSAL: Split uiomove into uiopeek, uioskip

2023-05-09 Thread David Holland
On Tue, May 09, 2023 at 09:21:38PM +, Taylor R Campbell wrote:
 > I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which
 > together are equivalent to successful uiomove(buf, n, uio).
 > 
 > This allows a driver to separately:
 > 1. transfer data into a buffer (uiopeek) first, and then
 > 2. advance the uio (uioskip) only after determining how much of the
 >data the driver was able to actually process.
 > 
 > Thoughts?

I also prefer the name "uioadvance", because that seems like it
characterizes the operation better.

Otherwise, seems fine.

 > (a) Is ttwrite violating a contract by advancing the uio's iovs past
 > the number of bytes it consumed of uio_resid?

Yes, IMO. That's supposed to be an invariant of the data structure,
and it shouldn't get broken, even on error.

 > P.S.  I suspect this issue will also apply in the EFAULT case.  If you
 >   try to uiomove (say) 2 bytes in separate iov entries, and the
 >   first iov doesn't fault but the second iov does, uiomove will
 >   advance the uio by one byte but return error.

That also seems totally wrong.

(In general, erroring in I/O is a whole additional can of worms; it's
wrong to not report back however much work happened before the error
when it can't be undone, but also impossible to both report work and
raise an error. And unfortunately in the presence of things like
generational garbage collectors, it's unsafe to assume that it doesn't
matter because EFAULT is fatal/unrecoverable.)

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-07 Thread David Holland
On Mon, May 08, 2023 at 12:36:21AM +, David Holland wrote:
 > I... completely failed to recognize it as an emoji and assumed it was
 > supposed to be -9 to forcibly shut something down.

emoji. smiley. yeah

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-07 Thread David Holland
On Mon, May 08, 2023 at 02:09:07AM +0200, Johnny Billquist wrote:
 > > > not nearly as
 > > > horrible as, say, the hack I once implemented where calling
 > > > wait4(0x456d756c,(int *)0x61746f72,0x4d616769,(struct rusage 
 > > > *)0x633a2d29)
 > >
 > > ^^
 > >   > would produce suitably magic effects.  (This was on a 32-bit machine.)
 > > 
 > > surely you mean 39!
 > 
 > Hmm?
 > 
 > "EmulatorMagic:-)" don't do it for you, you want "EmulatorMagic:=)" ?

I... completely failed to recognize it as an emoji and assumed it was
supposed to be -9 to forcibly shut something down.

:-|

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-07 Thread David Holland
On Fri, May 05, 2023 at 09:44:25PM -0400, Mouse wrote:
 > >>> But I kind of think it'd be preferable to make a way to clone a
 > >>> second independent struct file for the same socket than to start
 > >>> mucking with per-descriptor state.
 > >> [...] it should be easy to extend dup3() to make that possible [...]
 > > If we were to add such a thing it should be called something else,
 > > not dup, because it's a fundamentally different operation from dup
 > > and we don't need people confusing them.
 > 
 > Different?  Some.  But not very.  It _is_ closely related to dup().  I
 > don't think dup3() would be a horrible way to do it - 

There's already a dup3(). But ignoring that... yes, it's quite
significant, it constructs a materially different kind of additional
reference. For everyone who knows the internal structure of
references, which is probably everyone reading this list, it's easy to
understand what the semantics are and what the difference between the
two operations; for everyone else, which is most of the world and most
of the people who will ever use the call, it is _not_ clear and if
the two cases aren't made overtly distinct they'll routinely mix them
up and get in trouble. Even if they are made overtly distinct you'll
still get bozos making wrong proclamations about them in places like
stackoverflow.

*puts on his "taught this to undergrads for many years" hat*

 > not nearly as
 > horrible as, say, the hack I once implemented where calling
 > wait4(0x456d756c,(int *)0x61746f72,0x4d616769,(struct rusage *)0x633a2d29)
  ^^
 > would produce suitably magic effects.  (This was on a 32-bit machine.)

surely you mean 39!

 > But, honestly, when I saw the idea my reaction was to make it a new
 > operation to fcntl.  F_REOPEN, say, since it's creating new per-open
 > state.  Or, if you want to be all overloady, how about
 > open(0,O_REOPEN,existingfd)?  It _is_ creating new per-open state, so
 > open is in _some_ sense right.
 > 
 > My choice, for what it's worth, would be fcntl, dup3 second, with
 > O_REOPEN a distant third.

fcntl is inherently terrible and shouldn't get new ops or use cases
(there is absolutely no reason to prefer it to a new syscall except
for conservation of syscall numbers, which isn't a useful thing).

Should be something like reopen(fd, flags).

Or it could be open("/dev/fd/%d", flags); except that neither the
pathname nor the semantics of /dev/fd are reliably portable.


(...passing a fd as a 3rd argument to open is really horrible; that's
a mode_t, not a fd)

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-07 Thread David Holland
On Thu, May 04, 2023 at 08:40:49AM -0400, Mouse wrote:
 > >> I can't think of any at all; to begin with it's limited to forks
 > >> that don't exec
 > [...]
 > 
 > Well, except for libraries that open fds internally, without exposing
 > them to the calling code.  Depending on the user case, they may want
 > them closed if the application forks.
 >
 > [...]
 > 
 > Because one wants to exec a child process, maybe?

If the fork is going to exec, O_CLOEXEC is sufficient; I can't think
of any realistic situation where having a second reference to the same
file for at most few extra milliseconds can actually hurt anything.

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-03 Thread David Holland
On Sun, Apr 30, 2023 at 10:50:51PM +0700, Robert Elz wrote:
 > Date:Sun, 30 Apr 2023 05:25:41 +
 > From:    David Holland 
 > Message-ID:  
 > 
 >   | Close-on-fork is apparently either coming or already here, not sure
 >   | which, but it's also per-descriptor.
 > 
 > We don't have it, but it will be in Posix-8.   Largely inspired by the
 > needs of threaded programs (without lots of critical sections, one cannot
 > otherwise open anything if another thread might fork, there's no
 > way to avoid race conditions, hence O_CLOFORK on open ... not sure if
 > anyone has thought of a way to add it to socket() - that doesn't look
 > to be trivial, though it might be possible to abuse one of the params
 > it has - probably domain - and add flags in upper bits ... while having
 > it able to be set/reset via fcntl is useful, to work, it needs to be
 > able to be set atomically with the operation that creates the fd, and
 > having it default "on", which would work, would break almost all existing
 > non-trivial code).

Given that the trend is to just add extra calls with flags arguments
without even pausing to consider, I'd be surprised if nobody's banged
out a socket4() yet.

 >   | But I kind of think it'd be preferable to make a way to
 >   | clone a second independent struct file for the same socket than to
 >   | start mucking with per-descriptor state.
 > 
 > When I saw mouse's message, I was thinking the exact same thing,
 > and it should be easy to extend dup3() to make that possible

If we were to add such a thing it should be called something else, not
dup, because it's a fundamentally different operation from dup and we
don't need people confusing them.

 > - however
 > I'm not sure what effects that might have on the semantics of sockets
 > (what assumptions the current code might make about there being only
 > one struct file, with all it contains, for a socket).

Indeed :-|

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-05-03 Thread David Holland
On Sun, Apr 30, 2023 at 09:44:49AM -0400, Mouse wrote:
 > > Close-on-fork is apparently either coming or already here, not sure
 > > which, but it's also per-descriptor.
 > 
 > I should probably add that here, then, though use cases will likely be
 > rare.  I can think of only one program I wrote where it'd be useful; I
 > created a "close these fds post-fork" data structure internally.

I can't think of any at all; to begin with it's limited to forks that
don't exec, and unless just using it for convenience as you're
probably suggesting, it only applies when also using threads, and if
one's using threads why is one also using forks? So it seems like it's
limited to badly designed libraries that want to fork behind the
caller's back instead of setting up their forks at initialization
time. Or something.

 > > The thing is, per-descriptor state is a mess and it shouldn't be
 > > allowed to proliferate.  The reason: the descriptor is an array
 > > index.  There's plenty of room to store stuff in the object it looks
 > > up (that is, struct file) but storing stuff on the key side of the
 > > array is a problem.
 > 
 > (References to -current here really mean "filedesc.h 1.70 according to
 > cvsweb.netbsd.org".)
 >
 > [...]

I had forgotten how much crap we'd added on behalf of the silly broken
close scheme. So all those costs have been paid already. :-|

 > > (Then there's also another issue, which is that in nearly all cases
 > > nonblocking I/O is a workaround for interface bugs, e.g. nonblocking
 > > accept or open of dialout modem devices, or for structural problems
 > > in software that also needs to use the same file handle, like your
 > > original problem with curses. In the long run it's probably better to
 > > kill off the reasons to want to use nonblocking I/O at all.)
 > 
 > And replace nbio with...what?  Multiple threads doing blocking calls?
 > Or do you think everything should be nonblocking by default, with
 > blocking behaviour implemented userland-side?  Or am I completely
 > misinterpreting somehow?

select?

-- 
David A. Holland
dholl...@netbsd.org


Re: Per-descriptor state

2023-04-29 Thread David Holland
On Sun, Apr 30, 2023 at 12:49:03AM -0400, Mouse wrote:
 > This is pushing towards making it per-descriptor state.  At present,
 > the versions I have don't have anything but close-on-exec as true
 > per-descriptor state.  A quick look at 9.1 and cvsweb.netbsd.org
 > (which, mirabilu visu, actually works for me, unlike www.netbsd.org and
 > mail-index.netbsd.org) sys/sys/filedesc.h makes me think that that's
 > true of them as well.

Close-on-fork is apparently either coming or already here, not sure
which, but it's also per-descriptor.

The thing is, per-descriptor state is a mess and it shouldn't be
allowed to proliferate. The reason: the descriptor is an array index.
There's plenty of room to store stuff in the object it looks up (that
is, struct file) but storing stuff on the key side of the array is a
problem. For a couple bits you can mask them off from the pointer
(though even that's abusive); more than that and suddenly you need to
double the size of the fdtable so it contains an extra machine word
for random state bits as well as the pointer to the struct file.
Granted, it's not like this is going to make any machine (that can run
Unix at all) tip over from KVA exhaustion, but it's untidy, messy,
wasteful, and generally offensive to cleanliness.

In general I think there is very little reason to have per-descriptor
(rather than per-open) state and the only really valid use cases are
those that pertain specifically to individual descriptor numbers,
which in turn is pretty much limited to close behavior like
close-on-exec.

Granted, that's for files, where if you want two independent opens of
the same file it's mostly not difficult to get them. For sockets that
isn't a thing. But I kind of think it'd be preferable to make a way to
clone a second independent struct file for the same socket than to
start mucking with per-descriptor state.

(Then there's also another issue, which is that in nearly all cases
nonblocking I/O is a workaround for interface bugs, e.g. nonblocking
accept or open of dialout modem devices, or for structural problems in
software that also needs to use the same file handle, like your
original problem with curses. In the long run it's probably better to
kill off the reasons to want to use nonblocking I/O at all.)


(also, "mirabile visu")
-- 
David A. Holland
dholl...@netbsd.org


Re: flock(2): locking against itself?

2023-03-19 Thread David Holland
On Sun, Mar 19, 2023 at 01:49:54PM +0700, Robert Elz wrote:
 >   | Except they aren't.  They're on open file table entries, something
 >   | remarkably difficult to describe in a way that doesn't just refer to
 >   | the kernel-internal mechanism behind it
 > 
 > Yes.  The terminology in this area really sucks, but that's why
 > I mentioned 'kernel file*' in my message.   POSIX distinguishes
 > file descriptors and file descriptions, but you have to be reading
 > very carefully to even notice the difference - ok for a standards
 > doc perhaps, not for a man page or e-mail message.
 > 
 > Given the lack of well understood terminology, it is not easy to
 > do better.  That, I assume, is what led to that paragraph in the
 > NOTES section - an attempt to explain better just where the locks
 > fit, without getting into kernel internals (the access model the
 > kernel provides, that is, fd, file*, vnode, data, really needs to
 > be understood in order to do any non-trivial file related
 > operations).

"They're per-open"

...which is not actually difficult to understand since it's the same
as the seek pointer behavior; that is, seek pointers are per-open too.

-- 
David A. Holland
dholl...@netbsd.org


Re: crash in timerfd building pandoc / ghc94 related

2023-02-07 Thread David Holland
On Mon, Feb 06, 2023 at 09:19:26PM -0500, Mouse wrote:
 > Perhaps the simplest is
 > 
 > dd if=/dev/urandom bs=65536 of=/dev/mem
 > 
 > but there are others.
 > 
 > Yet I can't help feeling that there is some sense in which it *is* fair
 > to say that userland should never be able to crash the kernel.  I have
 > been mulling over this paradox for some time but have not come up with
 > an alternative phrasing that avoids the reasonable crashes while still
 > capturing a significant fraction of the useful meaning.

Arguably, /dev/mem just shouldn't exist... I'm not sure there are
actually reasonable crashes.

-- 
David A. Holland
dholl...@netbsd.org


Re: ATA TRIM?

2022-12-25 Thread David Holland
On Sun, Dec 25, 2022 at 10:27:49AM -0500, Mouse wrote:
 > >> According to that PDF, dholland is wrong.
 > > I fail to see a behaviour that would be allowed due to dholland@'s
 > > definition, but not according to the one you cited, nor the other way
 > > round.
 > 
 > A read returning the pre-TRIM contents.  Two of the options
 > specifically state "independent of the previously written value"; the
 > third is simply zero, which is also independent of the previously
 > written value.  dholland wrote
 > 
 > > The state of the data after TRIM is unspecified; you might read the
 > > old data, you might read zeros or ones, you might (I think) even read
 > > something else.
 > 
 > and, as I read that PDF, "you might read the old data" is specifically
 > disallowed.

I believe the drive is allowed to discard the request, in which case
you'll read the old data. I expect at that point the block is not
"trimmed" but ... that's a detail.

I know at least some drives will ignore single-sector trims.

-- 
David A. Holland
dholl...@netbsd.org


Re: ATA TRIM?

2022-12-12 Thread David Holland
On Mon, Dec 12, 2022 at 11:53:56PM +1100, matthew green wrote:
 > maybe port that tool back, it's also supposed to match the
 > linux command of the same name.  it's not in netbsd-9, but
 > last i tried, the interfaces the -current tool uses are
 > available in -9 kernels.

The trim/discard plumbing appeared in -7.

-- 
David A. Holland
dholl...@netbsd.org


Re: ATA TRIM?

2022-12-08 Thread David Holland
On Thu, Dec 08, 2022 at 11:44:59PM -0500, Mouse wrote:
 > Since the data on the device is still there afterwards, I don't think
 > [...]

The state of the data after TRIM is unspecified; you might read the
old data, you might read zeros or ones, you might (I think) even read
something else.

What you read, though, also doesn't indicate whether the device thinks
the blocks are available; that is, as far as I know it's legal for the
device to leave the mapping of those logical blocks to physical blocks
alone but mark the physical blocks for later reuse.

So it's very difficult to tell if it actually did anything.

The code in -9 does work, though, as far as we know.

oh, I also remember having to switch some machine's bios to cause the
controller to appear as ahcisata and not something else older. Don't
remember what the symptoms of not doing that were, though, and that
was like 10 years ago.

-- 
David A. Holland
dholl...@netbsd.org


Re: #pragma once

2022-10-15 Thread David Holland
On Sun, Oct 16, 2022 at 06:31:55AM +1100, matthew green wrote:
 > note that it is already used in libpcap, bind, libuv, and as a
 > test for xlint, in our tree, for consumed components.

hmm?

% grep 'pragma.*once' /usr/include/**.h
% 

also, while it does appear (once) inside libuv, that's within a
_MSC_VER conditional.

-- 
David A. Holland
dholl...@netbsd.org


Re: #pragma once

2022-10-15 Thread David Holland
On Sat, Oct 15, 2022 at 07:21:35PM +, Taylor R Campbell wrote:
 > [bcc tech-userlevel tech-toolchain, followups on tech-kern]
 > 
 > Traditionally to avoid problems with repeated inclusion of a header
 > file, you put #include guards around it, say in sys/dev/foo.h:
 > 
 > #ifndef  _SYS_DEV_FOO_H_
 > #define  _SYS_DEV_FOO_H_
 > 
 > ...
 > 
 > #endif   /* _SYS_DEV_FOO_H_ */
 > 
 > With newer compilers this can be replaced by a single line in the
 > header file:
 > 
 > #pragma once
 > 
 > It's nonstandard, but using  #pragma once  is maybe a bit less
 > error-prone -- don't have to have to pollute the namespace with
 > have-I-been-included macros, and I've made mistakes with copying &
 > pasting the per-file have-I-been-included macro into the wrong file.

I don't see any benefit given that it's maybe 5-10 minutes with sed to
write a script to detect wrong include guards.

Furthermore since we still don't have a clear separation between
internal and user-facing kernel headers, there's no significant place
we can safely/correctly use the pragma that will help much.

-- 
David A. Holland
dholl...@netbsd.org


Re: Open master pty (/dev/ptmx) non blocking

2022-09-27 Thread David Holland
On Fri, Sep 23, 2022 at 05:34:26PM -0400, David H. Gutteridge wrote:
 > It's not just O_NONBLOCK that can be expected/desired, vte wants to set
 > O_CLOEXEC as well. The Linux kernel accepts and applies both of those
 > flags in a posix_openpt(3) call.

There should be no path through open(2) where O_CLOEXEC doesn't work!
It is not a property of the underlying object. Or absolutely shouldn't
be, anyway.

-- 
David A. Holland
dholl...@netbsd.org


Re: fallocate for FFS

2022-09-27 Thread David Holland
On Mon, Sep 26, 2022 at 11:53:46PM +, Emmanuel Dreyfus wrote:
 > > > I will try to figure it out, since its not yet implemented the syscall 
 > > > is a
 > > > good way to start dev in BSD kernel.
 > > 
 > > I'm not sure about that; many people have started looking at it and
 > > not got anywhere.
 > 
 > It is true that adding a system call is an easy entry point to
 > learn about the kernel. But here the syscall is the easy part, the
 > real work is modifying FFS code to suport it, and that is a steep
 > learning curve.

The system call is already there, also.

-- 
David A. Holland
dholl...@netbsd.org


Re: fallocate for FFS

2022-09-26 Thread David Holland
On Mon, Sep 26, 2022 at 06:15:23PM +0200, Patryk wrote:
 > I will try to figure it out, since its not yet implemented the syscall is a
 > good way to start dev in BSD kernel.

I'm not sure about that; many people have started looking at it and
not got anywhere.

-- 
David A. Holland
dholl...@netbsd.org


Re: Open master pty (/dev/ptmx) non blocking

2022-09-23 Thread David Holland
On Fri, Sep 23, 2022 at 01:39:16PM +0200, Martin Husemann wrote:
 > > Then, shouldn't the open(2) (and posix_openpt(3)) at least fail with
 > > EINVAL or something if other flags are specified?
 > 
 > The man page says:
 > 
 >  Note that unlike implementations on some other operating systems,
 >  posix_openpt() does not return EINVAL if the value of oflag would be
 >  deemed invalid, instead it is simply ignored.  This means it is not
 >  possible to dynamically test which open(2) flags are possible to set, 
 > and
 >  apply a fallback if EINVAL is received.

That is, however, kind of a feeble excuse :-)

While my inclination would be to make it work, until someone wants to
figure out how to do that it seems straightforward to make O_NONBLOCK
fail:

Index: tty_ptm.c
===
RCS file: /cvsroot/src/sys/kern/tty_ptm.c,v
retrieving revision 1.43
diff -u -p -r1.43 tty_ptm.c
--- tty_ptm.c   29 Jun 2021 22:40:53 -  1.43
+++ tty_ptm.c   23 Sep 2022 20:12:07 -
@@ -338,6 +338,10 @@ ptmopen(dev_t dev, int flag, int mode, s
dev_t ttydev;
struct mount *mp;
 
+   if (flag & O_NONBLOCK) {
+   return EINVAL;
+   }
+
switch(minor(dev)) {
case 0: /* /dev/ptmx */
case 2: /* /emul/linux/dev/ptmx */


-- 
David A. Holland
dholl...@netbsd.org


Re: Can version bump up to 9.99.100?

2022-09-16 Thread David Holland
On Fri, Sep 16, 2022 at 07:00:23PM +0700, Robert Elz wrote:
 > That is, except for in pkgsrc, which is why I still
 > have a (very mild) concern about that one - it actually compares the
 > version numbers using its (until it gets changed) "Dewey" comparison
 > routines, and for those, 9.99.100 is uncharted territory.

No, it's not, pkgsrc-Dewey is well defined on arbitrarily large
numbers. In fact, that's in some sense the whole point of it relative
to using fixed-width fields.

The only problem that might arise is that someone might have used a
glob pattern of the form NetBSD-9.99.[7-9]* that will unexpectedly
stop working. These would appear because make doesn't know how to do
pkgsrc-Dewey computations internally. While it's possible that some of
these may exist, it's unlikely that there are many of them or that
they appear anywhere especially important.

(Patterns of the form NetBSD-[6-9]* do appear and people have been
hunting those down lately.)

-- 
David A. Holland
dholl...@netbsd.org


Re: Module autounload proposal: opt-in, not opt-out

2022-08-08 Thread David Holland
On Mon, Aug 08, 2022 at 07:18:05AM -0700, Paul Goyette wrote:
 > (personal note)
 > It really seems to me that the current module sub-systems is at
 > best a second-class capability.  I often get the feeling that
 > others don't really care about modules, until it's the only way
 > to provide something else (dtrace).  This proposal feels like
 > another nail in the modular coffin.  Rather than disabling (part
 > of) the module feature, we should find ways to improve testing
 > the feature.

I think there's a general recognition that it's a useful feature to
have, even though it has real costs (like LOCKDEBUG being so slow);
but on the other hand we're still dealing with the consequences of the
... missteps ... in the initial rollout, even though it was more than
a decade ago.

I can't even remember at this point what half of the technical
showstoppers that we were supposed to swallow were, and I know you've
fixed a lot of them, but there's still at least two(*) big ones left.
Meanwhile the absolute refusal of certain people to listen to concerns
or consider any plans but their own creates a lingering ... lack of
appetite for working on the topic, even though I think all or nearly
all those people have left the project and the circumstances have
changed considerably. This is certainly true for me, and I don't think
I'm the only one. So, mostly, nothing happens.

Note that despite all the calls to remove major pieces of
functionality in the intervening years, I don't think anyone's ever
proposed removing the module system.


(*) I'm thinking of the build scheme and resulting configuration
management problems, and lack of logging/audited of what gets loaded.
[One might think the latter would be simple but it isn't.]

-- 
David A. Holland
dholl...@netbsd.org


Re: Language-neutral interface specifications (research)

2022-07-13 Thread David Holland
On Tue, Jul 12, 2022 at 09:57:20AM -0400, Mouse wrote:
 > > a. Define the source-level API in a way that is detached from the
 > >C language.
 > 
 > This can be done, but only by attaching it to some other language
 > instead.  It doesn't even make sense to talk about an API without it
 > being an API for some specific language; an API is an Application
 > _Programming_ Interface, and you can't program without programming in
 > some language.

Eh. That's true, but only in a degenerate sense. Functions and data
types exist beyond their source language; that's why binary
compatibility is a thing and why foreign-function interfaces are a
thing. It makes sense to talk about functions and data types outside
of the context of any specific programming language; obviously
whatever you concoct to talk about them in is also a language, but
only in a degenerate sense.

What's the language of /usr/include/rpcsvc/yp.x? Not C. I'm sure it's
possible to generate Ada code from it, and there's a reasonable chance
someone might even even written that tool.

 > > c. Define data flow better than C allows: The struct stat*
 > >argument to stat() is only meant to move data out of the
 > >function.  The pathname pointer isn't captured by the function
 > >call.
 > 
 > Doesn't the restrict qualifier get you at least part of this?

Only a little. It tells you that the pathname doesn't(*) overlap the
stat buffer in memory.

Data direction isn't something you can express in C (though most IDLs
support it) and capture of references isn't at all.


(*) The definition of restrict in C99 is nerfed such that it actually
has extremely limited utility. I forget the details though, and I
don't know if it's been fixed in C11 or C17.

 > It sounds to me as though you want an ABI description language, not an
 > API description language.  A C - or C++, or any other language except
 > Ada - API is of minimal value for you, except (a) to the extent that an
 > ABI can be inferred from it or (b) to the extent that you are
 > generating code in that "other language".

What one wants is a description that can be used to provide compatible
API declarations for many languages, such that after passing through
their respective compilers the resulting ABIs are the same or at least
interoperate correctly.

 > Indeed, one could argue that an ABI is, essentially, an API for the
 > machine-code level.

Except for machine-level interfaces (e.g. calling convention specs),
which we also refer to as ABIs, ABIs are what you get when you run
APIs through a compiler.

Technically speaking though they're still not user-facing. If you
wanted to code to one of these interfaces at the machine level you'd
want assembly-level API definitions. That's what genassym does.

-- 
David A. Holland
dholl...@netbsd.org


Re: Language-neutral interface specifications (research)

2022-07-13 Thread David Holland
On Mon, Jul 11, 2022 at 08:45:09PM -0700, Jakob Stoklund Olesen wrote:
 > I saw this project on the wiki, and it reminds me of a problem I have
 > been trying to understand better:
 > https://wiki.netbsd.org/projects/project/language-neutral-interfaces/
 >
 > [...] This means I have to think about the difference between API
 > and ABI on POSIX platforms.

Indeed.

 > As I see it, there are three levels of interface definition to consider:

So, unlike Mouse I'd agree that there are three levels and it's useful
to think about three levels, but I think I'd divide them somewhat
differently.

There's the standards level (your level 1) - some structure exists and
has at least certain members, but not in any fixed order, there might
be other fields, the values of manifest constants are not specified,
etc.

There's the C implementation level (between that and your level 2)
where these details are all filled in; e.g. in NetBSD struct stat has
a specific list of members in a specific order, and they have specific
C data types even though the concrete representation of those types
may still vary.

Then there's the ABI level, which is what you get when you run the
implementation level through a compiler.

"API" generally means the standards level, because it's about what
applications (or in general, client software) can assume at the source
level. But there may be multiple standards-level manifestations of
notionally the same interface (e.g. C99  and POSIX 
are not the same; the latter has more stuff but also nails down some
things that are not specified in C99) not to mention multiple versions
(e.g. snprintf isn't in C89) so in most cases you need to be specific
about exactly what you're talking about.

However, ABIs only arise from compiling specific implementations of
APIs. In principle you could construct an abstract implementation and
compile it in a way that makes no binary-level assumptions that aren't
in the specification-level API contract, but nobody does that. (Partly
because it's hard, also doesn't in general seem useful.)

That's a digression though.

I think your level 2 and "C implementation level" are possibly trying
to describe the same thing, though.

That is, these are the declarations you need to grind to figure out
what the machine-level ABI is and therefore how to interact with
functions or data structures defined by the C API.

In general, if nobody's done anything else, you can get these by
parsing the C header files, and you can only get them by parsing the C
header files. This is why the typical interpreter's foreign function
interface involves glue modules written in C; they can include
whatever C headers and export functions and values in terms of the
interpreter's internal datatypes. If you've ever looked in Python or
Perl bindings for external libraries you've seen this.

If you want to do more than this, that is, assuming your language can
represent C types and functions adequately well to interact with them
directly, generally you need to parse the C headers and digest them
into your own language's form. Doing this on the fly when compiling is
a major headache (even if you're willing to fork an external cpp, you
still need a full C parser) and if you do it once when you build the
compiler and cache the results, those results can go out of date and
it's hard to know when they need to be regenerated and hard to make
sure end-user installs actually do regenerate them when needed.

The goal of the language-neutral interfaces project is to at least
eliminate the cpp and C parser aspect of this by providing definitions
in something easily parsed, and providing some system-management
guarantees about what you see being up to date.

The way the project's been envisioned is that the language-neutral
interface definitions would become the canonical description (that is,
the source) and the C headers would be generated from them.

Reading the C headers to generate the language-neutral interfaces is
also possible, but a lot more work to implement, since you need to
implement the aforementioned C parser. The only advantage of doing
things that way is that it becomes easier to merge and/or still useful
even when not merged -- for your specific use case, since you want
something to support your compiler, you might want it this way so you
can use it everywhere and not just on NetBSD. Or if the generator tool
ends up being something that doesn't belong in NetBSD base, and
therefore can't be merged, this is a way to make it still useful.

 > Ada actually has a standardized foreign function interface that can be
 > used to interface with C and Fortran. The problem is that it interfaces
 > to the Pure-C level, not the API level. I can't use it to call stat() in
 > a portable way.

I would expect that you could define Ada-level constants and an
Ada-level struct stat, and use the FFI to call a glue function that
includes both sets of definitions and translates between them (as
Python modules do) but 

Re: killed: out of swap

2022-06-14 Thread David Holland
On Tue, Jun 14, 2022 at 07:59:33PM +0200, Johnny Billquist wrote:
 > > What might be interesting is a way to influence the order in which
 > > processes are chosen to kill...
 > 
 > I don't see any realistic way of doing anything with that.
 > It's basically the first process that tries to allocate another page when
 > there are no more. There are no other processes at that moment in time that
 > have the problem, so why should any of them be considered?

There are certainly ways. The history of the linux oom-killer makes
for pretty good reading, actually.

You want to find the process that's actually causing the problem, not
some random process that happens to be the one to step in the hole.

NetBSD doesn't do this very well; usually in my experience an OOM
situation results in syslogd getting killed, which is both unhelpful
in terms of freeing up resources and actively bad for other reasons.
It's also not uncommon for the ax to fall on the X server, which does
usually end up releasing resources but is pretty much never what the
user wants.

-- 
David A. Holland
dholl...@netbsd.org


Re: Slightly off topic, question about git

2022-06-12 Thread David Holland
On Mon, Jun 06, 2022 at 02:40:29PM +0200, Gerhard Sittig wrote:
 > There is no problem with that I assume. From personal experience
 > I can tell that git takes some getting used to. But once you do
 > you don't want to go back. Seriously.

Every time I have to use the damn thing, I want to go back to hg.

(SCNR)

More seriously,

On Wed, Jun 08, 2022 at 06:45:07AM -0400, Mouse wrote:
 > > That git thinks of the whole content of the tree, and that a filter
 > > is applied to narrow the result set when you specify dirs or files,
 > > was mentioned before.  Changes your perspective.
 > 
 > Also breaks git for certain uses, though.  Much as I like git, there
 > are places where I'd like to use it but effectively can't because of
 > its insistence on owning an entire directory per repo as work tree.
 > This makes it impossible to, for example, keep one repo for ~/.cshrc, a
 > different repo for ~/.gdbinit, and a third for ~/.procmailrc, without
 > creating a directory somewhere for each one and playing games with
 > links.  But that's very different from its designed-for use case, so
 > it's not too surprising.

One of the things Someone(TM) should write is a modern replacement for
RCS.

Versioning single files is actually useful and RCS is... dated.

-- 
David A. Holland
dholl...@netbsd.org


Re: Nested functions [was Re: valgrind]

2022-03-25 Thread David Holland
On Thu, Mar 24, 2022 at 11:16:58PM -0400, Mouse wrote:
 > > The conclusion over the past ~25 years has been that there isn't;
 > > qsort and things like it work "well enough" and real uses for
 > > closures that really motivate the feature come up rarely enough that
 > > it doesn't happen.
 > 
 > Nested functions are not closures, or at least not what I know as
 > closures.  A nested function pointer (conceptually) goes invalid as
 > soon as anything it refers to goes out of scope, or at the latest as
 > soon as its smallest enclosing block exits (or possibly when its
 > smallest enclosing function returns).  The thing I know as a closure
 > would preserve the referred-to objects as long as the closure is
 > potentially callable.  This requires more reference tracking than C
 > typically makes possible.

There's a term for scope-restricted closures (that become invalid this
way) that I'm currently forgetting. They're still closures in most of
the important ways, and it's a useful concept for a language where
copying memory isn't free.

 > > There is no solution based on trampolines that'll pass security (or
 > > at least security-theatre) muster.  Unless maybe by doing something
 > > that's horrifying in other ways.
 > 
 > The safest alternative that comes to my mind is to have two stacks, one
 > for trampolines and one for everything else.  But that requires
 > something much like two stack pointers, including assist from the
 > setjmp/longjmp implementation and, if applicable, threads.

That's not s3kure! You can still get arbitrary code execution by
scribbling in it.

 > > (For example: you could declare a static limit on how many instances
 > > of the closure you'll ever produce, make a global array to stuff the
 > > data pointer in, and statically generate N trampoline entry points
 > > that read from that array and call the primary function.  But there
 > > are many other ways in which this is horrible.)
 > 
 > But there are use cases for which it is not a stupid implementation;
 > [...]

I think to keep it safe you need more code analysis than you're likely
to get with C code. But IDK.

-- 
David A. Holland
dholl...@netbsd.org


Re: Nested functions [was Re: valgrind]

2022-03-24 Thread David Holland
On Tue, Mar 22, 2022 at 09:16:36PM -0400, Mouse wrote:
 > Indeed, you can have different sizes for pointers to different object
 > types, too.  I _think_ pointers to different function types can have
 > different sizes, but I'm less certain of that.  (There would be little
 > point, since all function pointer types have to have the same
 > information content; see below.)

I don't think they can, but I wouldn't swear to it and, as you note,
it's a moot point.

 > > However there are, I fear, too many programs that somewhere convert a
 > > function pointer to (void *) and at that point things break.
 > 
 > That is a bug.  It always has been.  Such code is broken and it
 > deserves to be rednered _obviously_ broken so it can get fixed.

...and it already doesn't work on some platforms. I don't think
there's actually that much of this and finding it is a fairly
straightforward static analysis problem.

That said, moving to fat function pointers on machines that don't
already use them is a real ABI change and therefore a big deal; but it
could be done if there were a compelling argument to justify going
through all the associated dark rituals.

The conclusion over the past ~25 years has been that there isn't;
qsort and things like it work "well enough" and real uses for closures
that really motivate the feature come up rarely enough that it doesn't
happen. And that's why we are where we are.

(More than ~25 years ago people probably cared, or purported to care,
about an extra 4 bytes per function pointer enough to consider it not
worthwhile on those grounds.)

There is no solution based on trampolines that'll pass security (or at
least security-theatre) muster. Unless maybe by doing something that's
horrifying in other ways.

(For example: you could declare a static limit on how many instances
of the closure you'll ever produce, make a global array to stuff the
data pointer in, and statically generate N trampoline entry points
that read from that array and call the primary function. But there are
many other ways in which this is horrible.)

-- 
David A. Holland
dholl...@netbsd.org


Re: about langage neutral

2022-03-14 Thread David Holland
On Thu, Mar 10, 2022 at 11:10:13AM +0100, Jos? Bollo wrote:
 > What I have in mind is that there is no need for a specific IDL: the
 > language itself is the IDL and the code generator.
 > 
 > I can add that SCHEME interpreters are small piece of code (that you
 > don't have to change, pick it and use it). Compilers also exist.

To some extent the whole point of an IDL is being declarative and not
Turing-complete, so you can read the declarations and readily
translate them to a wide variety of languages. Writing a script that
spits out interface definitions in C (we already have several of
these) doesn't help produce interface definitions for other languages.

Just writing the script in Scheme instead of sh/awk or Python or
whatnot doesn't help much. Plus, we don't have a Scheme interpreter in
base.

If what you mean instead is to invent a new IDL and use S-expressions
as the input syntax to avoid needing a real parser, you're pulling on
the wrong end of the problem. Parsers are cheap; the semantics of the
IDL are the important part.

-- 
David A. Holland
dholl...@netbsd.org


Re: membar_enter semantics

2022-03-14 Thread David Holland
On Mon, Feb 14, 2022 at 08:11:57AM +, Taylor R Campbell wrote:
 > > On Fri, Feb 11, 2022 at 01:33:04PM +, Taylor R Campbell wrote:
 > >  > membar_enter is currently documented to be a store-before-load/store
 > >  > barrier: https://man.netbsd.org/NetBSD-9.2/membar_ops.3
 > >  > 
 > >  >  membar_enter()
 > >  >Any store preceding membar_enter() will happen before all 
 > > memory
 > >  >operations following it.
 > >  > 
 > >  > [...]
 > > 
 > > This problem (and others like it) are primarily caused by using silly
 > > names like "enter" and "exit" that don't mean anything. (That is: you
 > > could redefine enter to be any-before-any and exit to be
 > > read-before-read and it still wouldn't be obviously wrong to most
 > > people.)
 > 
 > The problem at hand is that one word in the documentation doesn't
 > reflect
 > 
 > - 15 years of implementation on x86, powerpc, and sparc,
 > - 15 years of usage, or
 > - sensible semantics for real-world use cases.
 > 
 > All the other NetBSD membars meet this -- as would the proposed
 > one-word documentation change.

My point was somewhat different: we ended up in this hole with the
docs having been wrong for who knows how long precisely because the
terminology is rubbish. If the name "membar_enter" meant anything
coherent that anyone could remember, this sentence would have been
obviously wrong.

I'm not arguing that we shouldn't fix this. I'm arguing that we should
deprecate the names "member_enter" and "membar_exit" because they're
failed terminology, and this issue is evidence of that.

 > > In general, things should be named by what they do, not by the
 > > context in which you're supposed to use them according to some
 > > cookbook thinking-averse approach.
 > 
 > Experience working in an environment where all possible membars are
 > available and named according to {load,store}-before-{load,store}
 > combinations tells me that it doesn't actually improve anything.
 > 
 > Instead, it just encourages people to overthink bad choices locally
 > without thinking about or documenting the two-sided _protocol_ between
 > threads, and then leave nonsensical barriers around because they
 > locally sounded good.

Perhaps so. But that beats people jumping to conclusions and using the
wrong things because they don't understand what they do and think a
semantically null name is telling them something useful.

Also, I'd say it is not about protocols so much as about data
structure invariants. The use of matching pairs is only meaningful in
specific circumstances where there's one writer and one reader, and
while one writer may be a reasonable assumption, IME one reader
often isn't.

And when you're thinking about it in terms of invariants, the exact
operations you're ordering to satisfy the invariant become much easier
to follow.

At a minimum we should deprecate the meaningless names "enter" and
"exit" and replace them with "acquire" and "release", since while the
latter names also don't inherently mean anything they've also got C99
and some years of research literature backing them up.

Meanwhile "producer" and "consumer" are equally meaningless and
equally failed terminology and the best available names are
"write_write" and "read_read". Unless it's the other way around.

-- 
David A. Holland
dholl...@netbsd.org


Re: membar_enter semantics

2022-03-14 Thread David Holland
(sorry, lost track of this)

On Mon, Feb 14, 2022 at 05:12:13AM +0100, Joerg Sonnenberger wrote:
 > Am Mon, Feb 14, 2022 at 02:45:46AM + schrieb David Holland:
 > > On Mon, Feb 14, 2022 at 03:12:29AM +0100, Joerg Sonnenberger wrote:
 > >  > Am Mon, Feb 14, 2022 at 02:01:13AM +0000 schrieb David Holland:
 > >  > > In this case I would argue that the names should be membar_load_any()
 > >  > > and membar_any_store().
 > >  > 
 > >  > Kind of like with the BUSDMA_* flags, it is not clear from that name in
 > >  > which direction they work either. As in: is it a barrier that stops the
 > >  > next load? Is it a barrier that ensures that a store is visible?
 > > 
 > > Given that English is left-to-right, and that memory barriers are
 > > about ordering memory operations, it seems a lot clearer than "enter".
 > 
 > I don't think that arguments works with the way barriers around read and
 > write operations are normally used. A read barrier is normally "ensure
 > that a read doesn't move before this point", where as write barrier is
 > normally "ensure that a write operation doesn't move beyond this point".
 > Note the opposite temporal direction. Not sure if there are sensible use
 > cases of the inverted directions, i.e. if we care about CPUs that can
 > reorder reads relative to later writes.

I don't follow this argument. The natural interpretation of
"membar_any_store" is "barrier between any and store", so for a "read"
barrier, that is, "membar_read_read", it's a barrier between reads and
reads.

I don't see how the motion direction of the anticipated leakage
applies, and I think getting the anticipated leakage involved in the
naming of the operation is a bad idea. (It would be another example of
naming things after when they're supposed to be used rather than what
they do.)

-- 
David A. Holland
dholl...@netbsd.org


Re: membar_enter semantics

2022-02-13 Thread David Holland
On Mon, Feb 14, 2022 at 03:12:29AM +0100, Joerg Sonnenberger wrote:
 > Am Mon, Feb 14, 2022 at 02:01:13AM + schrieb David Holland:
 > > In this case I would argue that the names should be membar_load_any()
 > > and membar_any_store().
 > 
 > Kind of like with the BUSDMA_* flags, it is not clear from that name in
 > which direction they work either. As in: is it a barrier that stops the
 > next load? Is it a barrier that ensures that a store is visible?

Given that English is left-to-right, and that memory barriers are
about ordering memory operations, it seems a lot clearer than "enter".

-- 
David A. Holland
dholl...@netbsd.org


Re: membar_enter semantics

2022-02-13 Thread David Holland
On Fri, Feb 11, 2022 at 01:33:04PM +, Taylor R Campbell wrote:
 > membar_enter is currently documented to be a store-before-load/store
 > barrier: https://man.netbsd.org/NetBSD-9.2/membar_ops.3
 > 
 >  membar_enter()
 >Any store preceding membar_enter() will happen before all memory
 >operations following it.
 > 
 > [...]

This problem (and others like it) are primarily caused by using silly
names like "enter" and "exit" that don't mean anything. (That is: you
could redefine enter to be any-before-any and exit to be
read-before-read and it still wouldn't be obviously wrong to most
people.)

In general, things should be named by what they do, not by the
context in which you're supposed to use them according to some
cookbook thinking-averse approach.

In this case I would argue that the names should be membar_load_any()
and membar_any_store(). I see the argument about matching pairs and
the virtue of it being easy to tell which pairs match, but I don't
think that's outweighed by understanding what the operations actually
*do*, and mucking with this stuff without understanding is virtually
guaranteed to fail.

I will grant that "acquire" and "release" in this context have gained
enough standalone barrier-related meaning that it might be ok to use
those instead of the purely descriptive names (though I'd still prefer
#define membar_acquire() membar_load_any() for that so it's still
clear what's going on with a glance at the header)... but "enter" and
"exit" mean nothing except for a vague association with our mutex ops.

Can we please rename them? Since that's a big job, it's reasonable to
begin by tidying the implementation and leaving the old names
available for compatibility.

 >   Exception: riscv membar_enter is  fence w,rw  ...but I'm not sure
 >   the code we have in riscv matters at the moment -- as far as I know
 >   a working userland hasn't gone out in any release yet.

Indeed, it hasn't. It is safe to just change that.

-- 
David A. Holland
dholl...@netbsd.org


Re: procfs files vs symlink

2022-01-11 Thread David Holland
On Tue, Jan 11, 2022 at 11:10:24AM +0100, Manuel Bouyer wrote:
 > The attached diff changes the procfs behavior to match the linux one, for
 > linux processes:
 > comore:/home/bouyer>ls -l /proc/self/fd/
 > total 1
 > crw--w  1 bouyer  tty5, 0 Jan 11 11:08 0
 > crw--w  1 bouyer  tty5, 0 Jan 11 11:08 1
 > crw--w  1 bouyer  tty5, 0 Jan 11 11:08 2
 > lr-xr-xr-x  1 bouyer  staff   512 Jan 11 11:08 3 -> /home/bouyer
 > 
 > ls: /proc/self/fd//4: Invalid argument
 > lr-xr-xr-x  1 bouyer  staff 0 Jan 11 11:08 4

What causes that EINVAL?

also beware -- the linux world expects regular files to have canonical
paths, and that's just not true elsewhere and can't really be papered
over.

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal: Deprecate (or rename) extsrc/

2022-01-10 Thread David Holland
On Sat, Jan 08, 2022 at 12:28:52PM -0800, Paul Goyette wrote:
 > > Count me in as well - the name completion collision has always annoyed me.
 > 
 > +2  :-)



-- 
David A. Holland
dholl...@netbsd.org


untyped device calls (was: Re: General device properties API)

2021-10-03 Thread David Holland
On Mon, Sep 20, 2021 at 09:03:12PM +, Taylor R Campbell wrote:
 > > In any case, here is a diff that adds the argument structure type
 > > checking.  I also added an awk program to generate the argument
 > > structures and all of the call binding stuff automatically from an
 > > interface description file.  So, if someone wants to deal with the
 > > additional module management complexity, they can change
 > > gendevcalls.awk, re-gen the header files, and all of the call sites
 > > will be updated automatically.
 > > 
 > >https://www.netbsd.org/~thorpej/device-call-typing-diffs.txt
 > 
 > This doesn't appear to do type-checking for formal parameters -- the
 > functions still take void *.
 > 
 > But, like I said before: I'd really like to establish what value this
 > abstraction mechanism brings us in the first place -- still not clear
 > to me.

I am confused by this whole discussion (which is why it's taken me
this long to reply) -- I see discussion of link sets and I see a patch
that seems to just add more goop around the strings (the one above),
but the problem with typing is very simple:

In the original device_call commit, instead of adding a vtable with
one "virtual function" (function pointer) per virtual call, for some
reason we got a single virtual function that looks up other virtual
functions by string name, and an API that uses void pointers to call
them.

That is, instead of

devhandle->enumerate_children(dev, devhandle, callback, callback_arg)

(which one might dress up with a macro), we get, equally unwrapped:

struct device_enumerate_children_args args;
args.callback = callback;
args.callback = callback_arg;
call = devhandle->lookup_device_call(devhandle, "string", );
call(dev, handle2, );

which is both much messier and much less safe, because the string is
completely uncheckable and might be typo'd, and the args are not
typechecked.

This creates a level of indirection, but what purpose does this level
of indirection serve? Looking at the implementation, it sure seems
like "none at all".

Why not just make each device call an entry in an ops table like every
other virtual call system we have?

-- 
David A. Holland
dholl...@netbsd.org


Re: eventfd(2) and timerfd(2) APIs

2021-09-19 Thread David Holland
On Sun, Sep 19, 2021 at 07:43:32AM -0700, Jason Thorpe wrote:
 > >> | else plumbs it through either, but I'll go ahead and do so.
 > >> 
 > >> Please do.   What other things don't permit fcntl() to work?   We
 > >> should fix any of those.
 > > 
 > > Well, I?ll fix these 2 anyway.
 > 
 > ?and I just double-checked this as well? F_GETFL and F_SETFL
 > don?t need any specific hooks in the implementation; the flags are
 > all handled in common code.

Pretty much all of fcntl does (or should) happen at the fd level;
that's why it's different from ioctl.

In fact, I'd go so far as to say that anything in the fcntl path that
calls into the object is a bug...

-- 
David A. Holland
dholl...@netbsd.org


Re: Overhaul of I2C and SPI device autoconfiguration

2021-09-12 Thread David Holland
On Sat, Sep 11, 2021 at 03:43:24PM -0700, Jason Thorpe wrote:
 > The basic idea involves the device call mechanism I introduced a while ago.

You still haven't explained why that needs to be untyped and based on
uncheckable string constants.

-- 
David A. Holland
dholl...@netbsd.org


Re: General device properties API

2021-08-15 Thread David Holland
On Sun, Aug 15, 2021 at 09:10:48AM -0700, Jason Thorpe wrote:
 > > Why do we pass untyped strings around with this device_call mechanism?
 > > Was there any discussion about this beforehand?  Had I known it was
 > > proposed I would have objected to an untyped string method calling
 > > mechanism -- surely we can do better to detect typos and type errors
 > > at compile-time.
 > 
 > This complaint is at best tangentially related to the topic at
 > hand, and can be adjusted separately from this discussion.

Sure, but let's not forget to, ok? I was also not entirely pleased by
seeing new untyped things going in without discussion.

-- 
David A. Holland
dholl...@netbsd.org


Re: Some changes to autoconfiguration APIs

2021-08-04 Thread David Holland
On Wed, Aug 04, 2021 at 05:52:46PM -0700, Jason Thorpe wrote:
 > Old example:
 > 
 > c->c_dev = config_found(sc->sc_dev, , pciprint,
 > CFARG_SUBMATCH, config_stdsubmatch,
 > CFARG_LOCATORS, locs,
 > CFARG_DEVHANDLE, devhandle,
 > CFARG_EOL);
 > 
 > New example:
 > 
 > c->c_dev = config_found(sc->sc_dev, , pciprint, 
 > CFARGS(.submatch = config_stdsubmatch,
 >.locators = locs, 
 >.devhandle = devhandle));
 > 
 > Acceptable?

Seems like a definite improvement as long as .submatch and whatnot are
typed (and not (void *)).

-- 
David A. Holland
dholl...@netbsd.org


Re: VOP_STRATEGY for devices and pipes

2021-07-19 Thread David Holland
On Thu, Jul 15, 2021 at 11:00:36AM -0700, Chuck Silvers wrote:
 > setting an extattr on a device node in UFS is already disallowed.
 > ffs_setextattr() has this check:
 > 
 > if (ap->a_vp->v_type == VCHR || ap->a_vp->v_type == VBLK)
 > return (EOPNOTSUPP);
 > 
 > I have not checked other file systems to see if they have an
 > equivalent check.

I don't think anything else does extattrs. tmpfs doesn't, for example.

 > I think we should continue to support extattrs on fifos for two reasons:
 >  - extattrs are used to implement ACLs, and it is useful to be able
 >to set an ACL on a fifo.
 >  - it is good to remain more compatible with FreeBSD in this area.

Reasonable enough.

of course it's also useful to be able to set an ACL on a device...
but not now.

-- 
David A. Holland
dholl...@netbsd.org


Re: VOP_STRATEGY for devices and pipes

2021-07-15 Thread David Holland
On Thu, Jul 15, 2021 at 06:26:55PM +0200, Rhialto wrote:
 > > Seems to me the best/safest thing to do for now would be to prohibit
 > > the extended-attr on devices and fifos.
 > 
 > If I have a normal file system, and in a directory is an entry with an
 > inode for a fifo (or a device), and I set an extended attribute on that
 > inode, then surely it will get stored in the file system, and not in the
 > fifo (or device)?
 > 
 > So if the fifo (or device) code gets involved in this, then the bug is
 > there, I'd think?

Well... yes, in theory.

The way it actually works right now is that device and fifo vnodes
belong half to the filesystem and half not -- every filesystem on
which devices and fifos can be created (which is not all of them) has
two additional vnode ops tables for device and fifo vnodes
respectively, which redirect most but not all of the ops to the code
in miscfs/specfs and miscfs/fifofs. So things like chmod (VOP_SETATTR)
go through the filesystem code and things like read and write go to
the device or fifo.

Buffers associated with a file are hung on that file's vnode. This is
overloaded for devices such that buffers for the _device_ are hung on
the device vnode... it works that way because the buffer cache is
indexed by vnode and offset. (The buffer cache is both virtually and
physically indexed: you can have buffers for an offset in a file, and
buffers for a raw disk address, and the latter use the disk device's
vnode as part of the key.)

We've had assorted problems in the past caused by this overloading; if
you remember when wapbl was rolled out there were assorted problems
when wapbl was enabled on the root fs; this was caused by calling the
root fs's operations for device buffers that came from some other
filesystem, which then tried to do wapbl operations where no wapbl
goop existed and croaked.

We think those issues have pretty much all been found and fixed, but
I'm wary of adding new ones.

One of the goals of the device plumbing changes I proposed a couple
months ago is to get rid of this overloading by making operating
device vnodes separate from the on-disk special file vnodes. But
that's a big invasive change and not happening in the near future.

-- 
David A. Holland
dholl...@netbsd.org


VOP_STRATEGY for devices and pipes

2021-07-14 Thread David Holland
I noticed that there was a loose function ffsext_strategy lying
around, mentioned it to Christos, and he committed this:

 > Modified Files:
 >  src/sys/ufs/ffs: ffs_vnops.c
 > 
 > Log Message:
 > Hook up ffsext_strategy to fifos. Pointed out by dholland@

which is well and good (actually it isn't, it blew up a vnode tables
cleanup patch I was preparing, but never mind that for now) but:

If this is needed for fifos, isn't it also needed for devices?

The ffs extended attribute blocks are stored in the buffer cache as
negative offsets into the file they belong to (distinct, one hopes,
from the negative offsets used for indirect blocks), which means that
sooner or later they'll get written out, and when they do that happens
via VOP_STRATEGY on the vnode they belong to.

For fifos, without the above change, this will panic; VOP_STRATEGY on
all fifos was/is genfs_badop. So probably trying to set extended
attributes on a fifo will crash.

For devices, though, the same thing will happen: buffers at negative
offsets will be hung on the device, but if not intercepted by ffs
(currently they aren't) they'll be handled by spec_strategy and
treated as physical disk blocks. Which if we're lucky, will crash, and
if not, possibly write to negative offsets in one disk partition which
are positive offsets in the previous one, and corrupt the filesystem.

So I think ffs also needs an ffsspec_strategy that does the same
interception.

Except, I'm not convinced that it's a good idea to try to mix the
filesystem's metadata blocks with the device's own raw blocks like
this. I don't know of anything that tries to use negative offsets on a
device to mean something special, but I don't know there isn't
anything either, in which case combining the two would make a horrible
mess. But even absent that, it seems like having these entirely
different blocks queueing in the same place is a bad idea begging for
trouble.

So I think what I would like to do is, for now at least (the device
plumbing changes I posted about a couple months back would eventually
change the situation) just prohibit extended attributes on device
nodes entirely.

(I would not be opposed to also prohibiting them on fifos and
reverting this change for the time being, if only because it means I
wouldn't have to redo the patches I was hoping to commit this week...)

Thoughts? Also, am I missing something?

-- 
David A. Holland
dholl...@netbsd.org


Re: kern.maxlockf for byte range lock limit

2021-07-12 Thread David Holland
On Mon, Jul 12, 2021 at 08:03:54AM +, Emmanuel Dreyfus wrote:
 > On Fri, Jul 09, 2021 at 08:52:08PM +0000, David Holland wrote:
 > > This was discussed somewhere briefly (probably on chat) a couple weeks
 > > ago, with the conclusion that we should just make the limit something
 > > like twice the maximum number of open files and forget about it. No
 > > real need to add more knobs.
 > 
 > Here is a patch that ties it to maxfiles, with documentation. Tying
 > to open files would be more complex. We could also tie to
 > curproc->p_rlimit[RLIMIT_NOFILE].rlim_cur if that is more relevant.

Well, that was the idea; make it some factor times the current open
file limit or something like that. Not sure why the existing limit is
apparently per-user rather than per-process or what that's supposed to
accomplish. These lock objects are not exactly large so it's not
necessary to be tightfisted with them.


note typo:
 > unprivilegied

-- 
David A. Holland
dholl...@netbsd.org


Re: kern.maxlockf for byte range lock limit

2021-07-09 Thread David Holland
On Fri, Jul 09, 2021 at 03:27:57PM +, Emmanuel Dreyfus wrote:
 > I recently hit a bug where mariadb import failed befause of 
 > maxlocksperuid limit:a
 > http://mail-index.netbsd.org/netbsd-users/2021/07/09/msg027330.html
 > 
 > Attached is a patch that documents ENOMEM for fcntl() and flock()
 > and that adds a sysctl kern.maxlockf to make the limit configurable.

This was discussed somewhere briefly (probably on chat) a couple weeks
ago, with the conclusion that we should just make the limit something
like twice the maximum number of open files and forget about it. No
real need to add more knobs.

-- 
David A. Holland
dholl...@netbsd.org


Re: vn_open, EDUPFD, EMOVEFD

2021-06-28 Thread David Holland
On Tue, Jun 29, 2021 at 05:07:10AM -, Christos Zoulas wrote:
 > >* Does anyone know why dev/kloader.c calls namei and then vn_open on
 > >the same path? I remember seeing this before and leaving it alone
 > >because nobody I could find was sure what the deal was, but it's
 > >unlikely to work as-is and increasingly likely to break over time...
 > 
 > Just fix it, and if something breaks we'll put it back.

Ok, I will do exactly that :-) but separately.

 > >-   NDINIT(, LOOKUP, NOFOLLOW, pb);
 > >-   error = vn_open(, filemode, createmode);
 > >+   error = vn_open(NULL, pb, 0, filemode, createmode, vpp, NULL, NULL);
 > 
 > This is the only NOFOLLOW NDINIT case, should that be 'createmode |
 > O_NOFOLLOW'?

Yes it should, thanks. (Except filemode, not createmode.) Too many
mostly-the-same changes...

 > >-   NDINIT(, CREATE, LOCKPARENT, pb);
 > >
 > >/*
 > > * Since we do not hold ulfs_extattr_uepm_lock anymore,
 > > * another thread may race with us for backend creation,
 > >-* but only one can succeed here thanks to O_EXCL
 > >+* but only one can succeed here thanks to O_EXCL.
 > >+*
 > >+* backing_vp is the backing store. 
 > > */
 > >-   error = vn_open(, O_CREAT|O_EXCL|O_RDWR, 0600);
 > >+   error = vn_open(NULL, pb, 0, O_CREAT|O_EXCL|O_RDWR, 0600,
 > >+   _vp, NULL, NULL);
 > 
 > I guess O_CREAT will do the LOCKPARENT later...

Indeed.

 > I would have preferred if EMOVEFD/EDUPFD were gc'ed as part of the patch,
 > because there is a lot of ugliness left.

So would I, but it's a big deal (requires changing every [cb]devsw...)

-- 
David A. Holland
dholl...@netbsd.org


vn_open, EDUPFD, EMOVEFD

2021-06-28 Thread David Holland
A couple days ago it came to my attention that the hack used for
cloning devices (returning a magic error code and stuffing a file
descriptor number in a magic field of struct lwp) is also used by
/dev/stderr, and more importantly, every caller of vn_open in the
kernel (there are many) needs to know about this hack to avoid
returning internal negative errnos to users. Needless to say, none of
them (besides do_open) actually do, plus this is gross.

The following patch does not clean out the hack, but establishes a
containment perimeter inside vn_open; vn_open now returns either a
vnode or a file descriptor number. (And, along with the file
descriptor number, a flag saying whether it was the EDUPFD or EMOVEFD
case.) Callers of vn_open not prepared to deal with file descriptors,
which is all of them besides do_open, can pass NULL for the file
descriptor return parameter, in which case vn_open will instead return
EOPNOTSUPP.

It also takes advantage of having rearranged vn_open's signature to
stop exposing struct nameidata, which is a plus.

There are now eight arguments to vn_open, which is a lot, but it's
still simpler than struct nameidata:
at_dvp (directory vnode for openat() starting point, NULL ok)   
pb (pathbuf)
nmode (additional namei flags)
fmode (open flags, converted from O_* to F*)
cmode (creation permissions)
ret_vp (pointer for returned vnode)
ret_domove (pointer for returned is-EMOVEFD flag)
ret_fd (pointer for returned file descriptor)

I have also rearranged the interface to fd_dupopen, which uses the
vn_open results, to match, and so that its return value argument is
last like it ought to be. This was chosen to be incompatible (as
opposed to just moving int arguments around, which isn't safe) in case
there were other calls to it somewhere. (There weren't.)

This patch is missing the doc change to vnsubr(9) but I'll deal with
that before committing.

It also builds but hasn't been tested yet; test results are pending.

Requires a kernel bump because a number of the callers of vn_open are
in modules.

Comments? Objections? Tentative plan is to commit this along with the
VOP_PARSEPATH changes (proposed years back), which also require a
kernel bump, plus some other pending kernel bump stuff other folks
have, in the next few days.

Note that cleaning out the hack entirely would be expensive (because a
similar alternative return value scheme is needed in the open methods
of struct cdevsw and bdevsw) so I'm not going to attempt it yet; plus
it's a natural part of the long-term device plumbing rearrangement I
posted about a few weeks back, and it makes more sense to do it as
part of that than separately.

Oh also, note that a couple of callers of vn_open were passing
LOCKPARENT and relying on vn_open to ignore it. That's been
rectified.

And before I forget: currently uses of vn_open that yield EMOVEFD when
the caller isn't prepared to handle it leak the fd produced. This
patch doesn't change that; doing so without creating new layer
violations would be a major nuisance. Avoiding this behavior sensibly
requires a complete cleanup of the hack, which prompts layering
violations because it's itself a layer violation.

* Does anyone know why dev/kloader.c calls namei and then vn_open on
the same path? I remember seeing this before and leaving it alone
because nobody I could find was sure what the deal was, but it's
unlikely to work as-is and increasingly likely to break over time...


diff -r 4c2d0a182ef8 external/cddl/osnet/sys/sys/vnode.h
--- a/external/cddl/osnet/sys/sys/vnode.h   Sun Jun 27 18:13:54 2021 -0400
+++ b/external/cddl/osnet/sys/sys/vnode.h   Mon Jun 28 11:05:45 2021 -0400
@@ -239,7 +239,6 @@
 vnode_t **vpp, enum create crwhy, mode_t umask)
 {
struct pathbuf *pb;
-   struct nameidata nd;
int error;
 
ASSERT(seg == UIO_SYSSPACE);
@@ -248,11 +247,9 @@
ASSERT(umask == 0);
 
pb = pathbuf_create(pnamep);
-   NDINIT(, LOOKUP, NOFOLLOW, pb);
-   error = vn_open(, filemode, createmode);
+   error = vn_open(NULL, pb, 0, filemode, createmode, vpp, NULL, NULL);
if (error == 0) {
-   VOP_UNLOCK(nd.ni_vp, 0);
-   *vpp = nd.ni_vp;
+   VOP_UNLOCK(*vpp, 0);
}
pathbuf_destroy(pb);
return (error);
diff -r 4c2d0a182ef8 sys/dev/firmload.c
--- a/sys/dev/firmload.cSun Jun 27 18:13:54 2021 -0400
+++ b/sys/dev/firmload.cMon Jun 28 11:05:45 2021 -0400
@@ -204,7 +204,6 @@
 firmware_open(const char *drvname, const char *imgname, firmware_handle_t *fhp)
 {
struct pathbuf *pb;
-   struct nameidata nd;
struct vattr va;
char *pnbuf, *path, *prefix;
firmware_handle_t fh;
@@ -236,8 +235,7 @@
error = ENOMEM;
break;
}
-   NDINIT(, LOOKUP, FOLLOW | NOCHROOT, pb);
-   error = vn_open(, 

PR 56275 (pthread_cond_timedwait is broken)

2021-06-24 Thread David Holland
I ran into problems with pthread_cond_timedwait today, which I filed
in gnats because I don't have any more time to look at it for the
moment. The short summary is that I was getting all manner of broken
behavior (e.g. spurious wakeups that appear impossible from code
examination) and my best guess is that the waiter queues are getting
crosslinked somehow.

Also the test program never completes, and I don't think it's because
it's wrong (after the initialization part, all the mutex ops are
paired and all the condvar waits have timeouts)... so either
something's causing it to deadlock or _lwp_park is not timing out when
it should.

-- 
David A. Holland
dholl...@netbsd.org


Re: maximum limit of files open with O_EXLOCK

2021-06-20 Thread David Holland
On Sat, Jun 19, 2021 at 08:12:38AM +, nia wrote:
 > The Zig developer found the kernel limit:
 > https://nxr.netbsd.org/xref/src/sys/kern/vfs_lockf.c#116
 > 
 > but it doesn't seem to be adjustable through sysctl.
 > I wonder if it should be.

I wonder if the logic should be changed to allow one "free" lock per
open file, since those are already limited, and it's not like the
amounts of memory involved are large.

-- 
David A. Holland
dholl...@netbsd.org


Re: Devices.

2021-06-02 Thread David Holland
On Tue, Jun 01, 2021 at 10:08:55AM -0700, Brian Buhrow wrote:
 >  hello David.  What I don't see in your proposal is a way of
 > implementing a dynamic device filesystem.

As I already posted about once, that's irrelevant to what I'm
proposing.

A working devfs that can be used in all places where device nodes
might be needed would mean that the device vnode apparatus in ffs and
every other fs could be removed, yes, but that does not _solve_ the
problem that puts it there, it only centralizes it. Meanwhile it
addresses none of the other issues.

Also, devfs as popularly envisioned remains a bad idea. As I said
before, it's an automounter, not a file system, and automounters are
accursed.

If you (or anyone else) really wants it to happen, I recommend sorting
out _in detail_ how to configure it and where (and how) to store the
persistent state it needs to remain compatible with expectations. For
example, suppose there are two (different) chroots as well as the
"real" /dev and someone inserts a USB microphone, with the intent that
one of the chroots (which they're using as a sandbox for a popular
videoconferencing app) can use it and the other (which is running code
that's potentially overtly hostile) cannot. How do you decide the
user/group and mode for the new device node in each of the three
instances of /dev? In what language do you write the policy that
describes what's supposed to happen? And where does the decision logic
execute? Where is the policy stored? How do you get it to reload when
needed? None of this is trivial.

Also decide for certain whether the objects your devfs automounter
adds to the namespace are in fact ordinary device special files with
hardcoded major/minor numbers (this is a lot simpler, but defeats much
of the point) or should be something more flexible, like dynamically
generated vnodes that connect directly to device drivers.

Unlike most people talking about it (here and elsewhere) I have
actually implemented this (in a research system) and my conclusion
from that experience was that it was a mistake. I don't propose to
waste my time doing it again.


 > The way we handle ZFS is just goofy

As far as I know this is largely or entirely the fault of the ZFS
developers for doing their own thing instead of sticking to the
established/conventional abstractions for devices and file systems.

-- 
David A. Holland
dholl...@netbsd.org


Re: Devices.

2021-05-29 Thread David Holland
On Sat, May 29, 2021 at 04:17:13PM -0700, John Nemeth wrote:
 >  We should really get with the times and create a devfs.  I
 > know that there are people that disagree with this (likely including
 > you), but the archaic device node system causes a lot of headaches
 > and it's time that we joined the 21st century.  Anything done with
 > devices should be done with idea of a devfs in mind.  Yes, devfs
 > like things have caused a lot of problems on other operating systems,
 > but I think we have enough brain power and enough real world examples
 > to be able to not repeat the mistakes of the past.

The problem with the devfs everyone wants is that it's an automounter,
not a file system, which means that (a) everybody who tries to
implement it as a file system discovers that the abstractions don't
match up and ends up in varying degrees of trouble as a result, and
(b) automounters seem to be inherently awful and devfs implementations
thus are too.

The stuff I'm proposing doesn't affect those issues, and is likely to
make any implementation easier/simpler: it reduces the amount of work
a fs with device special files on it needs to do, and it also provides
the means to look up and "start" any given device or device instance
as its own interface disentangled from the filesystem code.

IMO the only significant problem that devfs solves is storing magic
numbers (device major numbers) on disk, and I think that's readily
(and better) solved by storing driver names instead of major numbers.
Otherwise it's mostly a matter of not cluttering /dev with entries for
hardware you don't have (nice but not a big deal) and not needing to
run MAKEDEV in various circumstances (definitely useful, but not
exactly critical)... all of which would be all very well if a clean
implementation existed, but given that this does not seem to be the
case it doesn't seem to me to be worthwhile.

-- 
David A. Holland
dholl...@netbsd.org


Re: Devices.

2021-05-29 Thread David Holland
On Sun, May 30, 2021 at 12:00:16AM +0200, Johnny Billquist wrote:
 > On 2021-05-29 22:26, David Holland wrote:
 > > There are a number of infelicities in the way we currently handle the
 > > I/O plumbing for devices in the kernel. These include:
 > 
 > [...]
 > 
 > Just looking/thinking about the ioctl part - you say abolish it inside the
 > kernel. So does that mean that we keep the ioctl() interface for user
 > programs, but we'd create a system function that will translate ioctl()
 > calls from user programs to different function calls based on class and
 > function.

Yes, that's the idea. Getting rid of it at the system call interface
is a much harder problem (between compat, binary emulation, standards,
masses of existing 3rd-party code that calls ioctl explicitly, etc.)
and probably not worth biting off at all, let alone right now.

(Hopefully this also addresses Paul's email.)

This was all based on the experience of adding discard and adding the
dispatching for it as a first-class [bc]devsw op rather than an ioctl:
it was a pain because it ultimately required touching _every_ driver,
not just the ones that needed to support it, but it was far less of a
mess than plumbing it as an ioctl would have been.

 > Which means any time anyone wants to add some new kind of function to a
 > device, you'd have to adjust the ioctl() system call, the class, all
 > devices of that class, and of course your specific driver which you want to
 > do something with the hardware.

That is correct; but normally when you do that (like with discard) you
want it to exist on every device of that class, even if to begin with
most of the implementations are "return ENOSYS".

It is unlikely that ioctl can actually be made to go away entirely
within the kernel or even the vnode side of the kernel (the networking
code is also full of ioctls that this set of plumbing changes would
not affect at all) so there'd probably still be a place to stick
really ad hoc things. But mostly it's better to avoid those,
especially because they gradually accrete into standard interfaces. :-|

Note that even if it does become possible to kill off ioctl at the
driver level it's not going to happen anytime soon, and that there
will also likely need to be some kind of "misc" driver class as well.
I would expect the process of moving in this direction to begin by
carving off obvious device classes (e.g. disks, audio) and then
migrating their ioctls; at some point we'll get to a stage where the
stuff that's left is all atypical and it becomes a waste of time to
try to make up a framework that fits it all.


... something I should have mentioned in the original mail:
centralizing ioctl dispatch also allows having a single copy of any
processing code for the arguments, which isn't a big deal for ordinary
ioctls but should definitely help with robustness as soon as there are
pointers involved or compat32 issues.

-- 
David A. Holland
dholl...@netbsd.org


Re: Devices.

2021-05-29 Thread David Holland
On Sat, May 29, 2021 at 05:41:38PM -0400, Mouse wrote:
 > There is, however, one thing I think is missing in your rework
 > proposal.  I see nowhere to fit in a one-off driver for idiosyncratic
 > hardware - or as a pseudo-device interface to idiosyncratic kernel
 > code.  I've personally done enough instances of each that I would find
 > it close to crippling if support for them were to be lost.

It's unlikely to be possible to avoid having a device class called
"miscellaneous" that such things could be stuffed into, because, as
you note, they exist.

 > > For disks, which for historical reasons live in both cdevsw and
 > > bdevsw, both entries would point at the same disk_dev.
 > 
 > I would suggest getting rid of the bdev/cdev distinction.  It is, as
 > you say, a historical artifact, and IMO it is not serving anyone at
 > this point.

It is deeply baked into the system call API and into POSIX, so it's
not going anywhere. It's been proposed that we should stop having
block devices, which would have the same net effect; I have no strong
opinion on that and it doesn't need to be part of this set of changes.

 > > (2) Abolish ioctl inside the kernel, or at least within the device
 > > tree.
 > 
 > This would, I think be possibly the hardest part of dealing with
 > idiosyncratic drivers such as I sketched above.  Some of the drivers
 > I've done use ioctls, some use I/O on minor sub-numbers (eg, the low
 > three bits of the minor number select a functional interface whereas
 > the rest of the bits are the "true" minor number), but most of them
 > need something of the sort.

Getting rid of it altogether is likely too optimistic for reality. See
next email (upcoming).

 > > A third question: how does this affect interfaces?
 > 
 > As in, network interfaces?  Good question.  I think they should be
 > device nodes in the filesystem *somehow*.

That's probably true, but they currently aren't and the plumbing above
them is unrelated to the VFS device plumbing, so for the time being
it's a separate issue.

Disentangling the current situation with device special files on
filesystems will make it easier to manifest interfaces on disk if we
ultimately want that.

-- 
David A. Holland
dholl...@netbsd.org


Devices.

2021-05-29 Thread David Holland
There are a number of infelicities in the way we currently handle the
I/O plumbing for devices in the kernel. These include:
   - cloning devices exist but as currently implemented violate
 layering abstractions;
   - every file system needs to have cutpaste vnode ops tables for
 device vnodes;
   - the split between block and character devices was never
 particularly well anchored in reality (e.g. tapes) but has aged
 poorly since as new classes of devices have appeared;
   - because we don't distinguish between different classes of devices
 nearly all device ops travel through the system as ioctls;
   - because of the ensuing complexity, dispatching of ioctls is a
 mess; there are many cases where ioctl handlers match some ioctls
 specifically and pass anything else on, which introduces
 opportunities for various kinds of bugs;
   - adding a new device-level operation to cdevsw or bdevsw requires
 touching every driver, including those it is completely
 irrelevant to;
   - if you have multiple sets of device nodes on a system (e.g. in
 chroots) operations that affect the device nodes themselves can
 behave differently or strangely depending on which copy you touch
 (we have had multiple generations of hacks to mitigate this, and
 none have been completely satisfactory);
   - because we don't distinguish device classes in any modern sense
 of the term they cannot be addressed or reasoned about in system
 config or things like kauth policies;
   - and probably other things I haven't thought of.

I've been mumbling on and off for a long time about various parts of
this problem, and I think it's time to propose a unified architecture
for a solution. Note that the changes required are nontrivial and this
is not going to happen all at once (or anytime soon); the goal of
blathering about it is to try to reach agreement on a place that we
want to get to eventually... and also, to smoke out any places where
the proposed architecture won't actually work or is inconsistent with
what happens on the ground.

There are four major interconnected sets of changes I have in mind to
address these problems.

(1) Create explicit device classes. This would be adding a layer of
indirection between struct cdevsw/bdevsw and drivers; so e.g. a mouse
driver would, instead of declaring a struct cdevsw, declare a struct
mouse_dev containing operations on mice, and the cdevsw entry would
point at this. For disks, which for historical reasons live in both
cdevsw and bdevsw, both entries would point at the same disk_dev.

(2) Abolish ioctl inside the kernel, or at least within the device
tree. Given separate device classes, each operation needed can be made
its own operation on that device class (that is, a function pointer in
struct foo_dev) with the ensuing large increase in clarity about what
the operations are and where they need to be implemented. Plus this
way we get type safety for the arguments.

(3) Rearrange the way operations dispatch to devices. The traditional
model is that opening a device gives you a device vnode, and device
ops are dispatched to the device driver by looking up the major number
and indirecting through either the cdevsw or bdevsw table, and passing
the minor number as an argument. (This throws away all information
about how or when the device was opened, which is why cloners needed
to do something different.) The proposed method is that device vnodes
resolve the identity of the device when first loaded, and at runtime
point to the e.g. struct disk_dev rather than remembering the major
number; and for cloners they point to a device instance structure
created when the device is opened, which holds the per-instance data
the cloner needs. (It's not clear to me right now if only cloners
should get instance structures, or if for uniformity it makes sense to
allocate them for every driver, or if it should be a property of
certain device classes -- the overhead of a allocating a handful of
extra small structures isn't important, so it's mostly a code
complexity question.) Operations on devices go to the device vnode and
are then sent on to the driver directly; the cdevsw and bdevsw tables
are used only when devices are first looked up. All ioctls are turned
into explicit device-level operations in the device vnode's ioctl op.

(3a) Further down the line it might make sense to make devices _not_
vnodes but instead make them different instances of struct file
(either one for all devices or even one for each device class) -- this
would move ioctl dispatching up a layer, which would be an
improvement. But I don't think this needs to be part of the initial
plan. Plus 

(4) Make device vnodes fs-independent and rearrange how looking them
up works. Get rid of the extra ops table for devices that every fs has
to have (and also the one for fifos); instead, make fs-level special
file vnodes mostly-inert objects that don't support anything much
besides getattr 

Re: fsync error reporting

2021-02-19 Thread David Holland
On Fri, Feb 19, 2021 at 08:47:16AM -0500, Greg Troxel wrote:
 > I see our man page addresses this with FDISKSYNC.   It sounds like you
 > aren't proposing to change this (makes sense), but there's the pesky
 > issue of errors within the disk when writing from cache to media.
 > Perhaps those are unreportable.

Well, maybe I am. As I said in the original post, punting this issue
to applications via an obscure interface that most won't know about
doesn't seem like the right approach.

But it's pretty much a separate issue.

-- 
David A. Holland
dholl...@netbsd.org


Re: fsync error reporting

2021-02-19 Thread David Holland
On Fri, Feb 19, 2021 at 08:33:03AM -0500, Greg Troxel wrote:
 > Maybe I'm way off in space, but I'd like to see us be careful about
 > 
 >   1) operating system has a succcessful return from a write transaction to
 >   a disk controller (perhaps via a controller that has a write-back
 >   cache)
 > 
 >   2) operating system has been told by the controller that the write has
 >   actually completed to stable storage (guaranteed even if OS crashes or
 >   power fails, so actually written or perhaps in battery-backed cache)
 > 
 >   A) for stacked filesystems like raid, cgd, and for things like NFS,
 >   there's basically and e2e ack of the above condition.

Disk controllers don't in general tell you (2); all you can really do
is flush the cache, and sometimes if you tell them to flush the cache
they just ignore it anyway. Not to mention that, consumer-grade SSDs
are likely to turn to swiss cheese on a power failure (regardless of
flushing) if they aren't fully quiescent, and sometimes maybe even
then.

There is not much we can do about this, but we can at least make the
reporting chain as far as the disk controller and back, which is (1),
work reliably.

-- 
David A. Holland
dholl...@netbsd.org


Re: fsync error reporting

2021-02-18 Thread David Holland
On Thu, Feb 18, 2021 at 11:00:15PM -0500, Mouse wrote:
 > > (3) I think the drawbacks of reporting user 1's I/O errors to user 2
 > > [...] mean that we should guarantee that I/O errors from *your*
 > > writes should be reported by *your* call to fsync.  [...]
 > 
 > > (3a) I don't think it's necessary to guarantee that I/O errors from
 > > other people's writes won't _also_ be reported by your fsync call,
 > > but I think any natural implementation that supports the prior
 > > guarantee will also have this property.
 > 
 > I'm not so sure.
 > 
 > A opens F
 > B opens F
 > A write()s 10 bytes at offset 10, thereby dirtying the block at offset 0
 > B write()s 10 bytes at offset 30, thereby dirtying the block at offset 0
 > Kernel now pushes the block at offset 0 and gets a hardware error
 > 
 > Now: which of the writes errored and thus should get an error at next
 > fsync?  A's?  B's?  Both?  If B's write is instead at offset 10, so it
 > completely overwrites A's, does that change your answer?
 > 
 > I think the only sane answer to the first question is "both".  This
 > then leaves open the question of how to ensure _both_ fsync()s error.
 > I don't see any particularly "natural" way to do that that won't also
 > report errors due to someone else's write.  Maybe I'm just missing
 > something.

Well, if both A and B update a block and then it errors on writeback,
both their writes failed, regardless of whether B overwrote A's data
entirely or not. So I think reporting to both is correct. If they
write to different blocks, though, they won't see each others' errors
under the scheme I proposed.

 > > everything that process wrote is on disk,
 > 
 > That is probably unattainable, since I've seen it plausibly asserted
 > that some disks lie, reporting that writes are on the media when this
 > is not actually true.

Indeed. What I meant to say is that everything has been sent to disk,
as opposed to being accidentally skipped in the cache because the
buffer was busy, which will currently happen on some of the fsync
paths.

That's why flushing the disk-level caches was a separate point.

 > However, I think the kernel can be excused for believing a lie from the
 > device in that regard, especially since there really isn't much
 > alternative.

Right.

 > > (8) I'm not convinced that there's any real value in reporting
 > > exactly what blocks failed.
 > 
 > Is there any interface to do so via?

No; we'd have to make one up, which doesn't seem worthwhile.

-- 
David A. Holland
dholl...@netbsd.org


fsync error reporting

2021-02-18 Thread David Holland
I have been going through various code paths (hence all the posts and
commits about obscure details) with an eye toward documenting what we
do and do not guarantee about files to applications... and I've gotten
to fsync.

In an ideal world, if you write some data and later when it gets
written to disk the disk says EIO, this will be reported to the
application as an EIO return when (if) the application calls fsync.
This allows, for example, databases to detect when their transaction
commits are not actually committed, and take appropriate steps like
failing over to another replica.

In the world we currently have, the chances of this working are
identically zero; the error reporting chain for fsync and for write
errors in general is completely borked.

There are several places where errors are discarded (e.g. any errors
reported by putpages are flatly ignored by vflushbuf) and other places
where errors are collected or not depending on circumstances; e.g.
again in vflushbuf I/O errors on metadata blocks are noticed only for
buffers attached to the disk vnode and not the file vnode. (And while
the comment says this identifies indirect blocks, I don't think that's
actually true in general either.)

Some of the current code paths that do check for error also give up as
soon as an error happens and skip trying to fsync the rest of the
file. This is probably not helpful; it's not necessarily clear that
it's completely wrong (since as kre points out, today once a disk
starts reporting EIO further behavior is probably undefined) but
that depends on circumstances. It's not entirely implausible that a
RAID might emit some transient EIOs during failover and then become
fully operable again a moment or two later. Also, ISCSI.

Also, some of the code paths skip over busy buffers, and not all of
them retry, meaning that blocks can be skipped, which is a fairly
serious bug if it happens.

Meanwhile, as appeared in the other thread about write errors, it
would be nice if separate processes writing to the same file either
didn't receive notification about each others' I/O errors, or if they
all received notifications about all I/O errors. (That is, it would be
useful to avoid the world where my I/O error gets reported to you and
then not to me.)

And currently there's a problem that the only way to flush the
underlying hardware-level caches is to call fsync_range and pass
FDISKSYNC. This might be POSIX (is it? man page doesn't say so) but it
doesn't necessarily seem helpful or the right way to go about handling
this issue.

Documenting what currently happens is pointless, so in what I'm
writing I would like instead to document the model we'd like to have,
and for the time being append a large asterisk that says none of it
actually works.

My thoughts on the subject are:

(1) We should guarantee that any disk-level write error that doesn't
result in an immediate EIO return from write (and as I was saying
elsewhere, I don't think that will ever happen for regular files
without O_DIRECT) results in an eventual EIO return from fsync.

(2) Even when fsync ends up returning EIO it should sync as much of
the file as it can; that is, unlike typical error paths EIO should not
result in immediately stopping and unwinding.

(3) I think the drawbacks of reporting user 1's I/O errors to user 2
(especially when you can fsync after opening with O_READONLY, so there
isn't necessarily a huge amount of trust involved) mean that we should
guarantee that I/O errors from *your* writes should be reported by
*your* call to fsync. I think it's sufficient to make this per-open
(rather than per-file-per-lwp or whatever) so the machinery can live
in struct file.

(3a) I don't think it's necessary to guarantee that I/O errors from
other people's writes won't _also_ be reported by your fsync call, but
I think any natural implementation that supports the prior guarantee
will also have this property.

(4) I'm not sure what to do about disk-level caches but I think
punting the issue to application software (especially via an obscure
and apparently not standard interface) isn't the right approach.

(5) I'm not sure if outstanding unreported I/O errors should cause
close to fail with EIO or if they should just be dropped. Given how
useless error handling on close is, it's probably a moot question.

(6) We do absolutely need to guarantee that when fsync returns
everything that process wrote is on disk, even if some other process
is in the middle of modifying some of the same buffers.

(7) I don't know if I/O errors should be counted or just latched (that
is, if there are six I/O errors, do you get six successive EIOs from
fsync, or just one?) but we should decide and commit to one model or
the other. Given per-open reporting I think latched is sufficient.

(8) I'm not convinced that there's any real value in reporting exactly
what blocks failed. Most applications have nothing useful they can do
with the information, and most of the rest (e.g. database 

Re: fsync_range and O_RDONLY

2021-02-18 Thread David Holland
On Thu, Feb 18, 2021 at 06:56:00PM -0500, Greg Troxel wrote:
 > > And report any errors to me, so if you're a database and I'm feeling
 > > nasty I can maybe mess with you that way. So I'm not sure it's a great
 > > idea.
 > >
 > > Right now fsync error reporting is a trainwreck though.
 > 
 > I think that's the real problem; if I open for write and fsync, then I
 > should get status back that lets me know about my writes, regardless of
 > who else asked for sync.   Once that's fixed, then the 'others asking
 > for sync' is much less of a big deal.
 > I know, ENOPATCH.

Yeah, that's highly nontrivial. More on that in a moment...

-- 
David A. Holland
dholl...@netbsd.org


Re: partial failures in write(2) (and read(2))

2021-02-18 Thread David Holland
On Tue, Feb 16, 2021 at 05:29:00PM +0700, Robert Elz wrote:
 > We could, of course, invent new interfaces (a write variant with an
 > extra pointer to length written arg perhaps, or where the length arg
 > is a pointer to a size_t and that is read and then written with either
 > the amount written, or the amount not written).
 > 
 > But I don't believe that any of this is needed, or desirable.

Right, I think succeeding with a short count is preferable in all
cases where anyone actually cares what happened.

 > We should first make sure that we do what POSIX requires, and simply
 > return a short write count (and no error) in the cases where that
 > should happen (out of space, over quota, exceeding file size limit,
 > and writing any more would block and O_NONBLOCK is set, more?).

As far as I can tell these errors are not currently handled in this
way, except maybe the EWOULDBLOCK case.

(And there's one other: signal delivery after writing some data to a
slow device. But that already works correctly.)

 > In the other error cases we should simply leave things alone and
 > accept it - it is the way unix always has been, and we have survived.
 > If we have a drive returning I/O errors (on writes), do we really
 > expect that earlier data written will have been written correctly?

Since writes to regular files will always go into the cache and not (I
think ever, absent O_DIRECT) be written to disk directly, I don't
think that case actually arises. Instead it will be filtering through
the completely broken fsync error reporting chain. (More on that
elsewhere.)

However, for reads... if you read part of a file and then get EIO
because the disk is going bad, it's reasonably likely that the part
you did get is ok, and moreover, if what you're trying to do is rescue
data from a dying disk, chances are you _do_ want it, even if there's
a moderate chance of it being corrupted. So I kind of think the EIO
case should succeed with a short count too.

As for EFAULT, I was testing with that because it's easy to test, but
I agree that it isn't particularly useful to continue, the one thing
I'm not sure of being possible interactions with generational garbage
collectors.

 > So, let's all forget fanciful interface redesigns, fix whatever we
 > need to fix to make things work the way they are supposed to work
 > (if there is anything) and leave the rest as "the world just broke"
 > type territory.

I'm pretty sure the only on-the-fly error that _does_ work in this
sense (in the sense of being converted to success with a short count)
is EINTR.

-- 
David A. Holland
dholl...@netbsd.org


Re: fsync_range and O_RDONLY

2021-02-18 Thread David Holland
On Wed, Feb 17, 2021 at 01:17:14PM -0500, Greg Troxel wrote:
 > > Last year, fdatasync() was changed to allow syncing files opened
 > > read-only, because that ceased to be prohibited by POSIX and something
 > > apparently depended on it.
 > 
 > I have a dim memory of this and mongodb.
 > 
 > > However, fsync_range() was not also so changed. Should it have been?
 > > It's now inconsistent with fsync and fdatasync and it seems like it's
 > > meant to be a strict superset of them.
 > 
 > It seems like it might as well be.  I would expect this to only really
 > sync the file's metadata, same as the others, but I do not feel like I
 > really understand this.

Well, if you have it open for write and I have it open for read, and I
fsync it, it'll sync your changes.

And report any errors to me, so if you're a database and I'm feeling
nasty I can maybe mess with you that way. So I'm not sure it's a great
idea.

Right now fsync error reporting is a trainwreck though.

-- 
David A. Holland
dholl...@netbsd.org


fsync_range and O_RDONLY

2021-02-17 Thread David Holland
Last year, fdatasync() was changed to allow syncing files opened
read-only, because that ceased to be prohibited by POSIX and something
apparently depended on it.

However, fsync_range() was not also so changed. Should it have been?
It's now inconsistent with fsync and fdatasync and it seems like it's
meant to be a strict superset of them.

-- 
David A. Holland
dholl...@netbsd.org


partial failures in write(2) (and read(2))

2021-02-05 Thread David Holland
(This came up in chat, and since there was no agreement at all there
it seems it ought to be discussed here.)

It is possible for write() calls to fail partway through, after
already having written some data. We do not currently document the
behavior under these circumstances (though we should), and some
experimentation suggests that at least some of the behavior violates
the principle of least surprise.

Basically, it is not feasible to check for and report all possible
errors ahead of time, nor in general is it possible or even desirable
to unwind portions of a write that have already been completed, which
means that if a failure occurs partway through a write there are two
reasonable choices for proceeding:
   (a) return success with a short count reporting how much data has
   already been written;
   (b) return failure.

In case (a) the error gets lost unless additional steps are taken
(which as far as I know we currently have no support for); in case (b)
the fact that some data was written gets lost, potentially leading to
corrupted output. Neither of these outcomes is optimal, but optimal
(detecting all errors beforehand, or rolling back the data already
written) isn't on the table.

It seems to me that for most errors (a) is preferable, since correctly
written user software will detect the short count, retry with the rest
of the data, and hit the error case directly, but it seems not
everyone agrees with me.

The cases that exist (going by the errors documented in write(2)) are:
   ENOSPC/EDQUOT (disk fills during the I/O)
   EFBIG (file size exceeds a limit)
   EFAULT (invalid user memory)
   EIO (hardware error)
   EPIPE (pipe gets closed during the I/O)

In the first three cases it's notionally possible to check for the
error case beforehand, but it doesn't actually work because the
activities of other processes or threads while the I/O is in progress
can invalidate the results of any check. (Also, for EFAULT the check
is expensive.)

Some of the same cases (particularly EFAULT and EIO) exist for read.
(Note that while for ordinary files stopping a partial read,
discarding the results, and returning failure is harmless, this is not
the case for pipes, ttys, and sockets, so it also matters for read.)

We were experimenting with the EFAULT behavior by using mprotect() to
deny access to part of a buffer and then writing the whole buffer out.
The results so far (with sufficiently large buffers):

   - for pipes, ttys, and probably everything that uses ordinary
 uiomove, the data in the accessible part of the buffer is written
 out and the call fails with EFAULT.

   - for regular files on ffs and probably most things that use
 uiomove_ubc, the data in the accessible part of the buffer is
 written, the call fails with EFAULT, and the size of the file is
 reverted to what it was at the start.

   - nobody's tested sockets yet, I think.

   - in all cases the mtime is updated.

The size reversion does unwind the common case, but in other cases it
produces bizarre behavior; e.g. if you have a 1M file and you write 2M
to it and then fault, the 1M of the file is replaced with the first 1M
of what you wrote and the rest is discarded; plus given that the call
failed most users' first instinct would be to assume that nothing was
written.

The behavior is probably the same for the other errors, though I
haven't looked and it's definitely possible that ENOSPC/EDQUOT are
handled more carefully.

Anyhow, if you've made it this far, the actual question is: is the
current behavior really what we want? (Whether or not it's technically
correct, or happens to be consistent with the exact wording in the man
pages, various aspects of it seem undesirable.)

ISTM that for all these cases except EIO it's sufficient to return
success with a short count and let the user code retry with the rest
of its data. For EIO I think it's best to do that and also retain the
error somewhere for the next write attempt.

-- 
David A. Holland
dholl...@netbsd.org


Re: Reparenting processes?

2021-01-09 Thread David Holland
On Sat, Jan 09, 2021 at 08:33:27AM -0500, Mouse wrote:
 > Waking up this thread from about a month ago
 >
 > [...]
 >
 > This makes me wonder if perhaps login sessions should have their
 > stdin/stdout/stderr set up on /dev/tty instead of the actual ctty
 > device.  But if that's done, will the `real' ctty device even be open?
 > I think /dev/tty doesn't keep an open on the underlying tty device, nor
 > can I see any easy way for it to do so, since /dev/tty itself doesn't
 > get a close call until no process in the whole system, not just no
 > process with that ctty, has a descriptor open on /dev/tty.
 > 
 > I'm not convinced there is a single right answer.  I'm writing to ask
 > if anyone here has any thoughts to share on these issues.

Interesting. Need to think about that...

-- 
David A. Holland
dholl...@netbsd.org


Re: thundering pipe herds

2020-11-23 Thread David Holland
On Tue, Nov 24, 2020 at 09:09:59AM +1100, matthew green wrote:
 > > @@ -395,9 +401,8 @@ pipeunlock(struct pipe *pipe)
 > >KASSERT(pipe->pipe_state & PIPE_LOCKFL);
 > >  
 > >pipe->pipe_state &= ~PIPE_LOCKFL;
 > > -  if (pipe->pipe_state & PIPE_LWANT) {
 > > -  pipe->pipe_state &= ~PIPE_LWANT;
 > > -  cv_broadcast(>pipe_lkcv);
 > > +  if (pipe->pipe_waiters > 0) {
 > > +  cv_signal(>pipe_lkcv);
 > >}
 > >  }
 > 
 > this part misses the while loop from the freebsd version i think?

...shouldn't, that would turn it back into broadcast.

(except not even that, because the decrement doesn't happen until
wakeup so it'd loop forever)

-- 
David A. Holland
dholl...@netbsd.org


thundering pipe herds

2020-11-23 Thread David Holland
mjg at freebsd says that their pipes have a thundering herd problem
that manifests in make's token pipe stuff at high -j values. We have
the same logic, and it's easily fixed.

Completely untested patch follows; does anyone object to this assuming
that it works (possibly after minor changes)?

(pipe_waiters is inserted where it is so it occupies what's currently
struct padding on 64-bit machines)



Index: sys/pipe.h
===
RCS file: /cvsroot/src/sys/sys/pipe.h,v
retrieving revision 1.36
diff -u -p -r1.36 pipe.h
--- sys/pipe.h  22 Aug 2018 01:05:24 -  1.36
+++ sys/pipe.h  23 Nov 2020 21:23:07 -
@@ -93,8 +93,7 @@ struct pipemapping {
 #define PIPE_DIRECTR   0x080   /* Pipe direct read request (setup complete) */
 #definePIPE_LOCKFL 0x100   /* Process has exclusive access to
   pointers/data. */
-#definePIPE_LWANT  0x200   /* Process wants exclusive access to
-  pointers/data. */
+/* unused  0x200   */
 #definePIPE_RESTART0x400   /* Return ERESTART to blocked syscalls 
*/
 
 /*
@@ -114,6 +113,7 @@ struct pipe {
struct  timespec pipe_mtime;/* time of last modify */
struct  timespec pipe_btime;/* time of creation */
pid_t   pipe_pgid;  /* process group for sigio */
+   u_int   pipe_waiters;   /* number of waiters pending */
struct  pipe *pipe_peer;/* link with other direction */
u_int   pipe_state; /* pipe status info */
int pipe_busy;  /* busy flag, to handle rundown */
Index: kern/sys_pipe.c
===
RCS file: /cvsroot/src/sys/kern/sys_pipe.c,v
retrieving revision 1.148
diff -u -p -r1.148 sys_pipe.c
--- kern/sys_pipe.c 26 Apr 2019 17:24:23 -  1.148
+++ kern/sys_pipe.c 23 Nov 2020 21:23:07 -
@@ -371,13 +371,19 @@ pipelock(struct pipe *pipe, bool catch_p
KASSERT(mutex_owned(pipe->pipe_lock));
 
while (pipe->pipe_state & PIPE_LOCKFL) {
-   pipe->pipe_state |= PIPE_LWANT;
+   pipe->pipe_waiters++;
+   KASSERT(pipe->pipe_waiters != 0); /* just in case */
if (catch_p) {
error = cv_wait_sig(>pipe_lkcv, pipe->pipe_lock);
-   if (error != 0)
+   if (error != 0) {
+   KASSERT(pipe->pipe_waiters > 0);
+   pipe->pipe_waiters--;
return error;
+   }
} else
cv_wait(>pipe_lkcv, pipe->pipe_lock);
+   KASSERT(pipe->pipe_waiters > 0);
+   pipe->pipe_waiters--;
}
 
pipe->pipe_state |= PIPE_LOCKFL;
@@ -395,9 +401,8 @@ pipeunlock(struct pipe *pipe)
KASSERT(pipe->pipe_state & PIPE_LOCKFL);
 
pipe->pipe_state &= ~PIPE_LOCKFL;
-   if (pipe->pipe_state & PIPE_LWANT) {
-   pipe->pipe_state &= ~PIPE_LWANT;
-   cv_broadcast(>pipe_lkcv);
+   if (pipe->pipe_waiters > 0) {
+   cv_signal(>pipe_lkcv);
}
 }
 


-- 
David A. Holland
dholl...@netbsd.org


Re: Funded project(s) to improve Linux emulation

2020-11-12 Thread David Holland
On Thu, Nov 12, 2020 at 12:49:13PM +, nia wrote:
 > > Many years ago, I made it so that mixer operations
 > > would fall though and complete successfully (can't find the commit now).
 > > 
 > > [...]
 > 
 > Thanks. Sounds pretty similar to some problems I observed in native
 > libossaudio applications in the past.

I hope this is fully documented in the code so someone doesn't
accidentally "fix" it sometime.

-- 
David A. Holland
dholl...@netbsd.org


Re: amd64 9.1, pre-wscons text colours?

2020-11-10 Thread David Holland
On Wed, Nov 11, 2020 at 05:20:53AM -, Michael van Elst wrote:
 > >But WS_KERNEL_FG=WSCOL_LIGHT_BROWN (I wanted yellow, that looks like
 > >the closeset available approximation) and WS_KERNEL_BG=WSCOL_BLUE?
 > >Once wscons switches, everything is as I would expect - but, before
 > >that, it's white on black!
 > 
 > >So there's clearly something I don't understand going on.  What?
 > 
 > 
 > VGA has only 8 colors and an intensity attribute. You cannot select
 > the "light" ANSI colors. The VGA driver will fail attempts to
 > allocate such color attributes, so you get white on black instead.
 > 
 > if (__predict_false((unsigned int)fg >= sizeof(fgansitopc) ||
 > (unsigned int)bg >= sizeof(bgansitopc)))
 > return (EINVAL);
 > 
 > 
 > When DRM takes over, such hardware limitations no longer apply.
 > The framebuffer code can handle more colors.

VGA hardware has 16 colors in text mode, eight each bright and dim.
The "intensity attribute" is the upper of the four color selection
bits and picks the bright colors.

You can't use the bright colors as _backgrounds_ without poking a
register somewhere: by default the top background color bit causes
blinking instead. It's reasonable we might not have code to do that,
just in case someone tries to boot on a machine with an EGA card or
something that doesn't have the register.

But if we can't use all 16 foreground colors, it's because of a driver
problem; it's not a hardware limitation.

-- 
David A. Holland
dholl...@netbsd.org


Re: amd64 9.1, pre-wscons text colours?

2020-11-10 Thread David Holland
On Tue, Nov 10, 2020 at 06:39:53PM -0500, Mouse wrote:
 > NetBSD/amd64, 9.1.
 > 
 > What controls the colours used for console text output before wscons
 > takes over?  When I build a kernel with WS_KERNEL_FG=WSCOL_GREEN and no
 > WS_KERNEL_BG, I see green on black, both before and after the switch
 > from text mode to text-in-graphics mode.  WS_KERNEL_FG=WSCOL_BLACK and
 > WS_KERNEL_BG=WSCOL_BLACK, and both are black-on-black.
 > 
 > But WS_KERNEL_FG=WSCOL_LIGHT_BROWN (I wanted yellow, that looks like
 > the closeset available approximation) and WS_KERNEL_BG=WSCOL_BLUE?
 > Once wscons switches, everything is as I would expect - but, before
 > that, it's white on black!
 > 
 > So there's clearly something I don't understand going on.  What?

That's all wscons; drmkms taking over isn't the same step as wscons
taking over. But anyway: my guess is that you can't use the bright
colors. I vaguely recall some issue with that long ago...

-- 
David A. Holland
dholl...@netbsd.org


Re: autoloading compat43 on tty ioctls

2020-10-11 Thread David Holland
On Sat, Oct 10, 2020 at 03:54:32PM -, Christos Zoulas wrote:
 > Aside for the TIOCGSID bug which I am about to fix (it is in tty_43.c
 > and is used in libc tcgetsid(), all the compat tty ioctls are defined
 > in /usr/src/sys/sys/ioctl_compat.h... We can empty that file and try
 > to build the tree :-), but I am guessing things will break. Also a lot
 > of pkgsrc will break too. It is not 4.3 applications that break it is
 > applications that still use the 4.3 terminal api's.

I _think_ I killed off the last sgtty.h user in pkgsrc a few years
ago, but that's only the really old interface. There are probably
still things using the ioctls.

rooting them out would probably be a good exercise but not exactly a
high priority...

-- 
David A. Holland
dholl...@netbsd.org


Re: [PATCH 0/2] Delete CIRCLEQ

2020-10-11 Thread David Holland
On Mon, Oct 12, 2020 at 01:23:15PM +1100, matthew green wrote:
 > > Switch the last user (ypserv) from CIRCLEQ to TAILQ.
 > > This is inspired by analogous refactoring from OpenBSD:
 > > https://github.com/openbsd/src/commit/d53c0cf4d32fdbd8b42debfba068f1b0efa423dc#diff-8d0a4fbb89658213ebf314ff188581d7
 > > 
 > > Remove the CIRCLEQ API completely from the system headers and document
 > > this fact in the QUEUE(3) man-page.
 > 
 > why?  queue.h may be used by more than src.
 > 
 > i don't see any benefit except forcing working code to
 > be changed, possibly introducing bugs.
 > 
 > please leave it alone.

It's been deprecated since -7, we can remove it for -10.

Anyway, please at least do fix ypserv.

-- 
David A. Holland
dholl...@netbsd.org


Re: wait(2) and SIGCHLD

2020-09-09 Thread David Holland
On Sat, Aug 15, 2020 at 07:57:26PM -0400, Terry Moore wrote:
 > >> I would say so, especially since that would mean the child's parent is
 > >> no longer the process that forked it (which could break other use
 > >> cases).
 > >
 > > That depends on how you implement detaching, but I suppose ultimately
 > > it's important for getppid() to revert to 1 at the point the parent
 > > exits (neither before, nor after, nor never) so some kind of linkage
 > > needs to remain.
 > >
 > > Bah.
 > >
 > > I guess it's time to invent yet another different interface to
 > > fork-and-really-detach.
 > 
 > No time to experiment today, but from the descriptions it sounds as if a
 > double fork would work,
 > with the child exiting immediately after forking the grandchild? Kind of
 > unpleasant, but nothing
 > new needed?

(For the record: yes, forking twice works, that's more or less the
standard approach; but it's comparatively expensive.)

-- 
David A. Holland
dholl...@netbsd.org


Re: fsck updating but not fixing filesystem

2020-09-09 Thread David Holland
On Fri, Aug 28, 2020 at 12:02:09PM -, Christos Zoulas wrote:
 > David Holland   wrote:
 > > > Sounds like there is an in interesting fuzzing project in there for
 > > > someone - make a filesystem mage and the repeatedly damage it, then
 > > > see if fsck can fix it, then if you get a rump panic when moving
 > > > everything around, and then re-run fsck to see if it indicates any new
 > > > issues :)
 > >
 > > One can do that, but given that there are lots of edge cases and many
 > > of them will be hard to reach, formal verification might be more
 > > effective.
 > 
 > I think we should fix all filesystems to pass:
 > 
 > https://www.netbsd.org/~riastradh/tmp/dirconc.c
 > 
 > Then we can think about formal verification :-)

You know I wrote the original version of that, right? :-)

-- 
David A. Holland
dholl...@netbsd.org


Re: fsck updating but not fixing filesystem

2020-08-27 Thread David Holland
On Mon, Aug 24, 2020 at 12:01:27PM +0100, David Brownlee wrote:
 > > > I think the general consensus is that ffs can be inconsistent it ways
 > > > fsck is unable to detect.
 > >
 > > ...much less fix.  Yes.  When I was doing the program that eventually
 > > got massaged into resize_ffs, during development I had some filesystems
 > > that were definitely corrupted but that fsck was happy with.  (I rather
 > > wish I'd saved some of them as test cases, but I didn't.)

I also once after abusing rename got a filesystem where fsck -y didn't
converge; I had to newfs it.

 > Sounds like there is an in interesting fuzzing project in there for
 > someone - make a filesystem mage and the repeatedly damage it, then
 > see if fsck can fix it, then if you get a rump panic when moving
 > everything around, and then re-run fsck to see if it indicates any new
 > issues :)

One can do that, but given that there are lots of edge cases and many
of them will be hard to reach, formal verification might be more
effective.

-- 
David A. Holland
dholl...@netbsd.org


Re: fsck updating but not fixing filesystem

2020-08-23 Thread David Holland
On Sun, Aug 23, 2020 at 08:14:31PM +0100, David Brownlee wrote:
 > One of the rsync processes hung, and upon reboot fsck checked the
 > filesystem and marked it clean, but after a while it happened again,
 > and then again a third time.
 > 
 > This time I've run fsck -f repeatedly and each time it marks the
 > filesystem as clean, but the next run finds another issue.
 > 
 > This is netbsd-9 amd64 stable from nyftp, DELL, PERC H710P controller,
 > running RAID1.

Are you sure the raid is clean? If it's not you can get bizarre
behavior like this depending on which side of it any given read is
serviced from. (That is: any given fsck run will see some of one
version and some of the other and make some changes, which may or may
not be consistent with what it sees the next time, and it all might
converge or might not...)

-- 
David A. Holland
dholl...@netbsd.org


Re: wait(2) and SIGCHLD

2020-08-15 Thread David Holland
On Sat, Aug 15, 2020 at 07:24:01AM -0400, Mouse wrote:
 > >>> What I observe is that a process that explicitly ignores SIGCHLD
 > >>> (SIG_IGN), then forks a child which exits, when wait()ing for the
 > >>> child, gets ECHILD (i.e., wait returns -1 and errno is ECHILD).
 > >> And the ECHILD return is delayed until all children have terminated
 > > Huh, I hadn't realized (or expected) that.  So I guess it's wrong to
 > > implement this by just detaching the child up front...?
 > 
 > I would say so, especially since that would mean the child's parent is
 > no longer the process that forked it (which could break other use
 > cases).

That depends on how you implement detaching, but I suppose ultimately
it's important for getppid() to revert to 1 at the point the parent
exits (neither before, nor after, nor never) so some kind of linkage
needs to remain.

Bah.

I guess it's time to invent yet another different interface to
fork-and-really-detach.

 > > I'm guessing also then that it's the signal setting when the child
 > > exits that matters; I had always thought it was the signal setting
 > > when the child was forked.
 > 
 > Oh, interesting point.
 > 
 > Yes, in a test I just did [...]

Yup, me too.

-- 
David A. Holland
dholl...@netbsd.org


Re: wait(2) and SIGCHLD

2020-08-15 Thread David Holland
On Fri, Aug 14, 2020 at 03:51:12PM -0400, Mouse wrote:
 > > What I observe is that a process that explicitly ignores SIGCHLD
 > > (SIG_IGN), then forks a child which exits, when wait()ing for the
 > > child, gets ECHILD (i.e., wait returns -1 and errno is ECHILD).
 > 
 > And the ECHILD return is delayed until all children have terminated
 > (ie, until there are no outstanding children, until the ECHILD return
 > is accurate).  That's important for some use cases.

Huh, I hadn't realized (or expected) that. So I guess it's wrong to
implement this by just detaching the child up front...?

I'm guessing also then that it's the signal setting when the child
exits that matters; I had always thought it was the signal setting
when the child was forked.

"signals are a semantic cesspool"

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal to enable WAPBL by default for 10.0

2020-07-27 Thread David Holland
On Sun, Jul 26, 2020 at 11:20:37PM +, m...@netbsd.org wrote:
 > > To be explicit:
 > > 
 > > It is the same underly problem either way, and it is worse in practice
 > > with WAPBL than without because WAPBL ffs runs faster than non-WAPBL
 > > ffs and thus accumulates more unwritten blocks.
 > 
 > It looks like this difference is because FFS doesn't flush the disk
 > cache often, but if WAPBL is enabled, it does on every log write.

That would cause WAPBL to generate fewer unwritten blocks, which isn't
consistent with the observed results. (Or maybe it is, and without
this effect WAPBL would be even worse.)

But this is unlikely to be an issue in most cases, because data that
makes it to the disk-level cache is not lost just because the kernel
panics. You have to turn off the power for that.

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal to enable WAPBL by default for 10.0

2020-07-23 Thread David Holland
On Thu, Jul 23, 2020 at 07:45:08AM +0200, Micha? G?rny wrote:
 > > > Rationale: the default filesystem (FFS) without WAPBL is more prone to
 > > > data loss.
 > > 
 > > It is not, unfortunately. We had WAPBL on by default some time back
 > > and eventually switched it off.
 > > 
 > > The problem is that because it still doesn't do anything about
 > > journaling or preserving file contents, but runs a lot faster, it
 > > loses more data when interrupted.
 > 
 > How does that compare to the level of damage non-journaled FFS takes?

To be explicit:

It is the same underly problem either way, and it is worse in practice
with WAPBL than without because WAPBL ffs runs faster than non-WAPBL
ffs and thus accumulates more unwritten blocks.

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal to enable WAPBL by default for 10.0

2020-07-22 Thread David Holland
On Thu, Jul 23, 2020 at 05:17:33AM +, David Holland wrote:
 > The problem is that because it still doesn't do anything about
 > journaling or preserving file contents, but runs a lot faster, it
 > loses more data when interrupted.

Note since someone already asked: that should be read as "(journaling
or preserving) file contents".

-- 
David A. Holland
dholl...@netbsd.org


Re: Proposal to enable WAPBL by default for 10.0

2020-07-22 Thread David Holland
On Wed, Jul 22, 2020 at 11:24:16PM +0200, Kamil Rytarowski wrote:
 > I propose to enable WAPBL ("log" in fstab(5)) by default for 10.0 and newer.
 > [...]
 >
 > Rationale: the default filesystem (FFS) without WAPBL is more prone to
 > data loss.

It is not, unfortunately. We had WAPBL on by default some time back
and eventually switched it off.

The problem is that because it still doesn't do anything about
journaling or preserving file contents, but runs a lot faster, it
loses more data when interrupted.

-- 
David A. Holland
dholl...@netbsd.org


Re: pg_jobc going negative?

2020-07-18 Thread David Holland
On Fri, Jul 10, 2020 at 08:41:40PM +0700, Robert Elz wrote:
 > I have spent a little time looking at this now, and I think
 > it is just all a mess.

Indeed...

 > As best I can work out, and someone correct me if I'm wrong,
 > the whole purpose of pg_jobc is so that orphanpg() can be called
 > when a process group is orphaned (no longer has a session leader).

Yes.

A process group is orphaned if there's no longer a way to do job
control on it directly -- the most common case is when you get hung up
on and your login shell/session leader goes away, but it can also
happen if you make chains of forks and some of them exit. For example,
if you start an interactive subshell, run some stuff in the
background, then exit it, the background jobs are now orphaned.

Detecting orphaned process groups is vexing because it's a graph
connectivity problem and you don't want to/can't (for locking reasons)
go chugging around the graph explicitly.

 > I see 3 ways forward...   simply drop the KASSERT the way that FreeBSD
 > have done, and let things return to the semi-broken but rarely bothersome
 > code that was there before.   That's not really a good idea, as the
 > sanitizers apparently find problems with the code the way it was (not
 > too surprising, deleting the KASSERT won't fix the bugs, it just stops
 > noticing them explicitly).

Indeed.

 > Or, we could properly define what pg_jobc is counting, and then make sure
 > that it counts whatever that is properly - is incremented in all the
 > appropriate places, and decremented properly as well.   Currently
 > the comment about it in proc.h is:
 > /*
 >  * Number of processes qualifying
 >  * pgrp for job control 
 >  */
 > which makes it clear that it is a reference counter (not necessarily
 > counting the number of something which exists, so that something can be 
 > deleted, but it is counting references to processes).   Unfortunately
 > I have no idea what "qualifying pgrp for job control" is supposed to mean.

A processes is job-controlled by its parent process, and a process
group is job-controlled by any parent of processes in it. So a process
group can be job-controlled if some process in it has a parent in a
different process group that can also be job-controlled, going back to
the session leader. (Provided that the session has a tty, anyway.)

This becomes complicated because although normal process group usage
is completely hierarchical, it's possible to place processes in
process groups such that the graph of parent relationships ceases to
be a tree.

The code is trying to count how many processes in the process group
have a parent in a different process group that is not itself
orphaned. If this count reaches zero, the process group has become
orphaned.

I can't remember if it's possible to set up process groups where the
parent graph is cyclic. I think it might be; or at least, I don't
remember any reason why not that wouldn't also force the graph to be a
tree. But it's been a while with all this stuff and I may be
forgetting something.

Anyway, the method should work as long as the graph remains acyclic.
The implementation is probably missing cases where it should be
adjusting the count.

 > That could be done, but it seems like a lot of work to me, and not easy.

Unfortunately, I think it's the only viable way forard.

 > Another (more radical) approach would be to simply drop orphanpg()
 > completely, and thus no longer need pg_jobc at all.   The system
 > wouldn't be bothered by this at all - all orphanpg() does is locate
 > stopped members of the process group, and send then SIGCONT (to restart)
 > and SIGHUP (to probably kill them - or at least inform them they their
 > environment has changed).

Unfortunately, removing it violates POSIX, which prescribes all this
stuff.

 > Third, and the option I'd suggest, is to revert to more basic principles,
 > remove the pg_jobc attempt to detect when a session leader has exited,
 > or changed to a different process group, and instead at candidate events
 > (any time a process leaves a process group, for any reason) check if that
 > process was the session leader, and if it is, clean up any now orphaned
 > stopped processes.   This is likely to be slower than the current attempt,
 > but is much easier to make correct (and much less likely to cause problems,
 > other than dangling orphaned stopped processes, if incorrect).

I don't think that's sufficient, because it doesn't take into account
the cases where jobs become disconnected from the session leader
without the session leader exiting. Which is easy to arrange: su, then
vi, ^Z, exit. :-/

-- 
David A. Holland
dholl...@netbsd.org


Re: digest so far? (Re: makesyscalls (moving forward))

2020-06-15 Thread David Holland
On Tue, Jun 16, 2020 at 12:21:33AM +0100, Roy Marples wrote:
 > On 16/06/2020 00:07, David Holland wrote:
 > > Kamil thinks that. I don't see why. Compiling data into tools just
 > > complicates everything.
 > 
 > libterminfo.so has some terms compiled into it if the databases are
 > unreadable for any reason. A handy fallback.
 > 
 > dhcpcd also embedds dhcpcd.conf definitions for all RFC DHCP/DHCPv6 and
 > applicable ND options. There is an option to install it as an external file
 > (not enabled in NetBSD), but then it's just another moving part to
 > maintain.

Those aren't build-time tools and have much more severe operational
requirements :-)

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls (moving forward)

2020-06-15 Thread David Holland
On Mon, Jun 15, 2020 at 10:56:04PM +, David Holland wrote:
 > A less glib example: line 3186 of vfs_syscalls.c, in stat or more
 > precisely sys___stat50, has a handwritten call to copyout() to
 > transfer the stat results back to userspace.

To amplify:

Currently syscalls.master says:

439  STD  RUMP  { int|sys|50|stat(const char *path, struct stat *ub); }

Currently that generates the following objects related to handling
system calls:

   --

struct sys___stat50_args {
syscallarg(const char *) path;
syscallarg(struct stat *) ub;
};

struct sysent sysent[] = {
...
{
.sy_narg = sizeof(struct sys___stat50_args) / sizeof(register_t),
.sy_argsize = sizeof(struct sys___stat50_args),
.sy_flags = SYCALL_ARG_PTR,
.sy_call = (sy_call_t *)sys___stat50
},  /* 439 = __stat50 */
...

   --

and everything behond that is handwritten. The current structure of
the handwritten code is:

   sys___stat50 -> do_sys_statat -> namei and vn_stat

where sys___lstat50, sys_fstatat (recall that POSIX misnamed it, it's
what you'd expect to be called "statat") and several compat entry
points via do_sys_stat share do_sys_statat.

sys___stat50, sys___lstat50, sys_fstatat, and the various compat entry
points handle the copyout() for the stat buffer. The copyin() for the
path happens in do_sys_statat.

In a world where all this is generated, the entry point in
vfs_syscalls.c is do_sys_statat, and it receives only kernel
pointers. The system call table points to an autogenerated
sys___stat50 that looks something like this:

int
sys___stat50(struct lwp *l,
const struct sys___stat50_args *uap,
/*
syscallarg(const char *) path;
syscallarg(struct stat *) ub;
*/
register_t *retval)
{
struct pathbuf *pb;
struct stat sb;
int error;

error = pathbuf_copyin(SCARG(uap, path), );
if (error) {
return error;
}

error = do_sys_statat(l, AT_FDCWD, pb, FOLLOW, );
if (error) {
pathbuf_destroy(pb);
return error;
}

error = copyout(, SCARG(uap, ub), sizeof(sb));

pathbuf_destroy(pb);
return error;
}

sys___lstat50 would be the same except that it passes NOFOLLOW, and so
would the various compat entry points except that they'd also contain
a call to translate the output stat structure before copying it out.

One disadvantage is that there's now one copy of the pathbuf_copyin
call for each entry point instead of a single shared one; but (a) I'm
not convinced this is important and (b) it could be avoided in the
code generator if we really wanted to.

However, there are a number of advantages:
   - userspace pointers are not exposed to or handled by handwritten
 code and the chances of mishandling them (which is a security
 issue) decreases sharply;
   - the code generator is far less likely to produce wrong or missing
 error path code;
   - once this is all in place, nobody needs to think about this code
 again.

The current syscalls.master can't express everything needed to do
this; I would expect the declaration to look something like

   err statat(fd fd, pathname path, struct stat *OUT ub, atflags flags);
   err stat(pathname path, struct stat *OUT ub) =
  statat(AT_FDCWD, path, ub, 0);

then for the table entry, something like

   439 nb50 rump stat

not to mention

   38 compat43 modular stat
   188 compat12 modular stat
   278 compat30 nb13 modular stat
   387 compat50 nb30 modular rump stat

and in the compat_ultrix table,

   38 ultrix modular stat

as there have been quite a few versions of stat over the years.

Note that it distinguishes "err" and the particular type of flags from
"int", knows what a pathname is, etc., all of which is necessary or at
least helpful for generating the code one wants in various
circumstances.

There are probably some wrinkles with the compat declarations that I'm
not on top of yet, but it'll become clear later. (This has to be done
by incrementally replacing handwritten with generated code, or it'll
never work properly; the system and the interactions of all the compat
stuff is too complicated.)

-- 
David A. Holland
dholl...@netbsd.org


Re: digest so far? (Re: makesyscalls (moving forward))

2020-06-15 Thread David Holland
On Mon, Jun 15, 2020 at 10:15:09PM +0200, Reinoud Zandijk wrote:
 > It would be great if that can be done for the ioctls dispatch code
 > as well and cover their copyin/copyout stuff, its now all over the
 > place; I think that would save a lot of hidden bugs. Is there a
 > specification that resembles the syscalls.master for ioctls?
 > Shouldn't there be one?

There is not and there should be, that's part of the goal.

 > Reading the discussion, I got the idea that the preferred place to
 > store the definitions for userland usage is internally in the
 > compiled makesyscalls; compiled with the source tables in
 > BSDSRCDIR/sys and not externally on disc.

Kamil thinks that. I don't see why. Compiling data into tools just
complicates everything.

Expect the installed description file to be something generated during
the system build, not a copy of syscalls.master.

Whether it matches the kernel is a red herring. If you have updated
the system properly, it will match /usr/include, and that is expected
to be consistent with the kernel the same way that it always has been.

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls (moving forward)

2020-06-15 Thread David Holland
On Mon, Jun 15, 2020 at 09:19:19AM +0200, Martin Husemann wrote:
 > > It seems to me that all of the following is mechanical and should be
 > > automatically generated, beyond what makesyscalls already does:
 > >- all the code that calls copyin/copyout
 > 
 > It is probably too early and I had too few coffee - but could
 > you point me at an example line of code that does copyin/copyout for
 > syscall args that you think should be replaced with automatically
 > generated code? How much of that generated code would be not from a verbatim
 > C block in the syscall description file?

Glib answer:
   % grep -w copyin kern/*.c | wc -l
151

A less glib example: line 3186 of vfs_syscalls.c, in stat or more
precisely sys___stat50, has a handwritten call to copyout() to
transfer the stat results back to userspace.

None of it needs to be verbatim blocks in the description file. The
last time I did this the code generator covered everything except argv
blocks, and that was with a pretty simpleminded awk script.

Generating this code has been standard practice in research systems
since the 90s.

The headline benefit of generating it rather than writing and
maintaining it by hand (besides the obvious, lower maintenance costs)
is that it becomes much harder to accidentally mix user and kernel
pointers.

 > >- compat32 translation for all syscalls and all ioctls
 > 
 > Tricky, but maybe doable. Not sure it will work for all.

I'm pretty sure it will. It's just a matter of reading the type
declarations and generating code that moves from 32 to 64 and back.
The hard part is getting all the type declarations in place so that
the tool can know what needs to be done, but having that is worthwhile
and part of the point of this exercise.

 > >- compat_otheros translation as well
 > 
 > I have no idea how that would work (or what exactly you mean).

A lot of it is the same kind of logic as compat32: for example the
first half of ultrix_sys_open() is about translating the Ultrix
representation of the arguments to the NetBSD representation. Other
parts are the same as the native NetBSD call handling code, except
using the Ultrix types. Some parts probably can't be done
automatically, like the second part of ultrix_sys_open() that
implements the vintage controlling tty behavior.

 > Rump and anything that needs to serialize/deserialize syscalls are
 > different beasts, and they could benefit from a common syscall
 > "protocol" definition, and maybe in the end it could turn out that
 > we do want to make that description the master source of our own
 > syscall definitions.

The demarshaling done with copyout and the marshaling/demarshaling
done to serialize system calls into a byte stream are the same kind of
thing, even though the representations are different; if you can
generate one, you can generate the others.

 > I have no idea how sanitizers fit in here.

They need wrapper functions for system calls, which need to know much
the same things as marshaling/demarshaling code.

 > Maybe start with the basics and explain things from ground up
 > before diverging into the hard issues (like the name and install
 > location).

Well... sure except now I'm not sure where to start. Didn't want to
start with lengthy explanations of things everyone already knows. :-/

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls (moving forward)

2020-06-14 Thread David Holland
On Sun, Jun 14, 2020 at 02:21:07PM -0700, Paul Goyette wrote:
 > > (2) What is the installed syscall description file called and where
 > > does it go? It is likely to be derived from (i.e., not the same as)
 > > syscalls.master. It isn't a C header file so it doesn't belong in
 > > /usr/include. It isn't necessarily particularly human-readable. My
 > > suggestion would be to add a directory in /usr/share for API
 > > descriptions and put it there, something like maybe
 > > /usr/share/api/syscalls.def.
 > 
 > Perhaps /usr/share/sys/syscalls.def ?
 > 
 > I'd suspect we might find more .../sys/... stuff (compared to the
 > amound of .../api... stuff) in the future which could minimize the
 > lonliness of syscalls.def  :)  Or even maybe /usr/share/kern/...
 > might work.

I was thinking that other machine-readable API definitions might
appear in the future, e.g. for parts of libc and for other base
libraries.

Wherever we put it now, it's going to be lonely to begin with :-/

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls (moving forward)

2020-06-14 Thread David Holland
On Sun, Jun 14, 2020 at 11:59:54PM +0200, Johnny Billquist wrote:
 > > "usr.bin/makesyscalls" sounds good to me.
 > 
 > Uh? usr.bin is where stuff for /usr/bin is located, right? Anything there
 > should be pretty normal tools that any user might be interested in. Don't
 > seem to me as makesyscalls would be a tool like that?
 > 
 > Possibly some sbin thing, but in all honestly, wouldn't this make more
 > sense to have somewhere under sys? Don't we have some other tools and bits
 > which are specific for kernel and library building?

So this was, in fact, a discussion I was intending to provoke, because
right now we have no place to put tools that are part of the system
build but don't make any sense to install in /usr, and that seems like
a gap.

However, Kamil convinced me that there will be external users of this
thing. As I mentioned in the original mail, apparently the llvm
sanitizers already have a netbsd-specific script that tries to read
the existing syscalls.master, and any future sanitizer-like thing will
need something similar if we don't provide a facility. Having
3rd-party sources like this groping in our tree trying to read an
internal file whose format and semantics we don't support is not the
right way.

Meanwhile it doesn't belong in sbin because it doesn't require root,
nor does doing something useful with it require root, and it doesn't
need to be on /, so... usr.bin. Unless we think libexec is reasonable,
but if 3rd-party code is going to be running it we really want it on
the $PATH, so...

-- 
David A. Holland
dholl...@netbsd.org


makesyscalls (moving forward)

2020-06-14 Thread David Holland
As I mentioned a few days ago the reason I was prodding
makesyscalls.sh is that I've been looking at generating more of the
system call argument handling code.

It seems to me that all of the following is mechanical and should be
automatically generated, beyond what makesyscalls already does:
   - all the code that calls copyin/copyout
   - compat32 translation for all syscalls and all ioctls
   - compat_otheros translation as well

The first of these has been routine in research OSes for 25+ years and
offers no particular difficulties other than it'll take a fair amount
of coding. The second and third follow directly from a good solution
to the first, modulo some semantic concerns about schema translation
that I believe aren't relevant for the things we need to do. (As in,
anything semantically complicated is going to end up being written by
hand anyway and the translation itself is about different
representations of the same data.)

Past experience (this will not be the first or I think even the second
syscall marshalling generator I've had the pleasure of writing) says
is that doing this well requires real compiler tooling and
infrastructure. The copyin/copyout code can be, but should not be,
generated with just an awk script. Going beyond that requires a real
tool with sources that get compiled.

Meanwhile looking at the large amount of cutpaste in the current
makesyscalls.sh, the number of similar output files already generated,
and talking with Kamil about what's needed by sanitizers, makes me
think that at least part of the generation should be driven by output
templates. That is, you run some tool and you feed it an output
template that describes one set of things and you get the in-kernel
copyin/copyout code. Then, you feed it some other template and out
comes the rump version, a third template and you get sanitizer
wrappers, etc. etc. I will need to think some about how to do this
effectively (it is not as straightforward as just spewing code out)
but I don't think it is going to be particularly difficult.

However, it means that the tool needs to be installed in the system
and so does the master description file for at least NetBSD's own
system calls. (Currently, it seems that the llvm sanitizers build
reads syscalls.master itself from a source tree, so it needs a system
source tree to build, which is bad.)

This raises two points that need to be bikeshedded:

(1) What's the new tool called, and where does it live in the tree?
"usr.bin/makesyscalls" is fine with me but ymmv.

(2) What is the installed syscall description file called and where
does it go? It is likely to be derived from (i.e., not the same as)
syscalls.master. It isn't a C header file so it doesn't belong in
/usr/include. It isn't necessarily particularly human-readable. My
suggestion would be to add a directory in /usr/share for API
descriptions and put it there, something like maybe
/usr/share/api/syscalls.def.


Note that the eventual tool that gets committed will be written in C,
but the development version will probably be written in something with
stronger types. For this reason I expect to not be committing
development versions until things are pretty much done.

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls

2020-06-14 Thread David Holland
On Fri, Jun 12, 2020 at 04:40:28PM +0200, Reinoud Zandijk wrote:
 > > Yes, it can be rewritten in C as a subsequent step. *After* quite a
 > > bit of tidying. And no, I'm not doing that now. Among other problems,
 > > compiling it requires bikeshedding where to put it in the tree. Feel
 > > free to sort that out.
 > 
 > I'd say 'C'. If the specification is read in, sanity checking can
 > be added on the read in datastructure too wich is hard to do in awk
 > and `friends'. Things like missing compat definitions, missing
 > syscall numbers etc can be printed out to be noted as non-existent
 > etc. I don't know about structure versioning and system calls as
 > for compat but generating them seems the right way to go.

C is not a great language for writing code generators, but it's what
we've got. There's nothing better that we want in base or that I'd be
willing to suggest accepting an external dependence for.

Anyway, the conclusion seems to me that I'm not going to check in the
Python version of the script.

-- 
David A. Holland
dholl...@netbsd.org


sys/rump/kern/lib/libsys_*/syscalls.master

2020-06-14 Thread David Holland
Does anyone know what the files sys/rump/kern/lib/libsys_*/syscalls.master 
are for and/or how they're used?

They don't seem to have been built with makesyscalls; there's some
suggestion that makesyscalls might be run on them on the fly during a
build (which would be wrong)... but the makefile logic for maybe doing
that doesn't seem to be used.

However, this is rump and the only things that are obvious are
misleading.

(The * in libsys_* expands to: linux, cygwin, sunos, for whatever
that's worth.)

-- 
David A. Holland
dholl...@netbsd.org


Re: makesyscalls

2020-06-10 Thread David Holland
On Wed, Jun 10, 2020 at 01:25:03AM -0400, Thor Lancelot Simon wrote:
 > Could you translate your prototype into a
 > different language, one that's less basically hostile to our build system
 > and its goals and benefits?

Like which one? You removed the part of the post that explained that
there aren't other reasonable choices.

Yes, it can be rewritten in C as a subsequent step. *After* quite a
bit of tidying. And no, I'm not doing that now. Among other problems,
compiling it requires bikeshedding where to put it in the tree. Feel
free to sort that out.

As for lua: it has the same headline issues as awk, namely it doesn't
enforce function arguments and doesn't require that variables are
declared before use. Again, I see no reason to think it'll be any more
maintainable.

-- 
David A. Holland
dholl...@netbsd.org


  1   2   3   4   5   6   7   8   >