Re: poll(): IN/OUT vs {RD,WR}NORM

2024-05-28 Thread Robert Elz
Date:Tue, 28 May 2024 22:46:09 -0400 (EDT)
From:Mouse 
Message-ID:  <202405290246.waa17...@stone.rodents-montreal.org>

  | I question whether it actually works except by accident; see RFC 6093.

I hadn't seen that one before - I stopped seriously following the IETF
around the end of the last millennium, as it was becoming way too
commercially based (decisions no longer based purely on technical merit)
with way, way too much bureaucracy.

Aside from the middlebox problem (of which I wasn't aware -- and IMO
anything in the middle of the internet which touches anything above the
IP layer is simply broken - I know NAT forces a little of that,
but NAT is also simply broken) there is nothing new in there, except
that their change from the Hosts Requirement solution for the off-by-one
issue was the wrong way to go.   The HR group discussed that at length:
using the "last byte of the urgent data" is safe; using that + 1 is not,
in that a system receiving urgent data which believes it should be +0
will be waiting for one more byte to arrive, which might never be sent,
if the transmitter is using +1.   On the other hand, if the transmitter
uses +0 and the receiver is expecting it to be +1, all that happens is
that the U bit turns off one byte sooner, all the urgent data is still
there and available to be read (of course, if anything is pretending this
is one byte of out-of-band data it fails either way - but that, as that RFC
says, is simply broken).   (The uninteresting cases where both sender and
receiver use the same interpretation aren't worthy of mention).

Of course, if essentially the whole internet has settled on the +1 version
(the original specification, instead of the example code) then perhaps
that change may have been warranted - I certainly haven't surveyed anything
to see which way various systems actually do it, and I expect a lot of
the original systems are long gone by now.

  | But only a few implementors paid any attention, it appears.

Does the BSD stack not do this the way that HR defined things?   I thought
that was changed way way back, before CSRG stopped generating the code.

  | But the facility it provides is of little-to-no use.  I can't recall
  | anything other than TELNET that actually uses it,

TELNET and those protocols based upon it (SMTP and the FTP control
connection at least).  SMTP has no actual use for urgent data, and never
sends any, but FTP can in some circumstances I believe (very ancient,
unreliable memory).

  | Furthermore, given that probably the most popular API to TCP, sockets,
  | botched it by trying to turn it into an out-of-band data stream,

Yes, that was broken.

  | then botched it further by pointing the urgent sequence number to
  | the wrong place,

In fairness, when that was done, it wasn't clear it was wrong - that
all long predated anyone even being aware that there were two different
meanings in the TCP spec; people just used whichever of them was most
convenient (in terms of how it was expressed, not which is easier to
implement) and ignored the other completely.   That's why it took
decades to get fixed - no-one knew that the spec was broken for a long
time.

Further, if used properly, it really doesn't matter much - the application
is intended to recognise the urgent data by its content in the data stream;
all the U bit (& urgent pointer) should be doing is giving it a boot up
the read stream to suggest that it should consume more quickly than it
otherwise would.  Whether that indication stops one byte earlier or later
should not really matter.

The text in that RFC about multiple urgent sequences also misses that, I
think - all that matters is that as long as there is urgent data coming,
the application should be aware of it and modify its behaviour to read
more rapidly than it otherwise might.  (If it never delays reading from the
network, and always receives & processes packets as soon as they arrive -
which, for example, systems which do remote end echo need to do - then it
doesn't need to pay attention to the U bit at all.)

If there are multiple sequences that demand speedy processing, each should
be processed when it is encountered, and if that affects what is done
with other, "normal" data that is also being read quickly, that's just an
aspect of the application protocol.
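
To make that concrete, here is a minimal receiver-side sketch (mine, not
from any of the mail quoted here), using the sockets facilities that do
exist for this (SO_OOBINLINE and SIOCATMARK) to keep the urgent data
in-band and just drain the stream quickly up to the mark:

    /*
     * Sketch: keep "urgent" data in the normal stream and read quickly
     * until the mark is reached; the application recognises the urgent
     * data itself, by content, as described above.
     */
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void
    read_fast_to_mark(int fd)
    {
            char buf[512];
            int atmark;

            for (;;) {
                    if (ioctl(fd, SIOCATMARK, &atmark) == -1)
                            break;  /* not a socket?  give up */
                    if (atmark)
                            break;  /* next read returns the urgent byte */
                    if (read(fd, buf, sizeof buf) <= 0)
                            break;
                    /* scan buf for the protocol's urgent marker here */
            }
    }

(SO_OOBINLINE would have been set on fd earlier, via setsockopt(2), so
the urgent byte is not diverted out of the stream.)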

kre

ps: I am not suggesting that anyone go design new protocols to use urgent
data, just that the system isn't nearly as broken as some people like to
claim.




Re: poll(): IN/OUT vs {RD,WR}NORM

2024-05-28 Thread Robert Elz
Date:Tue, 28 May 2024 11:03:02 +0200
From:Johnny Billquist 
Message-ID:  <3853e930-4e77-4f6d-8a73-ec826a067...@softjar.se>

  | This is a bit offtopic, but anyway...

So it is, but anyway...

[Quoting Mouse:]
  | > TCP's urgent pointer is well defined.  It is not, however, an
  | > out-of-band data stream,

That's correct.

  | > However, the urgent pointer is close to useless in today's network, in
  | > that there are few-to-no use cases that it is actually useful for.

That's probably correct too.  It is however still used (and still works)
in telnet - though that is not a frequently used application any more.

[end Mouse quotes]

  | It was always useless. The original design clearly had an idea that they 
  | wanted to get something, but it was never clear exactly what that 
  | something was, and even less clear how the urgent pointer would provide it.

That's incorrect.   It is quite clear what was wanted, and aside from a
possible off-by-one in the original wording, it was quite clear how it
worked - and it did work.

The U bit in the header simply tells the receiver that there is some
data in the data stream (which is not sent out of band) that it probably
should see as soon as it can, and (perhaps, this depends upon the application)
that temporarily suspending any time consuming processing of the intervening
data (such as passing commands to a shell to be executed) would be a good
idea, until the "urgent" data has been processed.

The urgent pointer simply indicates where in the data stream the receiver
needs to have processed to have encountered the urgent data.  It does not
(and never did) "point to" the urgent data.   [That's where the off-by-one
occurred: there were two references to it, one suggesting that the urgent
pointer would reference the final byte of what is considered urgent, the
other that it would reference one beyond that, that is, the first byte
beyond the urgent data.   This was corrected in the Hosts Requirements
RFCs, somewhere in the late 80's if I remember roughly.]   The actual data
considered
as urgent could be any number of bytes leading up to that, depending upon
the application protocol.   The application was expected to be able to
detect that, provided it actually saw it in the stream - the U bit (which
would remain set in every packet until one was sent containing no data
that included or preceded any of the urgent data) just allows the receiver
to know that something is coming which it might want to look for - but it
is entirely up to the application protocol design to decide how it is to
be recognised, and what should be done because of it ("nothing" could be
a reasonable answer in some cases).
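
As a concrete illustration (my sketch, not from the original message):
with the sockets API, a sender marks data urgent via MSG_OOB - the data
still travels in sequence, only the U bit and urgent pointer are
affected.  A telnet-style interrupt might be sent like this:

    #include <sys/socket.h>

    /* telnet IAC IP ("interrupt process"), then IAC DM ("data mark"),
     * with the DM made urgent so the peer's U bit gets set */
    static const unsigned char iac_ip[] = { 255, 244 };
    static const unsigned char iac_dm[] = { 255, 242 };

    static int
    send_interrupt(int fd)
    {
            if (send(fd, iac_ip, sizeof iac_ip, 0) == -1)
                    return -1;
            /* MSG_OOB marks the last byte sent (the DM) as "urgent" */
            return send(fd, iac_dm, sizeof iac_dm, MSG_OOB) == -1 ? -1 : 0;
    }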


That is all very simple, and works very well, particularly on high
latency or lossy networks, as long as you're not expecting "urgent"
to mean "out of band" or "arrive quickly" or anything else like that.

It is (was) mostly used with telnet to handle things like interrupts,
where the telnet server would have received a command line, sent that
to the shell (command interpreter) to be processed, and is now waiting
for that to be complete before reading the next command - essentially
using the network, and the sender, as buffering so that it does not need
to grow indefinitely big buffers if the sender just keeps on sending
more and more.

In this situation, if the sender tries to abort a command, when someone
or something realises that it will never finish by itself, then (given that
TCP has no out of band data, which vastly decreases its complexity, and
by so doing increases its reliability) there's no way for the sender to
communicate with the server to convey a "stop that now" message.   And do
remember that all this was designed before unix existed (before RFC's existed,
you need to go back to the original IEN's) when operating systems didn't
work like unix does - it was possible that only one telnet connection
could be made to a destination host (not a TCP or telnet restriction, but
imposed by the OS not providing any kind of parallel processing or 
multi-tasking), so simply connecting again and killing the errant process
wasn't necessarily possible.   Character echo was often done by the
client, not by sending the echoed characters back from the server.
A very different world to the one we're used to.

The U bit (and the urgent pointer which is just a necessary accessory,
not the principal feature) allowed this to be handled.   When the client
had something that needed attention to send, it would send that as "urgent"
data.  But that would just go in sequence with previously sent data (which
in the case of telnet, where the receive window doesn't often fill, was
probably already in the network somewhere) - however the U bit can be set
in the header of every packet transmitted, including retransmits of earlier
data, or even in an in sequence, no data, packet, and will be - with the
sender sending a duplicate, or empty, packet if needed to g

Re: MNT Reform2 USB LCP flash

2024-01-26 Thread Robert Elz
Date:Fri, 26 Jan 2024 09:26:38 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | Fortunately the drive geometry isn't really used anywhere. All
  | accesses just use the logical block addresses.

I have been meaning to suggest for ages that we remove all the
geometry nonsense from everywhere in the kernel, except those
drivers that actually need it - those should be responsible for
converting block numbers to CHS in a way that works for them,
if they really need it (ancient ide drives before LBA addressing,
vax massbus drives, sun xd drives ... anything like that which
almost no-one has seen in decades).

It is just bizarre to see ssd and even nvme 'drives' claiming
to have cylinders and heads!

kre


Re: MNT Reform2 USB LCP flash

2024-01-26 Thread Robert Elz
If you are able, try building a kernel with the patch below.

I suspect this should probably apply without too many problems
to any reasonably modern NetBSD kernel version, patch is to
src/sys/dev/scsipi/sd.c

If patch(1) won't just work on your kernel sources, just
edit that file, search for "fabricating" then a few lines
under that you'll see where secs & heads are arbitrarily
set, then cyls = size/(64*32) ... when size (not the actual
var name) is < 64*32 (eg: if it happened to be 68), it is
easy to see how cyls gets to be 0.   Just test for that,
and set cyls = 1 if the division made 0.

Not sure yet if that will be a complete solution to the
problem - this kind of issue may occur elsewhere - but it
should be a start.

kre



Index: sd.c
===
RCS file: /cvsroot/src/sys/dev/scsipi/sd.c,v
retrieving revision 1.335
diff -u -r1.335 sd.c
--- sd.c	28 Aug 2022 10:26:37 -0000	1.335
+++ sd.c	26 Jan 2024 08:38:34 -0000
@@ -1769,6 +1769,8 @@
dp->heads = 64;
dp->sectors = 32;
dp->cyls = dp->disksize / (64 * 32);
+   if (dp->cyls == 0)  /* very small devices */
+   dp->cyls = 1;   /* round up # cyls */
}
dp->rot_rate = 3600;
 


Re: PSA: Clock drift and pkgin

2023-12-24 Thread Robert Elz
Date:Sun, 24 Dec 2023 13:49:53 +0100
From:Johnny Billquist 
Message-ID:  

  | In my opinion, all of these POSIX calls that take a time argument should 
  | really have been done the same as clock_gettime(), in that you specify 
  | what clock it should be based on.

The next version of POSIX will contain pthread_cond_clockwait() which is
just like pthread_cond_timedwait() but has a clock_id parameter.

  | As it is now, it is (or should be according to POSIX) unconditionally 
  | CLOCK_REALTIME.

Not sure about the current released standard, and too lazy to look ...
but in the coming one that's not true either:

The pthread_cond_timedwait() function shall be equivalent to
pthread_cond_clockwait(), except that it lacks the clock_id argument.
The clock to measure abstime against shall instead come from the
condition variable's clock attribute which can be set by
pthread_condattr_setclock() prior to the condition variable's
creation. If no clock attribute has been set, the default shall be
CLOCK_REALTIME.
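
For illustration, a minimal sketch (mine, not from the standard's text)
of the attribute route described above - making the timeout be measured
against CLOCK_MONOTONIC rather than the default CLOCK_REALTIME:

    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv;

    static int
    wait_up_to_five_seconds(void)
    {
            pthread_condattr_t attr;
            struct timespec abstime;
            int error;

            pthread_condattr_init(&attr);
            pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
            pthread_cond_init(&cv, &attr);
            pthread_condattr_destroy(&attr);

            /* abstime is now interpreted against CLOCK_MONOTONIC */
            clock_gettime(CLOCK_MONOTONIC, &abstime);
            abstime.tv_sec += 5;

            pthread_mutex_lock(&mtx);
            error = pthread_cond_timedwait(&cv, &mtx, &abstime);
            pthread_mutex_unlock(&mtx);
            return error;   /* ETIMEDOUT if never signalled */
    }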

kre



Re: kern.boottime drift after boot?

2023-10-10 Thread Robert Elz
Date:Tue, 10 Oct 2023 12:42:48 +0100
From:David Brownlee 
Message-ID:  


  | I have a system which records the output of "sysctl -n kern.boottime"
  | as part of a dhcpcd-exit.hook to ensure some processing only occurs
  | once per boot.

Cron's @reboot might help with that - that's its purpose.  See crontab(5).
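
For example, a crontab(5) entry like (script name hypothetical):

    @reboot /usr/local/sbin/once-per-boot.sh

runs the script once, when cron starts after each boot, with no need to
record boot times at all.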

  |  kern.boottime (KERN_BOOTTIME)
  |  A struct timespec structure is returned.  This structure contains
  |  the time that the system was booted.  That time is defined (for
  |  this purpose) to be the time at which the kernel first started
  |  accumulating clock ticks.

That's correct; the issue is that the kernel doesn't really know what the
time is early in the boot sequence - it just takes a guess, based either
upon the RTC if the system has one (those tend not to be very accurate),
or the last mod time of the root filesystem (much less accurate) otherwise.

As Crystal said, as the system time is corrected, the kernel can form a
better idea of what the time actually was when the system booted, based upon
the corrections that are being made to the current time of day.

kern.boottime always contains the time that the system believes it was
booted, as best it knows what that was.   The man page section you quoted
above is correct, and doesn't need updating.

kre



Re: CVS commit: src

2023-09-26 Thread Robert Elz
Date:Sun, 17 Sep 2023 20:07:39 +
From:"Greg Oster" 
Message-ID:  <20230917200739.b9dadf...@cvs.netbsd.org>

  | Implement hot removal of spares and components.  From manu@.
  |
  | Implement a long desired feature of automatically incorporating
  | a used spare into the array after a reconstruct.

I haven't looked at the changes, but is/was there anything in there
(or one of the other recent changes) which allows for spares to remain
as spares in autoconfigured raid sets after a reboot ?   That is, to be
recorded in the filesystem (mostly empty for a spare I assume) that it
is one, and be detected at boot time.

Also, pie in the sky request - any ability for a spare to be able to
function as a spare for multiple different raid sets (ones with components
of at least a similar size I'd presume, but as long as the spare is big
enough, it shouldn't matter if some raid sets use smaller components) ?

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-25 Thread Robert Elz
Date:Mon, 25 Sep 2023 09:45:25 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | There can be multiple EFI system partitions on a drive,

That was my understanding from reading the spec.

  | but it sometimes confuses software,

What can I say to that...

  | some boot procedures will only handle the first ESP.

See my reply to RVP's reply just below.

  | But you "should" be able to select an ESP in UEFI just like you would
  | select a boot device in BIOS.

There's no problem picking the correct ESP from which to load efibootx64.efi
(or whatever the magic name is) - that part works fine, and if I have multiple
ESPs with NetBSD's efiboot in them I can pick whichever one I want (I think,
which probably gives me another option for testing updated efiboot code).

It is just boot.cfg which is giving problems.

r...@sdf.org said:
  | The first EFI partition. Without a bootme etc., flag, the NetBSD UEFI
  | bootloader tries to read /EFI/NetBSD/boot.cfg from there, first. 

I wondered about that when I saw Edgar's first message earlier - it
suddenly occurred to me that I normally never go near that ESP (that one
was set up by the system builders to boot wintrash - a subject which has
no interest to me, but which I have retained intact (just reduced the size
of the wintrash partition) so it can be used for repairs, which I suspect
is what just happened).

So, I went and checked, and it turns out that I was (apparently) very
thorough when I was scattering boot.cfg files around; I discovered that
I have:

./EFI/Boot/boot.cfg
./EFI/NetBSD/boot.cfg
./EFI/boot.cfg
./boot.cfg

(that is relative to /mnt, where I briefly mounted that partition).

None of those are used...  (there's no NetBSD boot code there, EFI/Boot
is a wintrash boot directory, but the boot.cfg file doesn't seem to bother 
that).

  | I've hacked my bootloader locally to make it try a /EFI/NetBSD/boot.cfg on
  | the partition where it itself was loaded from

I'd suggest that's what it should always attempt first, absent other info,
rather than being a local hack for you.

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-25 Thread Robert Elz
Date:Mon, 25 Sep 2023 05:57:49 +
From:Emmanuel Dreyfus 
Message-ID:  

  | boot.cfg is searched in EFI partition /EFI/NetBSD/boot.cfg

Which EFI partition?   I think I have about 5 or 6, sprinkled around
various bootable devices (more than one on some).   None of them (boot.cfg
files) are used.   But in most of them I have (with a different banner)
boot.cfg files in their root directory, EFI directory, EFI/NetBSD
directory - and I think in a "boot" directory when one of those happens
to exist - maybe created by either wintrash (builders used) or linsux
(which I installed briefly in the very early days to test the system more
thoroughly - all removed now, except some vestigial boot nonsense that
sometimes gets in the way until I tell the firmware (again) to not use it,
as it has nothing to boot any more).

  | and root partition /boot.cfg. 

That's impossible - boot.cfg says where the root partition is to be
found, so boot.cfg has to be found first.   (But see just below for an
interpretation; and it is of much less consequence to me what is chosen
as root if boot.cfg doesn't say, one way or another, which filesystem that
should be, fallback defaults are of far less interest to me.)

I have several candidate roots, for NetBSD 10, current, whatever I
am testing this week, an experimental one (perhaps not working), ...
There's no sane way to load /boot.cfg from "it" (though I have one in at
least the root partition I use most of the time, which is my testing one,
which unsurprisingly, isn't used either).

However, if what you meant by:

  | Bootme tells bootstrap where to look for the root partition

is that it tells the bootstrap code where to locate boot.cfg,  whether
or not that also later happens to become the NetBSD root, via specification,
or as a default if there is no spec, then I think we are more or less in
agreement.

Of course I'd like it more if it worked, and I'll admit that updating
boot code is the thing I like to do least, as it is the most likely thing
I can do which will wedge the system and need real assistance (as in booting
from optical or USB, or something) to recover - so the efiboot that I am
currently using must be 16 months old now (late May 2022) - I have been
meaning to work out a safe way to update and try the current version - and
I actually have a fairly easy way now, when my system was repaired, I had a
new SSD added, a clone of the one which holds all of my potential roots,
and which has 2 EFI partitions, one of which boots NetBSD - ie: has the
NetBSD efiboot that actually gets used - but that one is connected to an
add in SATA controller, from which the firmware can't boot (it probably
could given an add on EFI driver to use, but that's above my pay grade).
All I need to do is swap the SATA cables to those two SSDs (after populating
the new one, of course) and I'll be able to boot from the new one - if that
fails, I can just swap back to get to the original one.

I know a lot of work has been done on efiboot since the one I am using, so,
before resorting to attempt to work out why it is unable to find any boot.cfg
for me (I don't really care which it finds, I can work with any of them) I
want to make sure the current version is still failing for me, and if so,
work out why (if not already fixed, it is probably something trivial).

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-24 Thread Robert Elz
Date:Sun, 24 Sep 2023 17:41:30 +
From:Taylor R Campbell 
Message-ID:  <20230924174130.481dd60...@jupiter.mumble.net>


  | Why would bootme be usually set on the EFI system partition?
  |
  | The documentation in gpt(8) needs to be clarified -- and I'm not sure
  | there's any other canonical reference about it in any of our
  | documentation -- but it sounds to me like it is supposed to be:
  |
  | (a) where NetBSD's efiboot finds the kernel, and/or
  | (b) what root partition the kernel will use.
  |
  | It's not clear what else the bootme flag would be for,

I'd always assumed it to be where efiboot should locate boot.cfg.
Where the kernel and root filesystems are located are in boot.cfg.

Note there isn't necessarily a "the" EFI system partition, there may
be more than one of them on the same drive, and I am by no means convinced
that our current efiboot (at least) is able to parse the variables
available to it that tell it exactly where it was loaded from.

  | and it seems
  | that unless boot.cfg instructs the bootloader to pass a different root
  | device to the kernel with the `root' command, the same partition that
  | was used to find the kernel is a reasonable default choice of root
  | device.

Yes, but to me at least, that's way beyond anything bootme might be
used for.

  | The bootme flag is certainly not used to tell the machine firmware where
  | to find efiboot -- it is vendor-specific, a BSDism.

bootme is from FreeBSD originally, I believe, yes, but where the firmware
finds efiboot shouldn't be vendor specific - it should be fully specified
by EFI variables; we just don't bother to do that yet, and fall back on
the vendor code magically locating our efiboot code (which is why we're
forced to call it by that magic blessed name).

  | It's not obviously where efiboot finds boot.cfg, since that's in
  | esp:/EFI/NetBSD/boot.cfg or,

And we correctly interpret that, always?   For me at least, nowhere I
put boot.cfg seems to work, all I ever get is the efiboot compiled-in
default (the one that has the ascii-art flag, which your typical boot.cfg
file doesn't have - certainly none of mine do, and I have boot.cfg files
sprinkled all over the filesystem, each with its own unique banner, so if
one is ever used, I'll know which).

  | if not there, whatever parsebootconf
  | resolves unqualified `boot.cfg' into -- which may be the
  | bootme-flagged partition but it's not clear to me in a cursory search
  | that it has even looked for a bootme-flagged partition at the point it
  | needs to resolve boot.cfg.

That's entirely possible - in which case it probably needs to be fixed.

  | Whatever the purpose is, we need to have it documented clearly;

Agreed.

  | right now I can't share kre's certainty about what it is _not_ to be used
  | for.

That comes from there being other mechanisms to specify where root is,
and where the kernel is to come from - we don't need alternate far less
powerful mechanisms for that, which would only serve to confuse things.
Even if there's no boot.cfg file, efiboot has its built in defaults, which
are more useful than a single bit somewhere.

Then there's also the question of non EFI boots of a GPT partitioned drive.

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-19 Thread Robert Elz
Date:Mon, 18 Sep 2023 19:21:09 +0200
From:Martin Husemann 
Message-ID:  <20230918172109.ga4...@mail.duskware.de>

  | A fallback similar to the current implementation picking the first non-swap
  | partition would be useful.

The first netbsd-style partition (not just non-swap) - no point trying
to make an NTFS or something into a root.

For GPT we should probably create a ROOT (with a longer name) flag, and
look first for the first partition with that set, before falling back to
the first.   But BOOTME is not the right flag to use, leave it for where
the boot info is found.

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-17 Thread Robert Elz
Date:Sat, 16 Sep 2023 05:01:00 +
From:Emmanuel Dreyfus 
Message-ID:  

  | Initial proposal was to add access to the bootme flag in dkwedge,
  | which has been considered bad design, and I agreed with that.

Yes, but that's not what was really wrong with the proposal.

  | The patch moves the GPT parser out of dkwedge so that it can
  | be used by other kernel components.

That part of it is fine.

  | dkwedge uses it to do the
  | job it was doing before, and raidframe uses it to solve a
  | raidframe-specific issue.
  |
  | Hence this is already "handled some other way" as once proposed.

No it isn't, I have looked at the latest patch now, and it contains:

Index: sys/dev/raidframe/rf_netbsdkintf.c
===
RCS file: /cvsroot/src/sys/dev/raidframe/rf_netbsdkintf.c,v
retrieving revision 1.412
diff -U4 -r1.412 rf_netbsdkintf.c
--- sys/dev/raidframe/rf_netbsdkintf.c	15 Jun 2023 09:15:54 -0000	1.412
+++ sys/dev/raidframe/rf_netbsdkintf.c	14 Sep 2023 06:31:23 -0000
[... some deleted]
+static int
+rf_gptroot_cb(struct gpt_ent *ent, int partnum, void *data)
+{
+   struct rf_gptroot_ctx *ctx = data;
+   static const struct uuid ent_type_ffs = GPT_ENT_TYPE_NETBSD_FFS;
+   struct uuid ptype_guid;
+
+   if (le64toh(ent->ent_attr) & GPT_ENT_ATTR_BOOTME) {
[... more deleted]

That is what you MUST NOT do - BOOTME has nothing whatever to do
with what is root.   That's the part that must be done some other way.

(The bit where the flag was copied into the wedge info was just a
layer violation, and easy to avoid, as your patch showed, that was
never the real issue.)

Fortunately, it seems (as demonstrated by later discussion) that the
"other way" already exists, and none of this is needed at all.

kre



Re: GPT attributes in dkwedge [PATCH]

2023-09-15 Thread Robert Elz
Date:Fri, 15 Sep 2023 22:46:47 +
From:Emmanuel Dreyfus 
Message-ID:  

  | You noted that latest patch does not introduce bootme stuff
  | into dkwedge code, right?

I didn't attempt to read the patch, no.   Just the regular text
parts of the mail thread.

kre


Re: GPT attributes in dkwedge

2023-09-15 Thread Robert Elz
Date:Tue, 12 Sep 2023 07:21:10 +
From:Emmanuel Dreyfus 
Message-ID:  

  | Context: if a RAIDframe set contains a GPT, it does not honour the
  | bootme attribute when looking for the root partition. The current
  | behavior hardcodes the use of the first partition.

Does it, really?    I have been using a setup (since NetBSD 6 days)
with root in a GPT in a raid array, and for me, if the raid array is
raidN (for me, it happens to be raid7 for historical reasons, most of
raid0..6 no longer exist) then the raidframe code looked for a partition
labelled raidNa (in my case raid7a) to be the root.

Those names (the NAME= things in userland) are attached to wedges, and
can be located easily, and match what happens when a disklabel is
configured to generate wedges (though in that case it would be assuming
that the 'a' partition from the disklabel, inside the raid, is the one
you'd want to be root - but that matches what regular disklabel booting
assumes I think, it has been a long time since I booted that way).
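
For example (partition index here is a guess - adjust to taste), such a
label can be applied with gpt(8):

    gpt label -i 2 -l raid7a raid7

after which the wedge for that partition carries the name raid7a, and
(with the hack described above) is what gets chosen as root.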

If the raidNa hack has vanished sometime in the past decade or so, then
perhaps we should just restore it?    If it hasn't, just document it
(which I think is one thing which was never done, I discovered it by
code reading).

If there is no GPT partition labelled raidNa in raidN (the raidframe to
hold the root) then some fallback is needed, and picking the first partition
is as good as anything else - you'd need a fallback if you ended up using
the bootme flag (please don't), in case none of the partitions has that set,
and again, the first partition seems reasonable.

kre

ps: it is possible to invent new GPT flags (system-dependent ones - which
would only be interpreted when the GUID of the partition is generated by
a system which understands and defines that flag for the purpose - simply
ignoring the unknown flag bit on a partition from any unknown creator).




Re: GPT attributes in dkwedge [PATCH]

2023-09-15 Thread Robert Elz
Date:Fri, 15 Sep 2023 15:15:10 +
From:Emmanuel Dreyfus 
Message-ID:  

  | The user took care of setting bootme so that bootstrap finds
  | the kernel, and we should disregard this explicit setting
  | when mounting root?

I agree with others: where you boot from has absolutely nothing to
do with what is the root partition (except by coincidence).   Nothing
except the boot code should ever look at bootme flags (nor bootonce,
which we don't support yet I think, or whatever the other related one
is that we also don't support).   All of that is for finding and loading
/boot (under whatever name the boot code in question finds it).   The
bootme flag on my systems is on the EFI partition, never anywhere else.

What happens after that needs to be handled some other way.

kre



Re: [PATCH] style(5): No struct typedefs

2023-07-11 Thread Robert Elz
I agree with some of what you are proposing, but disagree with much
of it.

Certainly simplifying our header file mess is important, but that's
not going to happen overnight.   One particular example of that below.

And using opaque struct definitions where possible, rather than
void * in particular, is certainly the right thing to do - I suspect
that the reason we don't have more of that now is just
based upon the age of the original NetBSD code, compared to when
opaque structs became available - and then the tendency to copy
existing code styles, rather than "breaking tradition".

But I 100% disagree with the notion of declaring those opaque
struct types over and over again in each individual source file.
Declarations should appear exactly once - either in a source file
if the object concerned is used only in that file.   Otherwise in
a header file.   By all means in a header file which doesn't do
much else - we used to avoid such things, as the cost of opening
and reading (over and over again) lots of tiny header files was
getting to be a much too large fraction of the overall compilation
time, but compilers have gotten a lot smarter at doing that, and
a lot slower at doing other parts of the compilation, so this
shouldn't really be an issue any more.

I'm not even sure it is practically possible (in many cases anyway)
to do it as you're suggesting, as to use an opaque struct that way,
you're almost certainly using it as the parameter of one or more
public functions in your source file, in which case those need to
be declared in a header file, and so that header file also needs
the opaque struct definition.   Then all that is needed (which is
needed anyway) is to include that header file.

Ideally that header does little else.

I'm tempted to say that an opaque struct declaration in a .c
file ought be treated suspiciously - I thought there might be
one use, where a file is providing a public interface using
an opaque struct pointer, and then lower down the file,
the implementation of those public functions using static
functions (so no access is possible except via the public
functions) with the complete struct definition occurring between
the two halves of the code (and it would be kind of nice if C
had a way to say "all functions defined beyond this point are
static").  But even that does not need, or want, an opaque
declaration for a struct in the .c file, as that needs to be
in the header file which declares the public functions anyway,
and that needs to be included in the .c file for type checking,
even if not otherwise needed for anything useful.
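
A minimal sketch of that pattern (names hypothetical), with the opaque
declaration living only in the header:

    /* foo.h - the one place the opaque declaration appears */
    struct foo;                     /* layout deliberately not exposed */
    struct foo *foo_create(void);
    int foo_frob(struct foo *);
    void foo_destroy(struct foo *);

    /* foo.c - includes foo.h for type checking of the public half */
    #include <stdlib.h>
    #include "foo.h"

    struct foo {                    /* full definition, private to foo.c */
            int count;
    };

    struct foo *
    foo_create(void)
    {
            return calloc(1, sizeof(struct foo));
    }

    int
    foo_frob(struct foo *f)
    {
            return ++f->count;
    }

    void
    foo_destroy(struct foo *f)
    {
            free(f);
    }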

So I'd be tempted to have the style guide explicitly say not
to use opaque struct declarations in .c files - with the caveat
that, as with everything there, it is a guide, not a law, and when
appropriate, it can be ignored.

I also disagree on typedefs to structs, and while I don't particularly
like them much myself, even typedefs to pointers to structs.

Once you disabuse yourself of the idea that you can avoid using header
files by redeclaring opaque structs in .c files all over the place,
your argument against typedefs essentially evaporates - as it was,
as I understood it, largely that using a typedef requires using a
header file (true) - but since we are going to want that anyway,
might just as well have it say

struct foobar;
typedef struct foobar foo;
or
typedef struct foobar *foo_ptr;

(or both as appropriate).

One obvious reason for using typedefs in this way is when we
have a common object interface which is implemented entirely
differently on different architectures.

Eg: (and deliberately using an absurd example, to avoid people
trying to correct my misunderstandings of any real examples)
we might have an implementation-defined type "dogleash".
On some architectures the only variable property of a dogleash is
its length (which is expressed in mm, with a granularity of 1 mm,
and is no greater than 10m), so a dogleash type is an int (all
that is required).
Another architecture is much more flexible, and requires the
materials (leather, chain, woven plastic fibre, ...), the colour,
the length, and the handle and attachment types to all be
specified - and so clearly a dogleash is going to be a struct there.

Then by your proposed guidelines, since typedefs for ints are
permitted, but typedefs for structs are not, we'd end up with
one of the following two abominations all over the place (in the
MI code)

#ifdef SIMPLE_LEASH
void windup(dogleash);
#else
void windup(struct dogleash);
#endif

or

#ifdef SIMPLE_LEASH
#define leashtype
#else
#define leashtype struct
#endif

void windup(leashtype dogleash);


and while the second of those is more pleasant to look at,
in isolation, if actually done that way it would quickly
become a nightmare to maintain - particularly if the most
commonly used implementation architectures (the ones people
mostly code and develop using) are the SIMPLE_LEASH type,
where leaving out the  "leashtype" word changes nothing
(a

Sanitizing (canonicalising) the block device name in mount_ffs ??

2023-05-27 Thread Robert Elz
I'm dual-posting this to tech-kern and tech-userlevel, as while it is
a userlevel issue, it could have kernel implications.   Please respect
the Reply-To and send replies only to tech-userlevel.

You may have noticed a recent change (mine) to the pathadj()
function (which converts an arbitrary path name to its canonical form).
That function is not permitted to fail, but could.   Now instead of
failing, and returning (potential) nonsense, it exits if it cannot
do what it is required to do (usually it can).  In practice this
affects nothing real.

However, it affects some uses of rump - which sets up a "block device"
in a way that its name cannot be canonicalised.   It was relying upon
the way that pathadj() happens to work (based upon how realpath(3) works)
to make things function - pathadj() was issuing an error message, which
some rump using ATF tests were simply ignoring (deliberately).

Yesterday, I was trying to find a way to make this all work - unsuccessfully.

Today I am wondering why we need to bother?    That is, not why we bother
with rump, not even why rump has to make its magic etfs work the way it
does.   But why we need to canonicalise the block device name for mount.

I have run a test with that simply removed from mount_ffs (some of the
other mount_xxx's might eventually want to follow, if we change this)
and the ATF tests that are currently failing work again.

It is rare for the block device path given to mount to be changed
by canonicalising it, so simply omitting that step
would not often make any difference at all.

Currently I am seeing 4 reasonable choices...

1) just omit the pathadj() of the block device name, and just use whatever
the user says to use, unchanged.   I doubt anything would really be affected
by this, but it does make a difference if some user were to use
/./dev/../dev/wd0e
or  wd0e

where the latter is either a symlink to /dev/wd0e, or a copy of /dev/wd0e
or $PWD==/dev and it is simply a relative path.

Is anyone aware of anything which would break if we allowed such names to
be used - the dev_t that gets used for mounting is not going to change, not
even the vnode which is used - just the path name used to get to there ?

2) we could prohibit relative paths, or paths containing '.' or '..'
components - simply check the path and refuse the mount.

3) we could apply pathadj() (as it currently is) to the paths which choice 2
would prohibit (which won't affect the rump using ATF tests, which don't do
that).

4) we could change the pathadj() of the block device name to instead simply
call realpath(3), use the result if that succeeds (which is what happens now,
and in the past), but simply use the user's arg if it fails (which is what
will happen in the ATF test cases - using the original path is what is
needed there.)   Possibly issue a warning the way that pathadj() does.
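
A minimal sketch of choice 4 (function name hypothetical, and the warning
kept in the spirit of what pathadj() does now):

    #include <limits.h>
    #include <stdlib.h>
    #include <err.h>

    static const char *
    pathadj_soft(const char *arg, char resolved[PATH_MAX])
    {
            if (realpath(arg, resolved) != NULL)
                    return resolved;        /* canonical form */
            warn("can't canonicalise `%s', using it as given", arg);
            return arg;                     /* fall back to user's path */
    }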

I'd appreciate opinions, particularly if anyone knows of any reason that
any of these would be inappropriate as a solution.

kre



Re: Per-descriptor state

2023-04-30 Thread Robert Elz
Date:Sun, 30 Apr 2023 05:25:41 +
From:David Holland 
Message-ID:  

  | Close-on-fork is apparently either coming or already here, not sure
  | which, but it's also per-descriptor.

We don't have it, but it will be in Posix-8.   Largely inspired by the
needs of threaded programs (without lots of critical sections, one cannot
otherwise open anything if another thread might fork, there's no
way to avoid race conditions, hence O_CLOFORK on open ... not sure if
anyone has thought of a way to add it to socket() - that doesn't look
to be trivial, though it might be possible to abuse one of the params
it has - probably domain - and add flags in upper bits ... while having
it able to be set/reset via fcntl is useful, to work, it needs to be
able to be set atomically with the operation that creates the fd, and
having it default "on", which would work, would break almost all existing
non-trivial code).
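
As a sketch of the open() side (O_CLOFORK is not in NetBSD yet, so this
is guarded; the fallback define gives no protection, it just compiles):

    #include <fcntl.h>

    #ifndef O_CLOFORK
    #define O_CLOFORK 0     /* placeholder where not yet supported */
    #endif

    int
    open_private(const char *path)
    {
            /* the fd never leaks into a child of a concurrent fork() */
            return open(path, O_RDONLY | O_CLOEXEC | O_CLOFORK);
    }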

  | But I kind of think it'd be preferable to make a way to
  | clone a second independent struct file for the same socket than to
  | start mucking with per-descriptor state.

When I saw mouse's message, I was thinking the exact same thing,
and it should be easy to extend dup3() to make that possible - however
I'm not sure what effects that might have on the semantics of sockets
(what assumptions the current code might make about there being only
one struct file, with all it contains, for a socket).

kre



Re: flock(2): locking against itself?

2023-03-19 Thread Robert Elz
Date:Sun, 19 Mar 2023 07:05:52 +
From:David Holland 
Message-ID:  

  | "They're per-open"

That's not bad for this level of description.

  | ...which is not actually difficult to understand since it's the same
  | as the seek pointer behavior; that is, seek pointers are per-open too.

and almost all the other transient attributes; that is distinct
from stable attributes like owner, group, and permissions, which are
inode attributes.  In our current system I think just close-on-exec
is a per-fd (transient) attribute, though if we follow linux (I think)
and soon-to-be POSIX, and add close-on-fork, that would be another.

But this doesn't help with initial understanding - there is nothing
fundamental that requires things to be this way; both seek pointers
and locks could be per fd, or per inode (transitory, or long term).
They just aren't.   (There is no need to explain why, I know why,
but newcomers, not that mouse is one, don't necessarily).

Then to add to all of that we have fcntl locks, with their process
lifetime nonsense integrated, making a mess of the model - they're
basically per-open, but with that (kind of) per-fd complication added
on top.

kre


Re: flock(2): locking against itself?

2023-03-19 Thread Robert Elz
Date:Sat, 18 Mar 2023 19:46:17 -0400 (EDT)
From:Mouse 
Message-ID:  <202303182346.taa01...@stone.rodents-montreal.org>

  | Except they aren't.  They're on open file table entries, something
  | remarkably difficult to describe in a way that doesn't just refer to
  | the kernel-internal mechanism behind it

Yes.  The terminology in this area really sucks, but that's why
I mentioned 'kernel file*' in my message.   POSIX distinguishes
file descriptors and file descriptions, but you have to be reading
very carefully to even notice the difference - ok for a standards
doc perhaps, not for a man page or e-mail message.

Given the lack of well understood terminology, it is not easy to
do better.  That, I assume, is what led to that paragraph in the
NOTES section - an attempt to explain better just where the locks
fit, without getting into kernel internals (the access model the
kernel provides, that is, fd, file*, vnode, data, really needs to
be understood in order to do any non-trivial file related
operations).

  | If they were truly on files, rather than open file
  | table entries, then it wouldn't matter whether my test program opened
  | the file once or twice, since it's the same file either way.

There you're thinking of vnode (or since it is a file, perhaps
more accurately, inode) operations.  The only operations possible
on the actual file are read/write/truncate.

The terminology sucks.   It has done for 50 years now, and in
that time nothing better has ever caught on, so hoping for
something tomorrow is probably forlorn.

  | Hm, okay, I can see how the second flock call in my test was taken as
  | an attempt to equalgrade

aka no-op.  Yes.

kre


Re: flock(2): locking against itself?

2023-03-18 Thread Robert Elz
Date:Sat, 18 Mar 2023 11:32:37 -0400 (EDT)
From:Mouse 
Message-ID:  <202303181532.laa29...@stone.rodents-montreal.org>

  | On examination, the manpages available to me (including the one at
  | http://man.netbsd.org/flock.2) turn out to say nothing to clarify this.

The man page (including the one on the web that you referenced) starts out:

 flock() applies or removes an advisory lock on the file associated with
 the file descriptor fd.

and then lower down, under NOTES:

 Locks are on files, not file descriptors.  That is, file descriptors
 duplicated through dup(2) or fork(2) do not result in multiple instances
 of a lock, but rather multiple references to a single lock.  If a process
 holding a lock on a file forks and the child explicitly unlocks the file,
 the parent will lose its lock.

Applying flock() to a file already locked (via the same kernel file*)
is an attempt to upgrade, or downgrade (including unlock) the lock,
and would only block on an upgrade attempt if some other file*
referencing the same file (ie: the product of a distinct open() etc)
is holding a lock that blocks the upgrade.
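
A small sketch (mine) showing both behaviours - two fds from one open()
share a single lock, while a second open() of the same file contends:

    #include <sys/file.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd1 = open("lockfile", O_RDWR | O_CREAT, 0644);
            int fd2 = dup(fd1);                  /* same open file table entry */
            int fd3 = open("lockfile", O_RDWR);  /* a distinct open */

            flock(fd1, LOCK_EX);
            flock(fd2, LOCK_EX);                 /* same lock: no-op, no block */
            if (flock(fd3, LOCK_EX | LOCK_NB) == -1)
                    printf("fd3 blocked, as expected\n");
            flock(fd2, LOCK_UN);                 /* releases fd1's lock too */
            return 0;
    }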

  | Is this expected behaviour, or is it a bug?

Expected

Always been this way.

kre



Re: proposed cpuctl modification

2023-03-09 Thread Robert Elz
Date:Thu, 9 Mar 2023 16:21:53 +0900
From:Masanobu SAITOH 
Message-ID:  <38ae66bd-1b37-c0ef-5a43-52e0c0a2a...@execsw.org>

  | Alder Lake-N? 4 E-cores share one microcode image. I have i7-12700 and it
  | has 4 E-cores. Those 4 cores share one microcode image.

Mine is an i9-12900KS which has 8 of them (2 groups of 4).

Thanks for the confirmation, that is what looked to be happening, but
I was just guessing from what I observed.  I just use intel processors
(and others on occasion) - I don't even pretend to understand them.

  | I think your idea is the best. Thank you for your commit.

No problem.  It was not a difficult change to make!

  | Another solution is that the kernel returns 0 instead of EEXIST if the
  | version number is the same as the running microcode's version.

Yes, I considered that one as well, but as you indicate, doing that just
loses information, and gains nothing - the same number of sys calls (ioctl's)
would be performed, all that would be saved is the check to see if the
error is EEXIST when that happens (ie: peanuts).

kre

ps: do your E-cores ever just turn themselves off?   On mine, occasionally,
and for no reason I can fathom, the BIOS reports there are none of them
(and NetBSD doesn't see them either).  They come back after a power cycle.
This is probably a BIOS issue, but ?




Re: Late MCU, was: proposed cpuctl modification

2023-03-04 Thread Robert Elz
Date:Fri, 3 Mar 2023 21:46:22 -0800
From:"William 'Cryo' Coldwell" 
Message-ID:  <7ce92f54-3746-4106-bd63-16e5e4cbc...@netbsd.org>

  | To throw some extra fun mixture into this discussion:  As of 5.19
  | Linux will no longer allow late microcode loading:
  |
  | ref: https://www.phoronix.com/news/Linux-5.19-Late-ucode-Loading

I suspect this message was not really aimed at me, but at tech-kern.
My only specific knowledge about any of this is that I dislike spurious
(seeming) error messages during the boot sequence - and so intend to
make them go away (as there has been no objection, I will commit that
change soon).

But:
  | Any thoughts on the best approach to this? (boot? EFI?)

Best?  No idea.   But one approach might be to only start cpu0 in the
kernel during bootstrap, and then have the rest started by an rc.d
script, which could update microcode on them (if needed) first.

By moving cpuctl from /usr/sbin to /sbin and placing the firmware in
/libdata instead of /usr/pkg that could be made to happen very early
in the boot sequence (perhaps even before the fsck of / and rw mount).

I'm not sure how that would work wrt other things that have to happen
(like arranging interrupt routing) - as clearly the microcode needs to
be read from the filesystem (whether that's done by the kernel, as now, or
by cpuctl passing a memory image of it to the kernel, doesn't seem
as if it would make a lot of difference).

Longer term this could be coupled with more userland control of the
cpu configuration - all being done at once as part of the startup
sequence, but from userland code.

Whatever the issues really are (according to the Linux people) with
doing the microcode update as we now do it, even assuming that is more
or less the same as they do it, this should be safe, as code running
on a CPU has to do it, I don't see it can make any real difference
whether than is bios code, boot code, or early kernel code.

I don't much like the idea of extra magic blobs, or more hackery in
the boot code however.

kre



Re: proposed cpuctl modification

2023-03-03 Thread Robert Elz
Date:Fri, 03 Mar 2023 14:04:39 +1100
From:matthew green 
Message-ID:  <12620.1677812...@splode.eterna.com.au>

  | duh.  this is user error.

Oh.   Double duh...   "Me too".

kre



Re: proposed cpuctl modification

2023-03-02 Thread Robert Elz
And a correction: I missed uses of -v in arch/* where it seems to apply
only to output from "cpuctl identify", and mostly on aarch64 processors
(seems to be very little change on x86, and no changes at all elsewhere -
arm (32), sparc, sparc64 - and definitely nothing on anything else).

kre




Re: proposed cpuctl modification

2023-03-02 Thread Robert Elz
Date:Fri, 03 Mar 2023 09:25:29 +1100
From:matthew green 
Message-ID:  <14071.1677795...@splode.eterna.com.au>

  | we should do this as well, it should be fairly simple.  we already
  | display the relevant info in "cpuctl identify 0" eg:

Yes, identify shows all of the relevant info (including ucode version)
so all of this can certainly be done - it was just more than I wanted
to do to just suppress the annoying error messages...

  | hmm, someone seems to have broken this recently?

Now you point it out, I am seeing that too.

  | that's odd, but we should handle it saner.

My processor is odd - but it is very fast (in some of the cores anyway).

  | this seems odd to me.  verbose should print additional info
  | but it shouldn't change the actual behaviour here.

When -v is given, the code is intended to operate exactly as it did
before (it will execute a few more instructions, but the effect should
be identical).

  | eg, if i were to use this with -v on your CPU, i'd want to
  | have it show it working on cpu0 and cpu4, and EEXIST for
  | the rest, and this appears that it will exit after cpu1.

That is what it has always done.   It exits after cpu0 as well.
And every other cpu - it only gets asked to update one cpu at a time.
In the Intel case, cpuctl is run on each cpu separately, rather than
attempting to update all of them in one run (I'm not even sure that is
possible, at least not with the code as it is now).

It has never printed anything (with or without -v) for a
successful microcode update - before my change, the -v flag
appears to do nothing at all, so I wasn't too worried about
borrowing it.   The kernel logs successful ucode updates however.

All the change does, is ignore the EEXIST (and exit 0, not that the
exit status seems to matter to anything, at least as used in rc.d)
when -v is not given (which rc.d does not do), rather than printing
an error and doing exit(1).

I don't have an AMD processor, where things are done differently,
to test this change on though.

kre



proposed cpuctl modification

2023-03-02 Thread Robert Elz
This message is about a proposed userland modification, but it seems
more kernelish to me, hence I am asking here on tech-kern, rather than
on tech-userlevel.

When my system boots (intel cpu) it runs the intel-microcode (from pkgsrc)
microcode update.

Since it is an Intel cpu, that means running the ioctl to perform the
microcode update on every core.   The cpu has hyperthreading, and while
I sometimes am inclined to turn that off, I haven't so far.

Naturally, as they're really the same core, while the base core ucode
update works fine, the hyperthread companion always fails, as there the
microcode is already up to the expected version (surprise surprise, since
we just did that).    I guess we could look and skip the update on
the hyperthread companion cores (pseudo-cores) but that's not what I am
proposing, partly because I'd have to work out how to do that, but also
because that by itself would only solve half the problem.

My processor also has 8 cores, with no hyperthreading, where it looks as
if internally, there's just 2 sets of ucode, one for the first 4, and one
for the second 4.   Updates of the other 3 of each group find the ucode
version already installed, and error out.

So, what I am proposing is to have cpuctl simply ignore EEXIST errors from
the "update the microcode" ioctl, unless the -v option (which already
exists) is given.

Note that this isn't fixing any real problem - everything works as it is now,
and everything (from what I can tell) gets updated correctly.   The issue is
more cosmetic - the ucode update currently issues error messages for more than
half of my (apparent) cores, which looks ugly during the boot process, and
could lead to people wondering if there is a problem.

When I first installed the package from pkgsrc to make this happen (a while
ago now) I immediately disabled it (in rc.conf) so it ran no more, as the
microcode version offered by the package was what my CPU came equipped with,
and every core complained about the version not actually being updated.

pkgsrc was updated to a newer version, with newer microcode; I installed
that, and enabled it again - and now I'm stuck with the ugly errors.

Perhaps someone has a better solution than this, perhaps checking which
microcode version is installed on each core, and not attempting to install
the update if it is the same version, but that's beyond my knowledge.
It also seems likely to me that it is simpler to just install anyway, and
ignore the "already there" error if it happens.   Probably.

My proposed patch is appended - it builds, installs, and seems to work
just fine (not too much of a surprise, it is a trivial change).

Opinions?

kre

Index: cpuctl.8
===
RCS file: /cvsroot/src/usr.sbin/cpuctl/cpuctl.8,v
retrieving revision 1.20
diff -u -r1.20 cpuctl.8
--- cpuctl.8	17 May 2019 23:51:35 -0000	1.20
+++ cpuctl.8	2 Mar 2023 21:36:06 -0000
@@ -72,6 +72,10 @@
 .Op Ar file
 .Xc
 This applies the microcode patch to CPUs.
+Unless
+.Fl v
+was given, errors indicating that the microcode
+already exists on the CPU in question are ignored.
 If
 .Ar cpu
 is not specified or \-1, all CPUs are updated.
Index: cpuctl.c
===
RCS file: /cvsroot/src/usr.sbin/cpuctl/cpuctl.c,v
retrieving revision 1.32
diff -u -r1.32 cpuctl.c
--- cpuctl.c	1 Feb 2022 10:45:02 -0000	1.32
+++ cpuctl.c	2 Mar 2023 21:36:06 -0000
@@ -247,7 +247,7 @@
cpuset_destroy(cpuset);
}
error = ioctl(fd, IOC_CPU_UCODE_APPLY, &uc);
-   if (error < 0) {
+   if (error < 0 && (verbose || errno != EEXIST)) {
if (uc.fwname[0])
err(EXIT_FAILURE, "%s", uc.fwname);
else




Re: Potential Improvements on Lua Support?

2023-03-01 Thread Robert Elz
Date:Wed, 1 Mar 2023 12:44:08 -0600
From:Qingyao Sun 
Message-ID:  <53774732-c592-43f5-af0f-8a1f6bb03...@icloud.com>

  | Also I am using the @icloud address hereafter as per kre’s preference.

It is not so much preference, as that you simply would never receive a
reply from me at a gmail address, so it is pointless me ever bothering
to generate one.   That would make any kind of mentoring kind of difficult.

  | It just came to me that we can do in-kernel machine learning

I suspect it is way too early to be making any decisions about implementation
techniques - first (assuming suitable mentors are available) we need
objectives, to know what the end result should be capable of.

  | That sounds interesting as well. Many ARM chips seem to have that
  | big.LITTLE technology, including Apple Silicon.

The x86 processor I am using to generate this mail is like that as well.
Ignoring hyperthreading (which half the cores can do), it has cores of at
least 3 different capabilities (I'd need to go check more doc to see if
there might be more than that) - and the difference isn't trivial.

Hence my interest in that one in particular, but I also care about I/O
performance, which doesn't seem to be quite optimal at the minute.

  | Thanks for your help! Let's wait for some Lua wizard to jump in then.

Yes, though wizardry is probably not required, just knowledge.

kre



Re: Potential Improvements on Lua Support?

2023-03-01 Thread Robert Elz


I like this idea a lot more than the inetd/rc.d idea.   I am not sure that
"Improvements on Lua Support" is a good title, unless you're actually planning
on working on the kernel Lua implementation, and it doesn't sound like that.

I'd suggest something more like "Using Lua scripts to improve kernel
heuristics" or something like that.   That is, if that is what you are
really proposing: using your machine learning knowledge in areas where
it might benefit the kernel.

I cannot help with Lua - I know it exists, but that's the extent of my
knowledge on that front.  I know something about filesystems, but not the
details of our current implementation.  I would however like to see this
idea applied to CPU scheduling, particularly when there are different
classes of CPUs available (ie: they're not all identical to each other -
while any can do any work, they don't all do it equally well).   I think
we're currently lacking in that area.

For filesystems, my gut feeling is that a better task to take on might be
working out which cached data ought be ejected, and which retained, rather
than (or perhaps just before) working on prefetching algorithms (particularly
as "drives" get much faster, and delays to fetch blocks are reducing 
considerably).

So, I'm prepared to help with this, with two caveats.   First, you need
to also find someone knowledgeable (enough) with the kernel Lua
implementation and its use to assist with that (and Lua scripting) who is
willing, and available, to assist with this.  Some occasional
filesystem/uvm and scheduler assistance might be useful as well.  And
second, you need to receive e-mail somewhere other than at gmail - gmail
refuses mail from me (mail I normally send, rather than mail this way,
from a NetBSD host, which is not at all convenient to send), and even
forwarding mail from me to them from some other mailbox is likely to
fail.   I have zero interest in bowing to gmail's stupid requirements to
avoid that problem, so if I am to assist you need to be able to receive
mail somewhere else.   Probably at UT I'd guess.

kre


Re: NetBSD 10.0 BETA kernel testing: framebuffer

2023-01-22 Thread Robert Elz
Date:Sun, 22 Jan 2023 20:27:24 +0100
From:tlaro...@polynum.com
Message-ID:  


  | +Zone  kernel: Available graphics memory: 9007199254079374 KiB

I see something like that too, but while it is obviously absurd,
I'm not sure that it actually does any harm (maybe) - my system
mostly works -- though I am still using wsfb - the last time I
tried to start X with nouveau and no X server config at all
(a week or so ago) the kernel crashed very soon after.

In every case I have looked at, that big number has been (when converted
to bytes, which is what the actual value being printed is - the output simply
divides by 2^10 (ie: >>10) for our convenience) a value of the same
general form, in your case

   9007199254079374 KiB == 9223372036177278976 bytes == 0x7FFFFFFFD79E3800

To me that suggests that probably something has a 64 bit value set to
MAXINT, and then writes a 32 bit value on top of it (and then treats that
as a 64 bit value).   The top 32 bits being 0x7FFFFFFF seem always to be there.
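
To illustrate (just a sketch, nothing from any actual driver, and the
variable name is invented): on a little-endian machine the following
type-punning store produces exactly this pattern.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t avail = INT64_MAX;	/* 0x7FFFFFFFFFFFFFFF */

	/* a 32 bit store on top of the low half of the 64 bit value */
	*(uint32_t *)&avail = 0xD79E3800;

	/* on little-endian hardware this prints 0x7fffffffd79e3800 */
	printf("%#jx\n", (uintmax_t)avail);
	return 0;
}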

It could also be doing a read of a 64 bit value from hardware, where
most (or all) of the upper 32 bits don't really exist, and simply float,
which isn't being masked - but it seems very unlikely an issue like that
would affect multiple different graphics board types (from different
manufacturers).

I took a quick look in the kernel, and while I could find where this
value exists, and is printed, attempting to track down what sets it
eluded me.   It looks to be via a function referenced by a structure,
but I couldn't find anything which looked like it might be calling it
(it may be hidden in a macro or something.)

Since the same thing happens with all different video drivers, it is
unlikely to be in those (though it could be a common, cut&paste buggy
code type issue).

kre



Good news from POSIX (sanity, finally, in one area)

2022-10-27 Thread Robert Elz
POSIX is finally removing the inane requirement that all of the
standard utilities built into the shell (except for the special
builtin utilities - those are the ones like break, :, set, ...) also be
implemented as file system commands.

We have never bothered with that requirement, and have consistently
refused to do so (even though handling it is, and always has been,
relatively trivial - at least on systems supporting #! executable
scripts) as the requirement was so completely useless.

For the built in commands this matters to (cd, umask, ...) this
requirement is now being deleted; the only command that's generally
required to be built in to work fully, and which will still require a file
system equivalent, is kill(1), which isn't a problem at all (having
that one available to exec can actually be useful, and it has also
always been available that way).

See:
https://austingroupbugs.net/view.php?id=1600

This isn't yet in any draft (not even an unavailable one), but will be.

kre

ps: of course, removing the requirement doesn't preclude systems from
providing any (or all) of the relevant commands as file system (execable)
commands, if they want to.




Re: #pragma once

2022-10-24 Thread Robert Elz
Date:Sun, 23 Oct 2022 09:50:20 +
From:Taylor R Campbell 
Message-ID:  <20221023095027.eb8ef60...@jupiter.mumble.net>

  | I wasn't able to find a clear statement of the semantics anywhere:
  | Is it keyed by (dev,ino), by pathname, by some kind of normalized
  | pathname, by file content?  gcc's own documentation is very sparse and
  | just describes it as nonportable.

I had not weighed in on this before, but that issue was the biggest
problem I could see with this.

That, and that using guards allows having mutually exclusive (but
different) include files ... though of course they could always be
used for any such cases even if the #pragma scheme became the normal
way.
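
For anyone who hasn't seen the trick, a sketch (header and macro names
invented here): two alternative headers deliberately share one guard
macro, so whichever is included first locks the other out - something
#pragma once has no way to express.

/* foo_v1.h */
#ifndef FOO_H
#define FOO_H
int	foo(int);
#endif	/* FOO_H */

/* foo_v2.h - same guard macro, different (incompatible) contents */
#ifndef FOO_H
#define FOO_H
long	foo(long);
#endif	/* FOO_H */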

  | So on balance,  #pragma once  doesn't seem worthwhile to me today, and
  | I think we should stick to traditional include guards for now.

I agree.

kre




Re: Can't mount root partition after rebuilding kernel with DKWEDGE_METHOD_MBR

2022-09-28 Thread Robert Elz
This isn't really a tech-kern question, but never mind.

You shouldn't need wedges at all, and I'd advise against trying
to force them to work (it is probably possible, but far more work
than you need to do).

Go back to your original kernel, and when that's running, do

disklabel wd0

On NetBSD systems, when GPT isn't used, there is *always* an in-kernel
disklabel.   When there's a disklabel on the disc, the in kernel label
is just a cached copy.   When there's an MBR but no disklabel, the MBR
is converted into disklabel format for the kernel to use (MBRs contain
a subset of what a disklabel contains, including everything important).
When there's no label of any kind, a suitable one will be synthesised.

The output from that should show one of the partitions as an ext2 type
partition.

Then (assuming that happens to be partition 'e') just do

mount -t ext2fs /dev/wd0e /mnt

(adapted as needed to your requirements; add -r for a readonly
mount, and of course, /mnt should be wherever you want it to appear).

Unfortunately even though "since ext2 is supported by NBSD" is supposed
to be true, I have come across (recent) ext2 filesystems that cannot be
handled.

kre



Re: Can version bump up to 9.99.100?

2022-09-25 Thread Robert Elz
Date:Fri, 23 Sep 2022 22:57:52 -0400
From:"David H. Gutteridge" 
Message-ID:  


  | Sometimes it's necessary to test for when a feature was added in a
  | -current release, and there's no simple or precise way to do it, as
  | you've noted. If a feature was added sometime in xx.yy.zz, then a test
  | might (retroactively) be expressed with zz+1 as the floor.

Yes, I know.   But if a feature is added then there ought also be some
better way added to test that it has happened, not the kernel version.
If that doesn't happen, people (that's us) should complain, and make it
happen.

For kernel changes, testing the kernel version has at least some kind of
rationale behind it, though the x.99.abi bump scheme doesn't generally
fit well, as the abi bumps almost never happen for new features - there
could be months (or longer) pass before the abi value is altered.

When the change isn't to the kernel, but relates to something changed in
userland, testing the kernel version is 100% useless (and often wrong).
(And yes, that includes anything that uname outputs, or any other kernel
supplied information that wasn't previously set by userland).

The really hard case is where a kernel bug (eg: an ioctl not working
correctly) is fixed, which has been worked around, but no longer needs
to be afterwards.   An example might be the O_NONBLOCK on pty master
devices that just got changed.  This kind of thing will generally not
require a kernel abi bump (and could also happen in x.n (like 9.2 over 9.1)
as well, and appear in x.n_STABLE along the way) and very often requires no
other changes other than to the code in the kernel source file(s) concerned,
so there really is nothing to test.

An idea might be to add some new sysctl var (for kernel changes only)
that gets bumped far more frequently - every time something new is added,
or some user noticeable bug gets fixed (ie: not whitespace, KNF, spelling
errors in comments, or changes to printf output) - and perhaps reset
whenever a __NetBSD_Version__ bump occurs (and this would apply in both
stable and HEAD versions), so that there is a more precise way to test for
this kind of thing (including looking to see if running a version that had
a bug temporarily imported, by knowing (with hindsight) the value of this
var before the bug was added, and when it was later fixed).

Alternatively, perhaps only reset for new netbsd-N branches (so 10.0 would
start again at 0, as soon as branched, and HEAD which will become 10.99.0
will also revert to 0 at that time) and otherwise simply both climb (an
unbroken sequence through 10.1 10.2 ... 10.17 ...) until we branch -11
where 11.0 would start at 0, and 11.99.0 would as well (but 10.whatever
would just keep climbing, until EOL).

This ought to be a potentially BIG number (something we can never run out of
in any practical situation), but just a simple integer; it probably would
never go past 6 or 7 digits, but we don't want to ever worry about the
possibility of overflow, so make it an int64_t or similar - a bump should
just be a ++.

It also ought live in a source file of its own, depending only on the header
file which declares it for the sysctl routines that export it (in some other
file, where that is doesn't matter) so changing its value is cheap, and
no-one will be reluctant to do it.

Cheap particularly wrt the cost of builds after it is altered - unlike
__NetBSD_Version__ which, being in param.h, causes almost every file in the
kernel to need to be recompiled when it is altered, even though almost nothing
in the kernel cares in the slightest what value it has (just a few things here
and there).

This tiny (3 line - one comment, one #include, one var decl - plus copyright
noise) file could even be MD, so the version numbers for one port don't
affect others (a change to an x86 private function, need not show up as a
change that is visible on sparc systems) - or we could have 2 variables,
one MI for changes that affect everyone, and one MD, for the others, and
return both of them from one sysctl.   Or this might just be too much
complexity, a single number shared by everything would also work, and
is what I think I'd prefer.
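
Something like this is all that I have in mind (a sketch only, every
name here is invented) - the sysctl glue that exports it read-only
would live in some other file, as said above, so touching this one
rebuilds almost nothing:

/* kern_featurecount.c - this would be the entire file */
#include <sys/types.h>

int64_t	kern_featurecount;	/* a bump is just kern_featurecount++ */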

Kernel commits would then note what this value has been bumped to for the
change (when the change is one that requires it - and the rule should be:
if in doubt, bump it, it is cheap).

How to test new features / fixes for userland, I'll leave to others to ponder.
(also, here, this is tech-kern).

kre



Re: 9.99.100 fallout: file(1)

2022-09-21 Thread Robert Elz
Date:Wed, 21 Sep 2022 19:33:47 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | -   if (ver_rel == 0 && ver_patch != 0) {
  | +   if (ver_maj >= 9) {

I'd suggest instead
if (ver_min == 99) {

While this issue never happened with earlier NetBSDs there's no
real reason to exclude them from the possibility that it might have.

On the other hand, there's never been a version since we introduced the
.99 concept (NetBSD 2 going on to 3?) where x.99 had anything other than
a single decimal suffix.   And we never had, and I don't think anyone
expects that we ever will have, an N.9x version of NetBSD where x != 9.
That is, ver_min never has been, and never will be, 99, other than to
indicate "on the way to ver_maj + 1".

The way you have it coded, I suspect that 9.1 binaries will appear to
be 9.1.0 instead (the ver_patch data is always appended for ver_maj >= 9).

However, I wonder why this kind of info is embedded in ELF files, what
point does that have?   Maybe it would be better to have them just say
x.99 (and forget the kernel ABI bump number) ?

kre



Re: Can version bump up to 9.99.100?

2022-09-17 Thread Robert Elz
Date:Fri, 16 Sep 2022 23:46:59 +
From:David Holland 
Message-ID:  

  | While it's possible that some of
  | these may exist, it's unlikely that there are many of them or that
  | they appear anywhere especially important.

That's all encouraging, and yet more reason to bump the version
soon (as required) so we have the 9.99.1xx series around long
enough for any issues to be found and fixed, before we're back
to 10.0 and 10.99.1 with a whole different set of issues to fix.

kre


Re: Can version bump up to 9.99.100?

2022-09-16 Thread Robert Elz
Date:Fri, 16 Sep 2022 12:59:24 -0400
From:"David H. Gutteridge" 
Message-ID:  

  | So there will be information loss there, at minimum. Whether that ends
  | up being significant at some point, I guess we can't say.

I would hope not.   That is, I am assuming (but don't know pkgsrc well
enough to be sure) that OPSYS_VERSION gets used for some kind of feature
test.   That's OK (not the ideal method - but sometimes it is the only
practical one) for major, or even minor version comparisons.  It isn't for
the 3rd field (xx) in N.99.xx for NetBSD.   That field is not changed
for feature additions, so some N.99.xx may have a particular feature,
and others not, but is changed for internal ABI alterations (which don't
necessarily affect what is visible by applications in any way at all).

Note also that this value is never changed (in the NetBSD N.99.xx case)
because of changes that occur to anything outside the kernel - so it can
never safely be used to test what version of some application or library
function might be installed.   Never.

If pkgsrc (or pkgsrc packages) are using this sensibly, then limiting
OPSYS_VERSION at 099999 for all future __NetBSD_Version__ values 9.99.x
where x >= 100 should be safe, as nothing should ever care about those
final 2 digits.

That's "if".

kre

ps: the issue I was concerned about more would occur when the kernel
version info gets embedded in a package version, and other similar things.




Re: Can version bump up to 9.99.100?

2022-09-16 Thread Robert Elz
Date:Thu, 15 Sep 2022 23:46:45 -0700 (PDT)
From:Paul Goyette 
Message-ID:  

  | The human-oriented version is used as part of the path to modules
  | directory.  Need to make sure that the modules set is properly
  | populated,

That much I had tested.

  | and that module loads find them in the directory.

but that I did not - but Kengo NAKAHARA now has,  so that's all good,
I really couldn't see how there would be a problem here (but testing
it was good) as it is all just strings (the only comparisons are of the
binary blobs).   That is, except for in pkgsrc, which is why I still
have a (very mild) concern about that one - it actually compares the
version numbers using its (until it gets changed) "Dewey" comparison
routines, and for those, 9.99.100 is uncharted territory.

kre



Re: Can version bump up to 9.99.100?

2022-09-15 Thread Robert Elz
Date:Fri, 16 Sep 2022 11:10:30 +0900
From:Kengo NAKAHARA 
Message-ID:  <90c3c46e-6668-9644-70c3-0eab2cf1c...@iij.ad.jp>


  | Hmm, I will test kernel module building before commit.

Sorry, I wasn't clear - I build everything (modules included) - I just
never actually load any modules, so I haven't tested them (my kernels have
the MODULAR option disabled).   I cannot imagine an issue, as internally
everything just uses __NetBSD_Version__ as a 32 bit (ordered) blob - the
breakdown into 9.99.100 type strings is just for us humans (and pkgsrc).

kre



Re: Can version bump up to 9.99.100?

2022-09-15 Thread Robert Elz
Date:Thu, 15 Sep 2022 17:08:52 +0900
From:Kengo NAKAHARA 
Message-ID:  <279eae4e-79f4-39c0-5279-83d5738b6...@iij.ad.jp>

  | Can version bump up to 9.99.100?  Is there anything wrong?

It can.   There are no issues with the base system (incl xsrc)
I have tested this in the past, it all just works - and that we
were going to need it sometime before the -10 branch has been
obvious for a while.

Two things I did not test were kernel modules (since I never use
them) which I highly doubt will give any problem, and should be
a trivial fix in the unlikely event there is an issue;

And pkgsrc, because when I tested I needed to revert to the then
current version number (.97 at the time I think), and that reversion
would do things to some pkgsrc version numbers that it should not
be required to deal with.

I don't have enough of a handle on the latter to guess, but if
something in pkgsrc breaks, this will provide the motivation to fix
it, which otherwise might never happen.  In case any change is
required to the parts of it in base, getting that done before the
-10 branch would be good.

So just do it.

kre


Re: Adding ESRT and EFI vars support for fwupd

2022-08-19 Thread Robert Elz
Date:Fri, 19 Aug 2022 12:40:11 +0200
From:=?UTF-8?Q?Pawe=c5=82_Cichowski?= 
Message-ID:  <56898e46-7714-200b-4528-afffddd6d...@3mdeb.com>

  | I've built the kernel and release for evbarm aarch64 from the latest 
  | sources and ran it on QEMU. Unfortunately, /dev/efi wasn't present on 
  | the system.

That's just because it is not installed by default - as it currently
exists (with no tools to use it yet committed) there is little point.

However /dev/MAKEDEV should be able to make it, just "sh /dev/MAKEDEV efi"

  | Is it linked to the issue you talked about (no conversion 
  | between EFI device paths and disks)?

No, that's missing userland code.   That is, an EFI boot variable (or
driver variable) will contain strings which (perhaps indirectly) reference
the busses, port numbers, unit numbers, using EFI terminology.

A disk name is something more like /dev/wd0e  (or NAME=wedge).

For the user interface, we'd like to be able to reference devices and
partitions using unix style names, not EFI paths.

  | How should I approach implementing it?

As above, that is all it should take (I have done it before).

  | Or is making a device node manually using mknod the case (I thought 
  | MAKEDEV.tmpl should've added it automatically)?

You could also use mknod if you prefer, /sbin/mknod /dev/efi c 361 0
Add options to set the uid/gid/perms if you need (or apply chown[/chgrp]
and chmod after, or just accept the default).

  | Additionally, what ways to debug kernel drivers would you recommend? 
  | `printk` or `aprint_debug`?

Depends what the issue is, but in general, while it does need some minor
work, that driver works, I think, for as much as it does.


  | - get table - returns table address by uuid (efi_ops)
  | - copy table - copies the table from memory to a variable passed by 
  | reference (efi_ops)
  | - get table size - helper function, returns size of table in bytes
  | - other operations on efi vars are a second priority, since the main 
  | task is to support ESRT

That all sounds reasonable to me.

  | I reckon these not only need to be added to /dev/efi, but efi_runtime 
  | too (so they become machine dependent). If you have a different view or 
  | any other ideas to extend the implementation please let me know.

That I will leave for Jared or someone else to comment on.

kre




Re: Anyone recall the dreaded tstile issue?

2022-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2022 11:24:46 +0100
From:Patrick Welche 
Message-ID:  

  | Having not seen the dreaded turnstile issue in ages, a NetBSD-9.99.99/amd64
  | got stuck on shutdown last night with:

How long did you wait?

I have seen situations where it takes 10-15 mins to sync everything to
drives (I have plenty of RAM available) - which I think is another
problem - but not this one.

It isn't the case that every time we find something "stuck" on a tstile
wait that the system is broken - they're just locks, sometimes processes
are going to need to wait for one.

In the kind of scenario described, things like sync and halt will need
to wait for all the filesystems to be flushed - if that's going to take
a long time (which it really shouldn't, but that's a different issue)
then it is going to take a long time.

The other day I managed to crash my system (my fault, though really what
I did - yanking a USB drive mid write - shouldn't really cause a crash,
just mangled data) in the middle of the afternoon.   It rebooted easily
enough, wapbl replaying logs kept all the filesystems safe enough (I think
the drive I pulled needs a little more attention, but that's a different
problem) but then I discovered that files I had written about 02:00 in the
morning (more than 12 hours earlier) were all full of zeroes - the data
must have been sitting in RAM all that time, and nothing had bothered to
send it to the drive.   That's not good ...   We also seem to no longer
have the ancient update(8) which used to issue a sync every 30 secs, to
attempt to minimize this kind of problem.

kre



Re: Anyone recall the dreaded tstile issue?

2022-07-16 Thread Robert Elz
Date:Sat, 16 Jul 2022 00:48:59 -0400 (EDT)
From:Mouse 
Message-ID:  <202207160448.aaa09...@stone.rodents-montreal.org>

  | That's what I was trying to do with my looking at "X is tstiled waiting
  | for Y, who is tstiled waiting for Z, who is..." and looking at the
  | non-tstiled process(se) at the ends of those chains.

That can sometimes help, but this is a difficult issue to debug, as
often the offender is long gone before anyone notices.

  | My best guess at the moment is that there is a deadlock loop where
  | something tries to touch the puffs filesystem, the user process forks a
  | child as part of that operation, and the child gets locked up trying to
  | access the puffs filesystem.

That is possible, as is the case where locking is carried out
improperly (I lock a then try to lock b, you lock b then try to
lock a) - but those are the easier cases to find.

A more common case, I believe, is

func()
{
lock(something);
/*
 * do some work
 */
 if (test for something strange) {
/*
 * this should not happen
 */
return EINVAL;
}
/*
 * more stuff
 */
 unlock(something);
 return answer;
}

where I am sure you can see what's missing in this short segment ...  real
code is typically much messier, and the locks not always that explicit,
they can be acquired/released as side effects of other function calls.
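
For reference, one common idiom that avoids the leak (in the same
pseudo-C as above) funnels every return through the unlock:

func()
{
	int error;

	lock(something);
	/*
	 * do some work
	 */
	if (test for something strange) {
		error = EINVAL;
		goto out;	/* never return with the lock held */
	}
	/*
	 * more stuff
	 */
	error = answer;
out:
	unlock(something);
	return error;
}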

The function (func here) which causes the problem is no longer
active, no amount of stack tracing will find it.  The process
which called it might not even still exist, it might have
received the error return, and exited.

Finding this kind of thing requires very careful and thorough
code reading, analysing every lock, and making sure that lock
gets released, somewhere, on every possible path after it is taken.
The best you can really hope for from examining the wedged system
is to find which lock is (usually "might be") the instigator of it all.
That can help narrow the focus of code investigation.

Mouse, start with the code you added ... make sure there are
no problems like this buried in it somewhere (your own code, and
everything it calls).   If that ends up finding nothing, then
the best course of action might be to use a fairly new kernel.
Some very good people (none of whom is me, so I can lavish praise)
have done some very good work in fixing most of the issues we
used to have.  I haven't seen a tstile lockup in ages, and I used
to quite often (fortunately mostly ones that affected comparatively
little, but over time, things get more and more clogged, until a
reboot - which can rarely be clean in this state - is required).

kre


Re: killed: out of swap

2022-06-14 Thread Robert Elz
NetBSD implements overcommitted swap - many processes malloc()
(or mmap(), which is what malloc() really becomes in the current implementation)
far more memory than they're ever going to actually use.  It is only
when some real physical memory is required (rather than simply a marker
"zero filled page might be required here") that the system actually
allocates any real resources.   Similarly pages mapped from a file only
need swap space if they're altered - otherwise the file serves as the
backing store for it.
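
A trivial demonstration of the effect (a sketch, assuming a 64 bit
platform, and subject to whatever resource limits apply): reserving far
more memory than will ever be touched succeeds, and only the pages
actually written need any real backing.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	size_t huge = (size_t)64 * 1024 * 1024 * 1024;	/* 64 GiB */
	char *p = malloc(huge);

	if (p == NULL)
		return 1;	/* limits (or no overcommit) said no */
	memset(p, 1, 4096);	/* touch just one page */
	printf("reserved %zu bytes, touched one page\n", huge);
	free(p);
	return 0;
}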

Once upon a time there was a method to turn overcommitted swap off, and
require actual allocations (of RAM or swap) to be made for all reserved
(virtual) memory.  I used to enable that all the time - but I haven't seen
any mention of it in ages, and the mechanism might no longer still exist.

kre



Re: procfs files vs symlink

2022-01-16 Thread Robert Elz
Date:Fri, 14 Jan 2022 06:22:11 + (UTC)
From:RVP 
Message-ID:  <8ad9feaf-a513-d33d-c887-3ca8407c...@sdf.org>

  | It does not, and wasn't meant to. I noticed that d_type was being
  | set to VREG and attached a patch for that to my reply.

Oh.  Then apologies.  I saw ...

  || The readlink on 4 will fail because it is no longer the symlink it
  || originally was:

[...ktrace output omitted ...]

  || Anyway, here's one more patch:

and just assumed that the patch that followed was intended to address
the earlier issue, though you didn't actually say that.

  | I think this does it:

Yes, that's a much better solution, and I see Christos committed it.

kre


Re: procfs files vs symlink

2022-01-13 Thread Robert Elz
Date:Thu, 13 Jan 2022 21:12:51 + (UTC)
From:RVP 
Message-ID:  

  | The patch is for processes to know that stat() will have to be
  | called for that particular dirent.

Yes, I understood the patch.  But why?

  |  DT_REG would not be right there.

Not always.  No.  But if the fd refers to a regular file,
it would be, right?

  | (I've done this myself: call stat() if dirent.d_type is DT_UNKNOWN;
  | otherwise, take dirent.d_type as valid and save a syscall.)

Sure, d_type is an optimisation.  Since procfs knows what types
of files it has -- either it is creating them, or they come from
an open fd, which has a known type -- why not implement the
optimisation?
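
That is, the usual consumer-side idiom (a sketch) which setting d_type
properly enables: trust d_type when it is set, and fall back to a stat
only for DT_UNKNOWN.

#include <sys/stat.h>
#include <dirent.h>

static int
is_dir(const char *fullpath, const struct dirent *dp)
{
	struct stat st;

	if (dp->d_type != DT_UNKNOWN)
		return dp->d_type == DT_DIR;	/* saves a syscall */
	if (lstat(fullpath, &st) == -1)
		return 0;
	return S_ISDIR(st.st_mode);
}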

  | A note in dirent.3 that procfs (and some others?) will always return
  | DT_UNKNOWN would be a good idea, I think.

DT_UNKNOWN is intended for filesystems with directories that do not
maintain the d_type field, and so would need to fetch the inode (or
its equivalent) to supply the type - which would be relatively harmless
when the application follows the readdir with a stat on each entry,
but is needless overhead when the application does not care.

If procfs is not setting d_type correctly, that should be fixed.
(Which does not mean setting the type to unknown, when it is known.)

If your patch makes any difference to the way ls /proc/self/fd
works, that is just a fluke (relates to the way ls happens to
sequence its operations) and in no way any kind of general fix.

kre


Re: procfs files vs symlink

2022-01-13 Thread Robert Elz
Date:Thu, 13 Jan 2022 06:52:01 + (UTC)
From:RVP 
Message-ID:  <91af8c4-d0bd-c31d-6b6a-355826d5...@sdf.org>

  | The EINVAL is caused by using readlink() on what was a symlink,
  | but, is not anymore because the fd now points to a regular file.

The analysis looks right, but

  | Anyway, here's one more patch:
  |
  | ---START PATCH---
  | --- sys/miscfs/procfs/procfs_vnops.c.orig   2022-01-12 20:57:25.944841288 +
  | +++ sys/miscfs/procfs/procfs_vnops.c        2022-01-12 23:07:44.736070587 +


that cannot be more than papering over one special case.

This problem isn't in procfs, but in ls, and is hard to fix.
The general issue is that any directory can change while ls
is processing it, leading to either incorrect (or even nonsense)
output, or apparent errors.

procfs is only notable here by being more dynamic than most
filesystems, and its self/fd subdir particularly, in that it changes
as the process opens and closes files - if ls was being run with -RL
it might recurse forever (I did not test it though).

Any fix for this would need to be in ls, not procfs.  If we were
willing to permit the extra overhead, and the possibility that it
might never terminate, then we could have ls stat every file
again just as it is preparing the output, and in case an inconsistency
is discovered, go back and start again (which would mean needing to
buffer all output until all was known to be consistent, as the sort
results might differ with updated info about the file).

But that's too much to expect, and in any case would require the
following more reasonable change to have any hope of the preceding
terminating.   That is for ls itself to make sure to do nothing
which can create or remove any files while it is collecting its
info, so that we can avoid the almost guaranteed problem observed
here - so no opening or closing any files either, which provides a
challenge for implementing recursive listings, but that could be
mitigated by requiring stability only while reading any one directory,
allowing sub-dirs to be opened only after completing collecting
info from the current dir.   This doesn't prevent inconsistencies,
but it should prevent ls from itself causing the issue, guaranteeing
bad output (for dirs like /proc/x/fd).

Or we could do a subset of that, and just not close the dir,
even when ls has reached EOF, until all entries have had their
data collected, so ls would at least be doing its stat() on
the same files it read.  This should be easy to implement,
but is really just a fix for this one obscure issue (no-one
normally has any interest in getting a listing of ls's fd
subdir of /proc!)

Or we could simply accept that this cannot be perfect, static
listings of dynamic data will have issues - potential issues
at a minimum, and attempting to fix that completely is either
impossible, or at least so costly as to be not worth it
(we could add an O_FREEZE open option to essentially make a
snapshot of the file/directory while the returned fd is open,
guaranteeing that all references via that fd would see
unchanged data).

But singling out procfs for special attention isn't the
right thing to do (/tmp can have similar problems, slightly
less reproducible, when used with shells that implement pipes
or command substitution (or similar, like bash's process substitution)
using files (or fifos) in /tmp).

kre


Re: procfs files vs symlink

2022-01-12 Thread Robert Elz
Date:Tue, 11 Jan 2022 22:20:15 +0100
From:Manuel Bouyer 
Message-ID:  

  | > What causes that EINVAL?
  |
  |
  | I'm not sure (someone suggested that the file descriptor has been closed
  | when ls tries to fstat() it, but I can't confirm this).

That should generate EBADF not EINVAL.  Attempting readlink()
on something that is not a symlink, and various other possibilities
like that would be more probable.  EINVAL isn't listed as a possible
error return from [f]stat ... not that that guarantees that it cannot
happen, particularly from within emulation code.

kre


Re: eventfd(2) and timerfd(2) APIs

2021-09-18 Thread Robert Elz
Date:Sat, 18 Sep 2021 15:54:06 -0700
From:Jason Thorpe 
Message-ID:  <63bf9e95-498a-4389-9a14-2f3c87a51...@me.com>

  | I've changed the man pages to state "set for non-blocking I/O".

That should be much better.

  | Yes, they're file descriptors, so close(2) gets rid of them.
  | Does this really need to be stated explicitly?

I would, since open() doesn't make them, it's just a few extra words,
and an extra Xr in the SEE ALSO.

  | st->st_birthtimespec = st->st_ctimespec = tfd->tfd_btime;

ctimespec should really be mtime, not btime (but you can do it after
the itimer_lock() region by just copying mtimespec).

Making it a pretend FIFO is reasonable.

  | Actually, fchmod(), fchown(), etc. only work on DTYPE_VNODE descriptors.

That probably should be fixed, POSIX certainly requires it to work on
shared memory segments (though it limits which bits are required to
exist there).

  | You'll get EBADF if you try it on anything else

and that's definitely wrong, on a pipe the system is allowed to return EINVAL,
and that's what ought to be returned whenever the fd is not something that
supports chmod (etc).   EBADF should only be used when the process doesn't
have the fd open.

But none of this part is related to your proposal, I'm not suggesting that
you need to go fix any of that.

kre



Re: eventfd(2) and timerfd(2) APIs

2021-09-18 Thread Robert Elz
Date:Sat, 18 Sep 2021 13:21:27 -0700
From:Jason Thorpe 
Message-ID:  <5e7b8a22-14c2-4dce-ace2-31552f412...@me.com>


  | >  unless the
  | >  .Nm
  | >  object was created with
  | >  .Dv TFD_NONBLOCK .
  |
  | I'm using those names, because those are the names used in the Linux API.

It wasn't the names I was concerned about.

  | If you look at the code (it's on the thorpej-futex branch),
  | you will see that they are aliases for O_NONBLOCK and O_CLOEXEC.

That was kind of obvious anyway from the man page:

  The following flags define the behavior of the resulting object:
  .Bl -tag -width "EFD_SEMAPHORE"
  .It Dv EFD_CLOEXEC
  Sets the
  .Dv O_CLOEXEC
  flag; see
  .Xr open 2
  for more information.

So:
  | I will clarify this in the man page.

probably isn't really necessary.   I was more concerned with the
"unless the object was created with" - implying that if those flags
are changed later, that would be irrelevant, as it is the state at
create time that matters.   That would be unfortunate indeed, but:

  | Actually, I didn't plumb fcntl through because just about nothing

might explain part of that (though you can't avoid the ability to
alter O_CLOEXEC that way, as that's a much higher level operation).

  | else plumbs it through either, but I'll go ahead and do so.

Please do.   What other things don't permit fcntl() to work?   We
should fix any of those.

  | The behavior of timerfd with respect to read is documented in my man page:

Yes, I saw that.

  | Writes to a timerfd return an error.  I will clarify this in the man page.

That would be useful.   You might want to also indicate how these
descriptors are destroyed (I assume just close(2) but who knows).

  | > Finally, what does fstat() return about these fds?

The one I should have asked about, but forgot, was (st_mode & S_IFMT)
Ie: what kind of object are these things pretending to be?

Since they're fd's, they can be inherited, open, by other processes
(and since the man page hints at it, probably sent through a AF_UNIX
socket), but particularly in the former case, the receiving process
needs to know (or at least be able to find out) what it is that is on
this fd it has received.

  | Of course, we don't document what these are for other kinds of descriptors,

for many there's no need, as everything is exactly what stat(2) claims
it will be.   For any where that is not true, or is insufficient, we
should be documenting it.

If this was just a linux compat hack, so linux binaries could run,
then most of this wouldn't matter - the application would do whatever
linux allows it to do, and nothing actually built on NetBSD would
ever care.

But if these are to be full NetBSD interfaces, they need to be
both complete (and sane) and properly documented.   That means
which of the f*() interfaces (fstat, fchmod, fchown, ...) work,
and which simply return errors, and whether any of them which
do work do anything useful.   Not necessarily documented in
those 2 man pages, but perhaps a section 9 page, and/or
section 4 if theses things are pretending to be some kind
of special file.

kre



Re: eventfd(2) and timerfd(2) APIs

2021-09-18 Thread Robert Elz
Date:Sat, 18 Sep 2021 10:26:29 -0700
From:Jason Thorpe 
Message-ID:  <986563ad-88c2-41b9-bf69-51b26240b...@me.com>

 
  | https://www.netbsd.org/~thorpej/timerfd.2

This one contains duplicated text...

  Because they are associated with a file descriptor, they may be passed
  to other processes, inherited across a fork, and multiplexed using
  .Xr kevent ,
  .Xr poll ,
  or
  .Xr select  they are associated with a file descriptor, they may be passed
  to other processes, inherited across a fork, and multiplexed using
  .Xr kevent 2 ,
  .Xr poll 2 ,
  or
  .Xr select 2 .

That should be fixed before anything is committed.

Apart from that both man pages contain text like

  unless the
  .Nm
  object was created with
  .Dv TFD_NONBLOCK .

Since these things are working with file descriptors, I assume that
fcntl(2) can be used to manipulate flags like O_NONBLOCK and O_CLOEXEC
in which case I would guess (and hope) that the state of those flags when the
object was created isn't what is relevant, but rather the state of the flags at
the time of the operation concerned.

The man pages should probably be reworded with that in mind.

The exact relationships of the {event,timer}fd_*() functions
and read()/write() are also not clear to me - are those just wrappers
around read/write or are they distinct sys calls of their own?

I initially assumed the former, but then I see that timerfd_settime()
has an extra flags arg, which write() (I presume) has no easy way to
pass in, so now I am not sure.

If these are distinct operations, how do actual read()/write() interact?
What would the flags be on a write()?

Finally, what does fstat() return about these fds?   What is the dev_t ?
What is the inode number, is the link count meaningful, how about the
uid and permissions?   And what affects the time fields?

I suspect that regardless of the merits of the interfaces, the specs
need some improvement.

kre



Re: Is there the system call to set all 3 time at once?

2021-09-09 Thread Robert Elz
Date:Thu, 09 Sep 2021 11:58:23 +
From:bsdairekii...@posteo.de
Message-ID:  <2cbd055de44ea130b54e525543d5d...@posteo.de>

  | I have looked NetBSD manual page and find this description since NetBSD 
  | 5.0 in utimes and since NetBSD 6.0 in utimensat.

I believe that.   But don't you think that if anyone had really believed that
being able to set the birthtime explicitly was a useful thing to do, it
would have been done sometime in the many many years since that text was
added?

  | I have been using UFS2 on NetBSD longer than UFS1 and will continue to 
  | do so for the next 10 years.

Good.

  | Therefore, as in the description, a 
  | system call should be added that allows to set all 3 times at once.

That doesn't follow (as Tom said).   If you think you have a use for
birthtime, tell me what it is, and I will show you how that cannot work
(in general, in some very limited restricted scenarios it might be
functional, but those are so restricted that nothing ever needs to
set birthtime, other than the system).

If you don't have an actual use for it, why would you care whether you
can set it or not?

kre




Re: Is there the system call to set all 3 time at once?

2021-09-08 Thread Robert Elz
Date:Tue, 07 Sep 2021 22:14:31 +
From:bsdairekii...@posteo.de
Message-ID:  

  | this page https://man.netbsd.org/utimensat.2 describes this.
  |
  | > Ideally a new system call will be added that allows the setting of all 
  | > three times at once.
  |
  | Where can I find this system call?

You cannot, it doesn't exist.

  | If it is not available, can you create it?

Can?  Yes.  Will?  No.

birthtime is a waste of space, it's not standard, not portable (not even
within NetBSD, UFS2 supports it, UFS1 does not) and useless.

Simply forget that it exists (that is, when it exists).

kre




Re: Some changes to autoconfiguration APIs

2021-08-07 Thread Robert Elz
Date:Wed, 4 Aug 2021 17:52:46 -0700
From:Jason Thorpe 
Message-ID:  <68ff8737-f347-4a7f-960b-9e4a6ca9e...@me.com>

  | It addresses the concerns about compile-time type checking
  | by using an anonymous structure constructed in-line

Is there something in the C definition of such things which guarantees
that the un-init'd fields all get set to 0/NULL ?   Or is that just
happening because of the "const" spread all over - which might be causing
the compiler to allocate static storage, but which is not required of const?

Also, is there some way to distinguish an integer valued attribute which
is explicitly set to 0 from one which isn't set at all (in which case we
might want to default it to some other value) or do we only ever use
attributes via various kinds of pointers (or perhaps never have any,
anywhere ever, where 0 is a sensible value) ?
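
For concreteness, the kind of construct in question (a sketch, all
names invented):

struct cfargs {
	int		unit;
	const char	*name;
	void		(*callback)(void *);
};

static void
attach_example(void)
{
	/*
	 * The question above: are .name and .callback guaranteed NULL
	 * here?  C99 6.7.8 says yes - members omitted from a brace (or
	 * designated) initializer are initialized as if the object had
	 * static storage duration, const or not.
	 */
	const struct cfargs args = { .unit = 3 };

	(void)args;
}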

kre



Re: Some changes to autoconfiguration APIs

2021-08-01 Thread Robert Elz
And as a possible optional extra, one fairly
easy way to add type checking would be to add
an extra dummy printf format string arg, unused
by config_found (would cost one useless ptr
push at each call, but we can bear that), declare
config_found __printflike, and let the compiler
do arg verification (correct types, etc) for
us.  We'd miss correct pointed-at type verification
but everyone avoids that using (void *) anyway.

Provide a few standard format strings for
the common use cases, whatever they are, so
people just naturally use those (they'd just
be things like "%d%d%d%p%u" where the %d%d
and %d%p are two key and data pairs, and the
%u is for a null key that terminates things).
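
A sketch of what I mean (every name here is
invented, this is not the real autoconf API):

#include <sys/cdefs.h>

struct device;

struct device *config_found_v(struct device *parent,
    void *aux, const char *fmt, ...) __printflike(3, 4);

/* a standard format for "two key and data pairs" calls */
#define	CFARGS_FMT	"%d%p%d%p%u"

/*
 * callers would look like
 *	config_found_v(self, aux, CFARGS_FMT, KEY_A, p1, KEY_B, p2, 0);
 * and a mistyped argument becomes a compile time warning.
 */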

Seems like it could work to me, but I haven't
worked on autoconf code since the early 80's I
think, so who knows.

kre


Re: Some changes to autoconfiguration APIs

2021-08-01 Thread Robert Elz
Date:Mon, 2 Aug 2021 01:36:26 +0100
From:David Brownlee 
Message-ID:  


  | 3) This email takes one of Taylor's suggestions and hangs an explicit
  | version on the calls, which should give reasonable forward
  | compatibility

That solves the wrong problem,  there is (one
exception) no binary compat issue; when the
kernel is recompiled with a new CF_VERSION
all the code is recompiled with that new
version ID, they all get it from the same .h
file, so all depend upon it.

The exception is modules, if one wanted to be
able to change versions without a kernel vers
bump, but I see no reason at all to ever need,
or even want, that, it isn't as if anything here
makes kernel version bumps less needed.

The issue, if you look at Jason's message, is
what happens when the semantics of an option
are to change, or similar, not trying to keep
some kind of ABI stability.

Personally I'm something of a fan of trusting
the capabilities of the developers who work in
this area: make the interface simple enough to
be understood, then just let people use it.

kre


Re: Is there a command to change btime (creation time of files)?

2021-05-28 Thread Robert Elz
Date:Fri, 28 May 2021 16:49:26 +
From:Kenny 
Message-ID:  


  | I am using NetBSD 9.2 (amd64) with ZFS as file system and I have
  | not found a command to change btime for my files.

Don't bother, the birth time is a total waste of space.  It is used
by nothing that matters, and is useful for nothing at all.   It is
mostly unsupported (the stat command is just about the only
way to view it, and while it is possible to set it (though not
arbitrarily), the procedure is baroque and not worth explaining).

kre



Re: 9.1: boot-time delay? [WORKAROUND FOUND]

2021-05-27 Thread Robert Elz
Date:Thu, 27 May 2021 20:19:06 +
From:"Koning, Paul" 
Message-ID:  <8765ae3a-b5b7-4b67-82ce-93473a5b9...@dell.com>

  | In this particular case it's converting frequency to period,
  | that is a sensible conversion.

But it isn't, you can't convert 60 ticks/second into some number of
milliseconds, the two are different units.   That's just the same as
you can't convert metres/second (velocity) into seconds.   Given a
particular velocity, and some number of metres, you can calculate the
time it takes to move that far, but that isn't converting velocity
into seconds.

What is happening is that (in one direction or the other, depending
upon which function) it is converting between the number of ticks that
occur and the duration of an interval (which of course depends upon the
frequency, but it is not converting the frequency).

The hztoms() function is no different than a ustoms() function, except
that in the former we have a semi-variable (the frequency) which is simply
a constant (1000) in the second - but that's only a variable because we
allow HZ to vary (by architecture, and sometimes, configuration).   Calling
ustoms() thousandtoms() would be absurd.   So is calling this one hztoms().

  | You could say "hztoperiodinus" but that's rather verbose.

That doesn't help, we're still not converting a frequency to a period.

And in another reply:

Johnny Billquist  said:
  | Frequency essentially means a counting of the number of  time something
  | happens over a specific time period. With hertz, the time  period is one
  | second.
 
Sure.

  | So then converting the number of times an event 
  | happens in a second into how long it is between two events makes total 
  | sense.

It would, but that's not what the functions do.   What they do is tell
how many ticks occur in a specific number of milliseconds (or vice
versa).   Your calculation is just (in milliseconds) 1000/hz, and assuming
hz isn't varying, is a constant.

  | A tick is not a duration. A tick is a specific event at a specific time.  It
  | has no duration. You have a duration between two ticks.

Sure, reasonable point, but as Mouse said, when we're dealing with this
stuff the number of ticks is counted as a representation of the number
of those durations, and we just say how many ticks happened.  The ticks
represent the duration between them - that might be slightly sloppy, but
it isn't outright wrong.

kre




Re: 9.1: boot-time delay? [WORKAROUND FOUND]

2021-05-27 Thread Robert Elz
Date:Thu, 27 May 2021 05:05:15 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | mlel...@serpens.de (Michael van Elst) writes:
  |
  | >Either direction mstohz or hztoms should better always round up to
  | >guarantee a minimal delay.
  |
  | And both should be replaced by hztous()/ustohz().

While changing ms to us is probably a good idea, when a change happens,
the "hz" part should be changed too.

hz is (a unit of) a measure of frequency, ms (or us) is (a unit of) a
measure of time (duration) - converting one to the other makes no sense.

What these functions/macros do is convert between ms (or us) and ticks
(another measure of a duration), not hz, so the misleading "hz" part of
the name should be removed (changed) if a new macro/function is to be invented.
(The benefit isn't worth it to justify changing the current existing names,
but we shouldn't persist with nonsense if we're doing something new.)
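
If someone does invent new ones, a sketch of what I mean (the names
here are invented), including the round-up behaviour mentioned above
so that a delay is never shorter than requested:

extern int hz;		/* ticks per second */

static inline int
mstoticks(int ms)
{
	return (ms * hz + 999) / 1000;	/* beware overflow if ms is huge */
}

static inline int
tickstoms(int t)
{
	return (t * 1000 + hz - 1) / hz;
}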

kre

ps: note that the variable "hz" (and the macro HZ) are used correctly --
their values are frequencies (ticks/second).



Re: ZFS L2ARC on NetBSD-9

2021-04-19 Thread Robert Elz
Date:Sun, 18 Apr 2021 18:58:56 +
From:Andrew Parker 
Message-ID:  <2245776.bZt9KSGgi3@t470s.local>

  | Does anyone else have a working L2ARC?

Sorry, don't even know what that is, and don't (currently anyway) use zfs,
but:

  | -   interval = hz * l2arc_feed_secs;
  | +   interval = mstohz(l2arc_feed_secs);

Are you sure about that part of the change (the earlier fragment looked
reasonable) ?

mstohz() when starting with seconds (which the name of that var suggests)
looks like it would be much smaller than intended, whereas simply multiplying
seconds by hz gives ticks, which looks to be the objective in all of that.
Alternatively multiply secs by 1000 to generate ms, and mstohz() that.

Watch out for potential overflow in all of this though.

kre



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Robert Elz
Date:Sun, 11 Apr 2021 18:14:44 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | +   spb = vnd->sc_geom.vng_secsize / DEV_BSIZE;

Do we know for sure here that vng_secsize >= DEV_BSIZE ?

When I first used unix (long long ago) the drives I used had a
sector size of 256 bytes (not DEC hardware).   (Floppies were
128 bytes, if I recall correctly.)

It would not be good if spb became 0 there.

kre



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Robert Elz
Date:Sun, 11 Apr 2021 14:25:40 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | Seems to have been introduced with netbsd-7.

Perhaps, but the effect was probably invisible until Jan this year
when the calculation of ncylinders was corrected - before then the
value would have been much bigger, and consequently, the entire image
would have been visible when it was used in a c * h * s calculation
to calculate the size.

kre



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Robert Elz
Date:Sun, 11 Apr 2021 14:25:40 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | +   dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;

While it shouldn't make any difference for any properly created image
file, make it be

(vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;

so that any trailing partial sector remains in the image.

kre



Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Robert Elz
Date:Sun, 11 Apr 2021 15:53:07 +0200
From:Manuel Bouyer 
Message-ID:  

  | On Sun, Apr 11, 2021 at 01:28:46PM -, Michael van Elst wrote:
  | > vnd computes a fake geometry based on 1MB cylinders.
  |
  | Why does this truncate the total number of sectors of the vnd ?
  | there's no reason to do so.

I agree, that's truly stupid.   No-one cares about cyl/head/sect any
more, all disk I/O is done using block numbers ... and even if someone
still had a real hardware interface that used CHS addressing, that's only
an issue for that physical device.   vnd is doing filesystem I/O, that's
all based on byte offsets into the file, which (faked) cylinder would be
used could not possibly be less relevant.   Just fake a chs for the
purposes of the labels, etc, like real modern drives do, after which
everyone ignores that and simply uses block numbers.

kre



Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Robert Elz
Date:Mon, 5 Apr 2021 01:14:01 +0200
From:Joerg Sonnenberger 
Message-ID:  

  | That is discussed in the security model Taylor presented a long time
  | ago. In short: nothing. In most use cases, you are screwed at this point
  | anyway

This is where the disconnect is happening I think.   Many of you are
simply not understanding the point.

I am not screwed, I just don't care.Is that so hard to understand?

Let me make it plainer.

I run systems on which I allow root logins with no password.   I have run
systems where root ssh access is permitted, put those together and you
get root access from over the net (and telnet would allow that as well).

Alternatively I can aim for greater security, and configure a root
password ... like say the system's host name.

NetBSD allows me to do all that - it might not be the standard configuration,
but it is possible.   You might think it is insane, and that's fine, but
there are reasons.

On recent NetBSD, as I understand it, I can

dd if=/dev/zero bs=N count=1 >/dev/random

and now I have "entropy".   But it refuses to provide a simpler knob
to do the same thing (or perhaps something a little saner, but equally
as simple to use).

The logic behind that makes no sense to me.

I understand that some people desire highly secure systems (I'm not
convinced that anyone running NetBSD can really justify that desire,
but that's beside the point) and that's fine - make the system be able
to be as secure as possible, just don't require me to enable it, and
don't make it impossible or even difficult to disable it - and allow
some kind of middle ground, not just "perfectly secure" and "hopeless".

kre




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Robert Elz
Date:Sun, 4 Apr 2021 15:28:13 +
From:Taylor R Campbell 
Message-ID:  <20210404152814.3c56360...@jupiter.mumble.net>

  | you can let NetBSD take care of it automatically
  | on subsequent boots by running `/etc/rc.d/random_seed stop' to save a
  | seed to disk.)

Is that file encrypted?   If it is, where does the decryption key come from?

If not, what prevents someone from reading (copying) the file from the
system while it is stopped (accessing the storage device via other methods)
and then knowing exactly what the seed is going to be when the system boots?

I think I'd prefer possibly insecure, but difficult to obtain from outside
like disk drive interrupt timing low order bits than that.   Regardless of
how unproven that method might be.

And what's the scheme for cheap low-end devices that have no writable storage?
(The proverbial internet toaster, for example).

Lastly, why would anyone presume that RDRAND generates less predictable
bits (less predictable to someone who knows how it works) than any of
the other methods that are used.   After all, all the chips are more or
less identical, what about them can absolutely guarantee unpredictable
data (a very rare thing for computers) and how can anyone be certain
that it has been correctly implemented without any bugs?

If we want really good security, I'd submit we need to disable
the random seed file, and RDRAND (and anything similar) until we
have proof that they're perfect.

Personally, I'm happy with anything that your average high school
student is unlikely to be able to crack in an hour.   I don't run
a bank, or a military installation, and I'm not the NSA.   If someone
is prepared to put in the effort required to break into my systems,
then let them, it isn't worth the cost to prevent that tiny chance.
That's the same way that my house has ordinary locks - I'm sure they
can be picked by someone who knows what they're doing, and better security
is available, at a price, but a nice happy medium is what fits me best.

kre



Re: partial failures in write(2) (and read(2))

2021-02-16 Thread Robert Elz
Date:Mon, 15 Feb 2021 23:18:33 +0100
From:Rhialto 
Message-ID:  

  | A system call with error can return with the carry set and the error and
  | short count returned in a separate registers. The carry bit is how
  | errors used to be indicated since at least V7 (even V6?) anyway.

Earlier than v6, this dates back to when much of the system was
written in assembly code (including many of the utilities).

The issue isn't how to return multiple values from the kernel, that's
easy, we even have standard sys calls (like pipe()) which do that
routinely.

The problem is that the definition of write() (and most other system
calls) is that they don't affect errno unless there is an error, and
if there is an error, they return -1 (which leaves no place to return
a short count as well).   This all actually happens in the libc stub.

We could, of course, invent new interfaces (a write variant with an
extra pointer to length written arg perhaps, or where the length arg
is a pointer to a size_t and that is read and then written with either
the amount written, or the amount not written).

But I don't believe that any of this is needed, or desirable.

We should first make sure that we do what POSIX requires, and simply
return a short write count (and no error) in the cases where that
should happen (out of space, over quota, exceeding file size limit,
and writing any more would block and O_NONBLOCK is set, more?).
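
Applications that care have to cope with short writes anyway, using
the classic loop (a sketch):

#include <errno.h>
#include <unistd.h>

static ssize_t
write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			return -1;	/* a real error */
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}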

In the other error cases we should simply leave things alone and
accept it - it is the way unix always has been, and we have survived.
If we have a drive returning I/O errors (on writes), do we really
expect that earlier data written will have been written correctly?
Do you want to rely upon that?It might have been possible once,
when drives were stupid, and simply wrote sectors in the order
presented, but with modern drives, with internal caches, which
write the data in any order they like, when they like, and do block
remapping when a sector goes bad, I wouldn't trust anything on
the drive once it starts saying write failed.   Pretending that
the first 8K of a 16KB write worked, and there was an I/O error
after that is folly.   It may easily have been that the 2nd 8K
block was written, and the first one gave up in error, eventually.
Some of the data intended to be written may have been written, but
we have no sane way to work out what (again, entire new interfaces
could allow the info to be returned, but to what point?  Who would
ever write code to make use of that info?)

It's even worse for the remaining cases, where the error is caused
by broken software (either a broken kernel doing insane things, or
a broken application asking to write data from memory it does not
own, etc).   Nothing can be assumed reliable in cases like that.

So, let's all forget fanciful interface redesigns, fix whatever we
need to fix to make things work the way they are supposed to work
(if there is anything) and leave the rest as "the world just broke"
type territory.

kre



Re: partial failures in write(2) (and read(2))

2021-02-06 Thread Robert Elz
Date:Fri, 05 Feb 2021 20:43:30 -0500
From:Greg Troxel 
Message-ID:  

  | An obvious question is what POSIX requires, pause for `kill -HUP kred` :)

Hey!   wiz is the daemon, I'm an angel...

  | I think your case (a) is the only conforming behavior and obviously what
  | the spec says must happen.

For what I'd call detectable in advance errors (and signals) yes, I agree,
that's required (that is, all the cases where you can tell simply from the
state of the world that the write cannot complete as asked).  For hardware
errors (and in that category I'd include the case of a buffer that
starts out with valid addresses and continues to invalid ones, where a
SIGSEGV would perhaps also be acceptable behaviour, but if not, EFAULT
is generated), I don't think anything is specified at all.

The standard recommends advancing the file offset to the point of the error,
but doesn't require it, and certainly doesn't require returning the number of
bytes written up to the point where the error occurs (nor does it preclude
that I believe).   This is not surprising, as what it describes is what
systems actually do, and most systems traditionally upon detecting an I/O
error, or copy{in/out} failure, simply return -1 with errno set, rather
than attempting to advise the application how much data was actually 
transferred before the error.

kre



Re: CVS commit: src/external/gpl3/gcc/dist/gcc/config/aarch64

2020-10-16 Thread Robert Elz
Date:Fri, 16 Oct 2020 04:07:31 +
From:"Thomas Mueller" 
Message-ID:  <20201016052422.e063084...@mail.netbsd.org>

  | Should I add ,linux to the end of the procfs line?

You can, but it isn't needed these days -- I used to mount procfs twice,
once without the linux option, on /proc, and once with, on /emul/linux/proc,
but there seems to be little point in that any more (even though the linux
/proc has a whole bunch of trash that has nothing to do with processes, and
should be, and generally is, available from /kern ... /proc/cpuinfo is an
example of that, though that one is missing from kernfs and should be added
there).

I do add "hidden" to the mount option list though, there's essentially
no point in including /proc /kern /dev/pts (or anything else like those)
in default df output (which is the only thing "hidden" generally affects).
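For reference, the kind of /etc/fstab entries I mean (a sketch only -
from memory, so check fstab(5) and mount_procfs(8) for the exact
option syntax):

    procfs  /proc           procfs  rw,hidden
    kernfs  /kern           kernfs  rw,hidden
    ptyfs   /dev/pts        ptyfs   rw,hidden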

kre



Re: /dev/random issue

2020-10-01 Thread Robert Elz
Date:Thu, 1 Oct 2020 18:57:12 +0200
From:Manuel Bouyer 
Message-ID:  <20201001165712.ga1...@antioche.eu.org>

  | which, basically. means that one should not use reboot, halt or poweroff
  | any more ...

And of course, the system must never crash, hang, or suffer a power failure.

kre



Re: wait(2) and SIGCHLD

2020-08-16 Thread Robert Elz
Date:Sun, 16 Aug 2020 16:13:57 -0000 (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  

  | They don't vanish, they get reparented to init(8) which then wakes up
  | and reaps them.

That probably would work, approximately, but isn't what's supposed to
happen when a child's parent is ignoring SIGCHLD - the child should
skip zombie state, and simply be cleaned up.

The difference would be detectable if init were sent a SIGSTOP
(assuming that isn't one which would cause a system panic)
so it would stop reaping children (temporarily) - processes of
the type in question should not be showing up as zombies.

kre



Re: wait(2) and SIGCHLD

2020-08-14 Thread Robert Elz
Date:Fri, 14 Aug 2020 20:01:18 +0200
From:Edgar =?iso-8859-1?B?RnXf?= 
Message-ID:  <20200814180117.gq61...@trav.math.uni-bonn.de>


  | 3. I don't see where POSIX defines or allows this, but given 2., I'm surely
  |missing something.

Actually, I did go take a look, it is in the XSH page for _Exit()
under "Consequences of Process Termination" (some other places reference
this section).

kre



Re: wait(2) and SIGCHLD

2020-08-14 Thread Robert Elz
Date:Fri, 14 Aug 2020 20:01:18 +0200
From:Edgar =?iso-8859-1?B?RnXf?= 
Message-ID:  <20200814180117.gq61...@trav.math.uni-bonn.de>

  | 3. I don't see where POSIX defines or allows this, but given 2., I'm surely
  |missing something.

It is specified to work this way in POSIX, though right now I don't
have the time to go dig out exactly where.

Setting SIGCHLD to SIG_IGN effectively means that you want to ignore
your children - they then don't report any exit status to their parent,
but simply vanish when they exit.   Thus when the parent does a wait()
it has no children, and gets ECHILD.

Leave (or set) SIGCHLD to SIG_DFL and you don't get signals, but child
processes do report status to their parent.   Catch SIGCHLD and you'll
get signalled when a child exits (I'm not sure if NetBSD guarantees one
signal delivery for each exited child or just a signal if there are
some unspecified number of exited children).
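A tiny demo of the SIG_IGN case (a sketch using only standard POSIX
interfaces):

    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
            signal(SIGCHLD, SIG_IGN);       /* ignore my children */
            if (fork() == 0)
                    _exit(0);               /* child: leaves no zombie */
            if (wait(NULL) == -1)           /* no status was kept ... */
                    perror("wait");         /* ... so this shows ECHILD */
            return 0;
    }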

The action on an ignored SIGCHLD is SysV-inherited behaviour;
Bell Labs (v7/32V) and CSRG BSD systems didn't act this way.

kre



Re: style change: explicitly permit braces for single statements

2020-07-12 Thread Robert Elz
Date:Sun, 12 Jul 2020 13:01:59 +1000
From:Luke Mewburn 
Message-ID:  <20200712030159.gh12...@mewburn.net>


  |   | IMHO, permitting braces to be consistently used:
  |   | - Adds to clarity of intent.
  |   | - Aids code review.
  |   | - Avoids gotofail: 
https://en.wikipedia.org/wiki/Unreachable_code#goto_fail_bug


Permitting the braces is probably no big deal, but does none of
that.   Actually using the extra braces might, but unless you change
"permitting" to "requiring", that's unlikely to happen a lot.

I simply cannot see myself changing

if (p == NULL)
return NULL;

into

if (p == NULL) {
return NULL;
}

Aside from anything else, the closing brace occupies an extra
line (and often two, as those are often followed by blank lines)
which means two fewer lines of context I get to see in my window
(however big the window is - enlarging it still means 2 fewer lines
of context than would be possible) - and that's for each time this
is done.

But as long as they're just permitted, and not required, then I
don't have a big problem with it - but note that if I'm working
on code written like that, I'm likely to delete non-required
meaningless braces (just as cleaning up trailing whitespace,
fixing tab vs space indentation, and wrapping long lines, etc).

kre



Re: pg_jobc going negative?

2020-07-10 Thread Robert Elz
Date:Fri, 10 Jul 2020 16:47:28 +0200
From:Rhialto 
Message-ID:  <20200710144728.gy3...@falu.nl>

  | It also seems to be involved in deciding wether to send a SIGTTOU or
  | SIGTTIN to a process

Ah, right, thanks ... when I was reviewing uses in the kernel I
was concentrating on places where pg_jobc changes, and just sort of
dismissed the places where it was simply examined, and then I never
went back to them again.   I will make sure my plans keep this working
(I knew that orphaned pg processes no longer do any kind of auto-stop,
though they can still be sent SIGSTOP I believe).


  | I found this above fixjobc() which goes into a bit more detail what is
  | being counted:

Yes, I've been looking at that - that's where things are going wrong, I
think: when looking at the children, each child can decrement pg_jobc,
but it only seems to get incremented once.   Easy to see how it becomes
negative.

mo...@rodents-montreal.org said:
  | But each of those steps involves some winnowing-down.

Yes, again speculating, but my guess would be (for libkvm)
"if someone uses it, we make it available, if no-one does,
we don't" and quite probably ps was using it via /dev/mem
already (though what use that was supposed to have I cannot guess,
perhaps just for debugging the kernel implementation).

After that it appears that everything was just copied to each new
interface, based upon the philosophy of phasing out uses of libkvm,
which would mean making sure that the alternate interface could do
everything libkvm could do, otherwise someone would find the missing
data a justification for sticking with libkvm.

mo...@rodents-montreal.org said:
  | Did your userland sweep include pkgsrc? 

No, I don't have the resources to do that.  I don't use many packages
(so don't have many distfiles) and have even fewer of those unpacked.
My test system for HEAD has none of it at all.

mo...@rodents-montreal.org said:
  | [...] I'd be astonished if there aren't at least a few
  | programs there that grub around in things like this.

No question, there are, but this particular field seems very unlikely
to have any users - really really unlikely.

kre



Re: pg_jobc going negative?

2020-07-10 Thread Robert Elz
Date:Fri, 10 Jul 2020 09:54:25 -0400 (EDT)
From:Mouse 
Message-ID:  <202007101354.jaa16...@stone.rodents-montreal.org>

  | > I see 3 ways forward...
  |
  | I count 4, but maybe kre is counting two of them as subclasses of a
  | single one.

Maybe, no-one said I could count (and the "I see 3" was added before
I started enumerating them...)

  | It seems to me that if pg_jobc is exported, someone presumably once
  | cared and there's thus a decent chance someone still cares.

Really?   I'd have thought it more likely, given the context, that
"it can be obtained via /dev/kmem" -> "it can be fetched via libkvm" ->
"it can be fetched via sysctl/procfs".   That is, simply maintaining
the available data, rather than evaluating its usefulness.

  | Did you do a sweep for userland references to it?

I didn't, but you're right, I should have  ...  and now have.

  | It seems plausible to me that it's used, at least for zero/nonzero,
  | by userland tools that are interested in process groups.

And you're right, it is used, by exactly one userland tool ... ps
If given "ps -o jobc" that field is printed.  It is also included
with "ps -j" output.

The only other references I can see are in libkvm (making the data
available that way) and the kernel.

ps(1) itself obviously doesn't really care, it is just showing the
info that is made available to it, so we'd only have an actual user
if one of us humans actually finds that output useful for something.

Anyone?

  | Is there any record of who added the code to export pg_jobc to
  | userland?

The earliest reference I can find is when cgd created the (no longer existing)
kern_kinfo.c in 1993 - pg_jobc was copied into the einfo struct (e_jobc) in
that version kern_kinfo.c 1.1 - changes after that have moved the code around,
but there's been no change of substance (to the exporting code).   I'm not
sure where to look for the corresponding (now deleted) early kinfo code (it
was all moved into init_sysctl.c when that was created in 2003, and then
later moved out to kern_proc.c)

That is, I'd say it comes from CSRG BSD code, and we'd need to delve back
into their sccs archives to see when it was first done (and hopefully, why).

kre



Re: pg_jobc going negative?

2020-07-10 Thread Robert Elz
Date:Tue, 9 Jun 2020 08:23:19 -0000 (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

I have spent a little time looking at this now, and I think
it is just all a mess.

  | pg_jobc is not a reference counter.

Maybe not technically a "reference" counter, as what it is counting isn't
strictly references, but anything that has x++ and if (--x == 0) do_stuff()
is essentially a reference counter.   What it is counting references to
isn't clear (particularly here), but that is what it is doing, or trying
to do (it has all the same issues as things which really are ref counters).

  | The assertion probably stopped
  | a bug in a different place by coincidence.

I doubt that, this code is not at all good.   There is no question but
that the counter does not count properly.

As best I can work out, and someone correct me if I'm wrong,
the whole purpose of pg_jobc is so that orphanpg() can be called
when a process group is orphaned (no longer has a session leader).

If it has any other use, I cannot see it.

What's there now simply doesn't work for this purpose.   It was
suggested that the FreeBSD code has been modified from what we
have, and that simply adopting that might work.   I went to look
at their code, but before I did that, I saw that a month ago
(that is, just around the time of the original discussion here)
they copied maxv's KASSERTs into their code.  A week ago they removed
them again, as they were occasionally firing.   That tells me,
even without looking at their code, that they (still) have similar
bugs to what we do, and thus that just importing their code won't
help us.

I see 3 ways forward...   simply drop the KASSERT the way that FreeBSD
have done, and let things return to the semi-broken but rarely bothersome
code that was there before.   That's not really a good idea, as the
sanitizers apparently find problems with the code the way it was (not
too surprising, deleting the KASSERT won't fix the bugs, it just stops
noticing them explicitly).

Or, we could properly define what pg_jobc is counting, and then make sure
that it counts whatever that is properly - is incremented in all the
appropriate places, and decremented properly as well.   Currently
the comment about it in proc.h is:
/*
 * Number of processes qualifying
 * pgrp for job control 
 */
which makes it clear that it is a reference counter (not necessarily
counting the number of something which exists, so that something can be 
deleted, but it is counting references to processes).   Unfortunately
I have no idea what "qualifying pgrp for job control" is supposed to mean.

That could be done, but it seems like a lot of work to me, and not easy.

Another (more radical) approach would be to simply drop orphanpg()
completely, and thus no longer need pg_jobc at all.   The system
wouldn't be bothered by this at all - all orphanpg() does is locate
stopped members of the process group, and send them SIGCONT (to restart)
and SIGHUP (to probably kill them - or at least inform them that their
environment has changed).   If the system wasn't doing this, users manually
(or a script run from cron or something) could do it instead.  If not done
at all, badly behaving session leaders (processes which don't clean up
their stopped jobs before exiting - including ones with bugs that cause
them to abort) would over time cause a growing number of stopped jobs to
simply clutter the system, consuming various resources, but otherwise
harmless (there is nothing special about the processes, they can be killed,
or continued - it is just that the process which would normally do that
is no longer around).

Third, and the option I'd suggest, is to revert to more basic principles,
remove the pg_jobc attempt to detect when a session leader has exited,
or changed to a different process group, and instead at candidate events
(any time a process leaves a process group, for any reason) check if that
process was the session leader, and if it is, clean up any now orphaned
stopped processes.   This is likely to be slower than the current attempt,
but is much easier to make correct (and much less likely to cause problems,
other than dangling orphaned stopped processes, if incorrect).
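In other words, something of roughly this shape (a sketch only - the
function is invented, and locking and the precise orphan test are
elided; the field names are just meant to suggest the existing ones):

    /*
     * Called whenever process p leaves process group pg, for
     * any reason (exit, setpgid(), setsid(), ...).
     */
    static void
    pgrp_leave(struct proc *p, struct pgrp *pg)
    {
            if (p->p_session->s_leader == p) {
                    /*
                     * The session leader is going: restart and HUP
                     * any stopped members of the now orphaned group,
                     * as orphanpg() does today.
                     */
                    orphanpg(pg);
            }
    }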

As best I can tell, all the data needed exists already, all that will be
needed is to modify the code.   We can even leave pg_jobc in the pgrp
struct, to avoid needing a kernel version bump (and for reasons I cannot
imagine, pg_jobc is copied into kinfo and einfo structs for sysctl and /proc
access to the process data, so leaving it around avoids needing to version
those interfaces as well ... the value would be a meaningless 0, always, but
I really find it hard to believe that anything would ever care, or even 
notice).

Opinions on any of this before I start banging on the code?

kre

ps: I don't believe that any of the problems here are race conditions,
the counter is simply not maintained correctly (which isn't to say that
races might not also exist, just that they are not the cause here).

kre

Re: pg_jobc going negative?

2020-06-30 Thread Robert Elz
Date:Mon, 29 Jun 2020 23:22:52 +0200
From:Kamil Rytarowski 
Message-ID:  

  | Ping? This kernel crash is blocking GDB/etc and it is an instant crash.

Sorry, been side-tracked, will get to it soon.

kre



Re: stat(2) performance

2020-06-16 Thread Robert Elz
Date:Mon, 15 Jun 2020 22:34:01 +0200
From:Joerg Sonnenberger 
Message-ID:  <20200615203401.ga91...@bec.de>

  | > Running it under ktrace(1) shows it doing a stat(2) for every metadata
  | > file in the tree. The machine sounds like it is hitting the disk for
  | > every one. Is there any kind of cache for the attribute information
  | > that stat needs ?

There is, but like all caches, it only works for the 2nd and later
references, the first time through nothing is cached.

  | Raise kern.maxvnodes?

Unless it happens to be very small, I doubt that will help, or not
much - I'd expect that generally cvs is roaming the tree, updating
files, and moving on (largely - the prune step later is a bit of a redo)

It may be that there's nothing that can really be done - it would be
possible to pre-load the cache, (find topdir -size 0 -print >/dev/null)
but whatever time might be saved in the later cvs is likely more than
consumed by the find (which is also going to need to hit the disc
for most inodes).

The vnode cache caches only vnodes that are used, not others in the same
disk block, but the buffer cache should be able to retain those blocks
so that a later reference to one of the other inodes that was already
read from the drive needn't cause a read again - provided that the buffer
cache is big enough, so if there's anything worth trying to alter, I'd
have expected it to be
vm.bufcache
vm.bufmem_lowater
vm.bufmem_hiwater
to try and make sure that all those inode-containing blocks are still
available when needed - ffs clusters inode numbers in a directory,
when it can, precisely so that this kind of buffer caching will be more
effective, so it isn't (or shouldn't be) necessary to keep everything for
the duration of the update.
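e.g. something like (the values are purely illustrative - vm.bufcache
is a percentage of physical memory, the watermarks are in bytes, and
what is sensible depends entirely on the machine and the tree):

    sysctl -w vm.bufcache=20
    sysctl -w vm.bufmem_hiwater=268435456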

kre



Re: pg_jobc going negative?

2020-06-10 Thread Robert Elz
Date:Tue, 9 Jun 2020 14:16:16 -0400
From:Christos Zoulas 
Message-ID:  

  | The FreeBSD refactoring LGTM. It also simplifies the code.

Sorry, been off net all day ... that may very well be the way to go,
but I'd like to understand what is happening with our current code
first, so we will know that the changes are fixing a problem, and not
just moving things around.

kre



Re: pg_jobc going negative?

2020-06-09 Thread Robert Elz
Date:Tue, 9 Jun 2020 17:04:54 +0200
From:Kamil Rytarowski 
Message-ID:  


  | Yes... syzkaller had like 12 different ways to reproduce it.

OK, thanks.

  | There is still a race and we randomly go to negative pg_jobc.

I am not at all surprised...

I will look at it over the next couple of days.   No guarantees...

kre



Re: pg_jobc going negative?

2020-06-09 Thread Robert Elz
Date:Tue, 9 Jun 2020 14:13:56 +0200
From:Kamil Rytarowski 
Message-ID:  <85d5e51f-afd1-1038-fd68-2366ff073...@netbsd.org>

  | Here is the simplest reproducer crashing the kernel on negative pg_jobc:

I have not looked at this closely yet, but this is likely because
ptrace() fiddles with p_pptr, which the routines that manipulate pg_jobc
more or less expect to be a constant.

Is there any known reproducer of this problem which does not involve ptrace() ?

At first glance, the manipulations of pg_jobc look a bit dodgy to me, but I
haven't investigated enough to be able to spot a definite problem yet
(possible ptrace() generated issue aside - and yes, those need to work as
well).

I doubt very much that adding a new mutex will make a difference, all the
manipulations are done with proc_lock held, which is kind of the "big lock"
for process manipulation - adding finer grained locking might improve
performance, by improving concurrency, but is unlikely (at this stage,
nothing is impossible) to be a fix for this problem.

kre



Re: sys/idtype.h unused enumeration values

2020-05-19 Thread Robert Elz
Date:Tue, 19 May 2020 15:14:21 +0200
From:Kamil Rytarowski 
Message-ID:  

  | I've abandoned the intention of changing these values (by adding
  | comments, renaming etc).

Good, thank you.

  | Once I will have spare time I might look into
  | implementing the missing ID types, but I don't promise to do it soon.
  | P_PSETID is possibly the easiest one.

That's an orthogonal issue - implementing some of them might be
useful (perhaps) others less so, but whether ever implemented or
not, their number in the ID space should be preserved.

kre



Re: sys/idtype.h unused enumeration values

2020-05-19 Thread Robert Elz
Date:Tue, 19 May 2020 14:12:31 +0200
From:Kamil Rytarowski 
Message-ID:  <6874bb63-5146-797f-98b7-b9c497677...@gmx.com>

  | Rationale for pointless?

There is no point.   What more can I say?

  | My points were:
  |
  |  - Clobbering OS that claims the goals of clean design and clean code
  | with mutant alien bodies without counterparts in the native kernel,
  | without request from any relevant standard body.

Having the extra entries is harmless, there is no point deleting them.

By all means, if you want, add a /* not implemented by NetBSD */ comment
to each of the appropriate ones, but deleting them with the intent that
they could (perhaps) be put back later is just making hard work - the
values of anything that is actually used need to be preserved for ABI
compat, if something occupies the slot of one of the ones to be replaced
later, then the replacement has to be given a different value, making
binary compat with other systems (emulations) much more complex.   And
all just to delete a meaningless line which is harming nothing...

  |  - Collecting garbage in public headers that is unused, misleading and
  | can at best be dummy.

So, mark it as unused/unimplemented, so it will no longer be misleading.
Or replace them with placeholder symbols "was_P_..." if you'd like to
detect at compile time code that might exist that won't work as intended.
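i.e. something of this shape (a sketch only - the entries and values
here are made up, not the real sys/idtype.h list):

    typedef enum {
            P_PID,                  /* 0: implemented */
            P_PGID,                 /* 1: implemented */
            was_P_TASKID,           /* 2: reserved, do not reuse */
            P_UID,                  /* 3: implemented */
            was_P_PROJID,           /* 4: reserved, do not reuse */
            P_ALL,                  /* 5: implemented */
    } idtype_t;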

Just don't reuse the numeric values that the deleted symbols occupied.

kre



Re: sys/idtype.h unused enumeration values

2020-05-18 Thread Robert Elz
Date:Mon, 18 May 2020 21:11:36 +0200
From:Kamil Rytarowski 
Message-ID:  <05255347-1c55-2762-aaf6-fec3caf48...@gmx.com>

  | Next, I can add my value at the end of list (and before _P_MAXIDTYPE).

Other than this, everything that you propose is pointless.   This one
you can do (assuming of course that there is a good reason for your new
entry, but there's no reason to doubt that now.)

kre



Re: sys/idtype.h unused enumeration values

2020-05-18 Thread Robert Elz
Date:Mon, 18 May 2020 19:45:55 +0200
From:Kamil Rytarowski 
Message-ID:  

  | I have got a local use-case for another P_type (premature to discuss it
  | in this thread) and I would rather recycle an unused value.

Don't do that, it is just a number, use one that hasn't been used
for this purpose before.

  | Do we plan to get Solaris feature-parity with all the types? I assume
  | that the answer is NO. If so, can we delete the P_ values that are not
  | applicable for NetBSD?

I have no problem with that - just don't reuse the values for some
different purpose, keep them reserved (assign them meaningless reserved
names) just in case there's ever a need to implement one of those things
(this is very very cheap insurance).

kre



Re: All (?) network tests failing

2020-04-04 Thread Robert Elz
Date:Sun, 5 Apr 2020 01:26:15 -0000 (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  

  | It could be due to tcsh doing its file descriptor dance differently...
  | What shell are you using?

When I run tests against HEAD, I use /bin/sh - the only other
possibilities are csh (which I gave up using decades ago, before
there was a tcsh) or /bin/ksh (of which our version has too many
"issues" to bother with).   I have nothing from pkgsrc installed
in test setups.

The b5 tests are the same I believe, simply build HEAD, install it,
and atf-run

kre



Re: All (?) network tests failing

2020-04-04 Thread Robert Elz
Date:Sat, 4 Apr 2020 16:37:08 +0300
From:Andreas Gustafsson 
Message-ID:  <24200.36228.881611.989...@guava.gson.org>

  | Does anyone have an idea why the tests didn't start failing
  | immediately when route.c 1.167 was committed, but only after the
  | seemingly unrelated openssl update?

Not an idea, but a possibility - the change to route.c (1.167) was
unimportant - it doesn't really matter (to the tests) if it does
anything useful or not - it is possible that it just happened that the
fd that the setsockopt() was being performed on was a socket (a suitable
socket) prior to the openssl update, but after that, the rump fd's
shifted around, and what the setsockopt() was operating upon was no
longer a socket.

No idea if that is really what happened or not, but something like that
is at least plausible (even though the chances of the syscall having
worked by accident would seem to be not very high).

kre



Re: Another option issue [was Re: Rump makes the kernel problematically brittle]

2020-04-04 Thread Robert Elz
Date:Sat, 4 Apr 2020 12:59:54 -0400 (EDT)
From:Mouse 
Message-ID:  <202004041659.maa21...@stone.rodents-montreal.org>

  | I added the #include to a long string of #include "opt_h" lines,
  | none of which are conditional on anything, in
  | sys/arch/amd64/amd64/machdep.c.

  | Of course, it's entirely possible nobody remembers enough from far
  | enough back

I don't remember much from as far back as yesterday, but what failed was
the XEN Dom0 kernel build (well, the "make depend" for that).   You say
you added things to sys/arch/amd64/amd64/machdep.c which makes me wonder
if (back then) something needed to be added in a similar place in
sys/arch/xen/somewhere

kre



Re: Rump makes the kernel problematically brittle

2020-04-02 Thread Robert Elz
Date:Thu, 2 Apr 2020 14:54:13 -0400 (EDT)
From:Mouse 
Message-ID:  <202004021854.oaa20...@stone.rodents-montreal.org>

  | Yes, I got a very nice and helpful off-list mail (thank you!) saying,
  | approximately, that I needed to have the #include of opt_autoconf.h
  | inside the _KERNEL_OPT conditional.

As you guessed, that was the issue I suspected.

Since you're just using an #ifndef that should be all that's needed,
for this issue anyway.

  | Is this documented anywhere?

You're putting documented and rump into the same thought space?

kre



Re: All (?) network tests failing

2020-04-02 Thread Robert Elz
Date:Mon, 30 Mar 2020 14:25:01 -0400
From:Christos Zoulas 
Message-ID:  <3d3ac2b9-5e6e-400c-9a4b-10742c90c...@zoulas.com>

  | All the tests are failing for you the same way:
  | rump.route: SO_RERROR: Socket operation on non-socket

Not all, but quite a few are.

This one I think is due to src/sbin/route/route.c 1.167

    sock = prog_socket(PF_ROUTE, SOCK_RAW, 0);
    if (setsockopt(sock, SOL_SOCKET, SO_RERROR,
        &on, sizeof(on)) == -1)
            warn("SO_RERROR");

where that setsockopt() was added.   I think that needs to be a prog_*
type call, so rump can do the right thing.   That will mean adding it
to prog_ops, and right now I don't have time to work out what the correct
magic is, but if no-one else does in the next day or so, I will take
another look.
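That is, roughly (a sketch - prog_setsockopt is the hypothetical
wrapper that adding setsockopt to the prog_ops table would provide):

    sock = prog_socket(PF_ROUTE, SOCK_RAW, 0);
    if (prog_setsockopt(sock, SOL_SOCKET, SO_RERROR,
        &on, sizeof(on)) == -1)
            warn("SO_RERROR");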

That should take care of the failing network related tests that contain
rump.route commands, but that's not all of the failing tests.

kre



Re: Rump makes the kernel problematically brittle

2020-04-02 Thread Robert Elz
Date:Thu, 2 Apr 2020 12:45:35 -0400 (EDT)
From:Mouse 
Message-ID:  <202004021645.maa22...@stone.rodents-montreal.org>

  | But the error makes me reasonably sure it's related to the defflag I
  | added to files.kern.

Perhaps.   I'd actually like to see the diff for this (related) one ...


usr/src/sys/kern/subr_autoconf.c
Include "opt_autoconf.h" and implement NO_DETACH_MESSAGES, to
suppress device-detached console spammage on shutdown.

as that's clearly where the problem occurs.   Rump doesn't do kernel
options the same way the kernel does, and care needs to be taken when
adding opt_*.h includes as those files won't generally exist in the rump
universe.

kre



Re: New tools proposal: ioctlname and ioctldecode

2020-04-02 Thread Robert Elz
Date:Thu, 2 Apr 2020 04:11:17 +0200
From:Kamil Rytarowski 
Message-ID:  

  | This is partially enforceable. As once we generate catchall switch like:
  |
  | case FOO_OP:
  | ...
  | case BAR_OP:
  | ...
  |
  | a compiler will report error whenever FOO_OP = BAR_OP.

That makes it easy to detect, not enforce.   Once detected that way,
what happens next?  Neither will (or can really) change as they both
have existing applications compiled that use them - in the worst case
the conflicts come from attempting to implement compat mode for someone
else's binaries, and support their existing ioctl's (which we obviously
cannot alter) - and do that for two different systems at once.

This is the same reason we cannot fix the few duplicates that exist in
our code - we want to retain backward binary compatibility, which means
supporting ancient binaries that happen to use those ioctl values.

Avoiding conflicts is (always has been) the aim - it allows drivers to
detect attempts to use an ioctl intended for some different device, rather
than believing it was intended for them, but there simply isn't, and can't
really be, any way to enforce that.

kre

ps: don't forget all the sockioctl()s which also need decoding.   Perhaps
even more than device ioctls.



Re: All (?) network tests failing

2020-03-31 Thread Robert Elz
Date:Mon, 30 Mar 2020 14:25:01 -0400
From:Christos Zoulas 
Message-ID:  <3d3ac2b9-5e6e-400c-9a4b-10742c90c...@zoulas.com>

  | All the tests are failing for you the same way:
  | rump.route: SO_RERROR: Socket operation on non-socket
  | I doubt that my gif change affected that. This smells to me like the
  | rump fd hijack is not working either because we have some new system
  | call involved or something is messing up the file descriptors.

If something has decided to move an fd out of the low number space
(not all that high necessarily) then rumphijack will confuse the fd
from user space with one of its own (it isn't very smart about that,
and bases the decision entirely upon the value of the fd it sees).

I wonder if something changed to try and be "nice" to other programs
by moving a "background" fd out of the 0..50 type space usually used by
user fd's to somewhere up > 100? (the fd space really runs up to the
thousands, but nothing we run ATF/Rump tests against ever needs more than
a small number of fd's, so they never naturally get out of the low number
area).
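As I understand it, the decision is no smarter than something like
this (the threshold is invented purely for illustration):

    #define RUMP_FD_BASE    128     /* hypothetical cut-off */

    static int
    fd_is_rump(int fd)
    {
            /* nothing but the numeric value is consulted */
            return fd >= RUMP_FD_BASE;
    }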

kre



Re: All (?) network tests failing

2020-03-31 Thread Robert Elz
Date:Mon, 30 Mar 2020 20:47:12 -0400
From:Christos Zoulas 
Message-ID:  

  | Unfortunately they still work for me after a clean build. I am going to
  | try to download a standard build...

Does your tree have any uncommitted changes?

(I see the same 200+ tests failing as everyone else seems to see, on
amd64 (I do my tests in a XEN DomU), but I note b5 is seeing the same
on i386 (at least) as well).

kre



Re: SIGCHLD set to SIG_DFL on exec(3)

2020-02-08 Thread Robert Elz
Date:Sat, 8 Feb 2020 16:47:42 +0100
From:Kamil Rytarowski 
Message-ID:  

  | We are allowed to fix this in the kernel for everybody:

Indeed we are.   And if you want to change things that way, that's fine.

It turns out this one wasn't the actual problem in this instance, so
doing that would have made no difference...   The issue was with
SIGCHLD blocked, and that one POSIX doesn't allow us to undo.

Certainly, as the extract you quoted from the standard says, exec()ing
a new process with "unusual" signals blocked can lead to weird behaviour,
as almost nothing races around unblocking (etc) signals as part of the
startup code.   It doesn't help that SIGCHLD is decidely weird (thanks
to Sys V) with its bizarre side effects.

It seems likely that our shell actually never had a problem (though would
have, had SIGCHLD been ignored) with this, as we don't actually use
SIGCHLD for anything at all.   Why some other (well, apparently all
other) ash derived shells do is not clear to me - though over the years
our internal sh wait(1) code has been rewritten quite frequently it seems.

I would however keep the new code that sets SIGCHLD to SIG_DFL, and
unblocks it (always, regardless of its state at entry) even if the
kernel were changed as you suggest above, as one day I'd like to make
the netbsd sh (like several other programs we have) exportable to others
to use, and we cannot control what their kernels do.

We did have (well, I introduced) a problem when the shell inherited children
from a previous executable to occupy the process; if that child completed
at the "wrong" time we would dump core (which was also an aspect of this
issue) - which was trivial to fix once the possibility even occurred to me.
It isn't something that happens often.

Finally note that it took a particularly bizarre piece of bash code to make
bash exec another shell (or for that matter, any other process) in this way,
and that it did is most likely a bug in it (ie: not intentional) and
will probably get fixed, this is a very rare problem, with a very simple
fix, which really affects almost nothing, so I wouldn't spend a lot of time
worrying about other possible solutions.

kre



Re: Rump dependencies (5.2)?

2020-01-12 Thread Robert Elz
Date:Sun, 12 Jan 2020 22:36:23 -0500 (EST)
From:Mouse 
Message-ID:  <202001130336.waa17...@stone.rodents-montreal.org>

  | I can't recall ever wanting its functionality,

It is used mostly by a lot of the ATF tests.

  | and trying to figure out what the dependency graph is
  | when it exists only implicitly in Makefiles scattered all over the tree
  | sounds like a recipe for serious headaches.

That it is.

kre



Re: __{read,write}_once

2019-11-21 Thread Robert Elz
Date:Fri, 22 Nov 2019 01:04:56 +0100
From:Kamil Rytarowski 
Message-ID:  <1a9d9b40-42fe-be08-d7b3-e6ecead5b...@gmx.com>


  | I think that picking C11 terminology is the way forward.

Use a name like that iff the intent is to also exactly match the
semantics implied, otherwise it will only create more confusion.

kre



Re: __{read,write}_once

2019-11-21 Thread Robert Elz
Date:Thu, 21 Nov 2019 19:19:51 +0100
From:Maxime Villard 
Message-ID:  

  | So in the end which name do we use? Are people really unhappy with _racy()?
  | At least it has a general meaning, and does not imply atomicity or ordering.

I dislike naming discussions, as in general, while there are often a
bunch of obviously incorrect names (here for example, read_page_data() ...)
there is very rarely one obviously right answer, and it all really just becomes
a matter of choice for whoever is doing it.

Nevertheless, perhaps something that says more about what is actually happening,
rather than mentions what doesn't matter (races here), so perhaps something
like {read,write}_single_cycle() or {read,write}_1_bus_xact() or something
along those lines ?

kre



Re: __{read,write}_once

2019-11-11 Thread Robert Elz
Date:Mon, 11 Nov 2019 21:32:06 +0100
From:Joerg Sonnenberger 
Message-ID:  <2019203206.ga4...@bec.de>

  | The update needs to be uninterruptable on the local CPU in the sense that
  | context switches and interrupts don't tear the R and W part of the RMW cycle
  | apart.

If I understood the original message, even that was not required.

All that was really demanded was that each of the R and W parts
not be split into pieces where an intervening interrupt might alter
the "other" part, allowing a read to receive, or a write to produce,
a value that neither the baseline code, nor the interrupt routine
ever intended.

But in
x += 2;
if an interrupt intervenes and does
x += 1;
it is acceptable for the eventual result to be old_x + 1 (which is very
unlikely), old_x + 2, or old_x + 3, but only those.

That's how I understood the "racy is OK" part of it.  Ie: we don't care
about race conditions like this.

What we don't want is if x started out at 0x0000ffff and is read in 2
halves, for the base code to read the 0x0000 top half, then the interrupt
to change the value to 0x00010001, and then the base code to read 0x0001
for the bottom half, leading to a result of 1 in x, rather than 0x0000ffff
or 0x00010001, which is not acceptable (or anything else like that, and
similarly for writes).

Or in other words, it is acceptable for some uses for the occasional
change to be lost (particularly for some event counters) if it is not
likely to happen frequently, but not for entirely bogus values to appear.
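For naturally aligned word-sized objects the usual way to get the
compiler to emit one untorn load or store is a volatile access, along
these lines (a sketch only - whether this matches the semantics chosen
for the real __{read,write}_once is exactly the open question):

    #define __read_once(x)          (*(const volatile __typeof__(x) *)&(x))
    #define __write_once(x, v)      (*(volatile __typeof__(x) *)&(x) = (v))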

kre


