Re: [RFC] new APIs to use wskbd(4) input on non-wsdisplay tty devices

2024-04-12 Thread Thor Lancelot Simon
On Fri, Apr 12, 2024 at 09:13:17AM -0400, Thor Lancelot Simon wrote:
> On Sat, Apr 06, 2024 at 11:56:27PM +0900, Izumi Tsutsui wrote:
> > 
> > I'd like to add new APIs to use wskbd(4) input on non-wsdisplay
> > tty devices, especially news68k that can use a putchar function
> > provided by firmware PROM as a kernel console device.
> 
> Wouldn't a tty be a better abstraction for this? Lots of minicomputers
> had incredibly dumb serial consoles.

I think the above is not clear.  To rephrase my question: what is the
wscons layer actually adding here?  Would it not be simpler to interface
at the tty layer instead?

Thor


Re: [RFC] new APIs to use wskbd(4) input on non-wsdisplay tty devices

2024-04-12 Thread Thor Lancelot Simon
On Sat, Apr 06, 2024 at 11:56:27PM +0900, Izumi Tsutsui wrote:
> 
> I'd like to add new APIs to use wskbd(4) input on non-wsdisplay
> tty devices, especially news68k that can use a putchar function
> provided by firmware PROM as a kernel console device.

Wouldn't a tty be a better abstraction for this? Lots of minicomputers
had incredibly dumb serial consoles.



Re: Forcing a USB device to "ugen"

2024-03-26 Thread Thor Lancelot Simon
On Tue, Mar 26, 2024 at 12:25:07AM +, Taylor R Campbell wrote:
> 
> We should really expose a /dev/ugen* instance for _every_ USB device;
> those that have kernel drivers attached have only limited access via
> /dev/ugen* (no reads, writes, transfer ioctls, ...), until you do
> ioctl(USB_KICK_OUT_KERNEL_DRIVER) or whatever, at which point the
> kernel driver will detach and the user program can take over instead
> and use the full ugen(4) API.

I don't think this can be safely allowed at security level > 0, unless,
perhaps, it's restricted from working on devices that would match disk
drivers.



Re: PVH boot with qemu

2023-12-18 Thread Thor Lancelot Simon
On Mon, Dec 11, 2023 at 10:22:18AM +0100, Emile `iMil' Heitor wrote:
> 
> Or would you prefer the same kernel to be able to boot in both XENPVH and
> GENPVH modes? I am focusing on making the resulting kernel smaller but this
> could be done also.

Yes, the same kernel should be able to boot on both hypervisor platforms.
To do otherwise confounds users' expectations, provides less functionality
than other operating systems, and seems to me like a poor default.



Re: Maxphys on -current?

2023-08-05 Thread Thor Lancelot Simon
On Fri, Aug 04, 2023 at 09:48:18PM +0200, Jaromír Doleček wrote:
> 
> For the branch, I particularly disliked that there were quite a few
> changes which looked either unrelated, or avoidable.

There should not have been any "unrelated" changes.  I would not be
surprised if there were changes that could have been avoided.

It has been a very, very long time, but I think there are a few things
worth noting that I discovered in the course of the work I did on this
years ago.

1) It really is important to propagate maximum-transfer-size information
   down the bus hierarchy, because we have many cases where the same device
   could be connected to a different bus.

2) RAIDframe and its ilk are tough to get right, because there are many ugly
   corner cases such as someone trying to replace a failed component of a
   RAID set with a spare that is attached via a bus that has a smaller
   transfer size limit.  Ensuring both that this doesn't happen and that
   errors are propagated back correctly is pretty hard.  I have a vague
   recollection that this might be one source of the "unrelated" changes you
   mention.

3) With MAXPHYS increased to a very large value, we have filesystem code that
   can behave very poorly because it uses naive readahead or write clustering
   strategies that were previously held in check only by the 64K MAXPHYS
   limit.  I didn't even make a start at handling this, honestly, and the
   apparent difficulty of getting it right is one reason I eventually decided
   I didn't have time to finish the work I started on the tls-maxphys branch.
   Beware!  Don't trust linear-read or linear-write benchmarks that say your
   work in this area is done.  You may have massacred performance for real
   world use cases other than your own.

One thing we should probably do, if we have not already, is remove any ISA
DMA devices and some old things like the wdc and pciide IDE attachments from
the GENERIC kernels for ports like amd64, and then bump MAXPHYS to at least
128K, maybe 256K, for those kernels.  Beyond that, though, I think you will
quickly see the filesystem and paging misbehaviors I mention in #3 above.

Thor


Re: How to submit patches?

2023-05-07 Thread Thor Lancelot Simon
On Sat, May 06, 2023 at 12:12:54PM +0200, tlaro...@polynum.com wrote:
> 
> How to submit patches without wasting time? (mine included)

It might be that you get quicker response on one of the mailing lists
for platforms where the patches are particularly useful.  It might not,
too - but the set of people with the knowledge to review work in this area
is not so large, and copying the per-port lists might help get their
attention.

-- 
 Thor Lancelot Simon t...@panix.com
"Somehow it works like a joke, but it makes no sense."
--Gilbert Gottfried


Re: Comparison of different-width ints in ixg(4)

2022-10-09 Thread Thor Lancelot Simon
On Sun, Oct 09, 2022 at 09:24:32AM -0500, Mario Campos wrote:
> On Sat, Oct 8, 2022 at 1:00 PM Taylor R Campbell
>  wrote:
> >
> > > Date: Sat, 8 Oct 2022 10:58:58 -0500
> > > From: Mario Campos 
> > >
> > > I ran a SAST tool, CodeQL, against trunk and found a couple of
> > > instances (below) where the 16-bit integer `i` is compared to the
> > > 32-bit integer `max_rx_queues` or `max_tx_queues` in ixg(4). If
> > > `max_rx_queues` (or `max_tx_queues`) is sufficiently large, it could
> > > lead to an infinite loop.
> > >
> > > sys/dev/pci/ixgbe/ixgbe_vf.c:280
> > > sys/dev/pci/ixgbe/ixgbe_vf.c:284
> > > sys/dev/pci/ixgbe/ixgbe_common.c:1158
> > > sys/dev/pci/ixgbe/ixgbe_common.c:1162
> >
> > Cool.  I don't think this case is a bug because the quantities in
> > question are bounded by IXGBE_VF_MAX_TX/RX_QUEUES, which are both 8.
> > But it would be reasonable to use u32 or even just unsigned for this.
> > Did this tool turn anything else up?
> 
> Ah, great! I also think it would still be an improvement, if only as a
> means of defensive programming should IXGBE_VF_MAX_TX/RX_QUEUES ever
> be increased.

So, to me this looks like a bit of a lesson on the importance of
understanding what the program being analyzed does, when interpreting
the results of automated analysis tools.

This is a driver for a hardware device.  The variables in question are
used to describe the number of queues available from the hardware.  These
queues are typically allocated per CPU core on the hosting system, at
most, and are definitely bounded by the properties of the underlying
device hardware.  It is not plausible that this hardware device would ever
exceed 16 bits' worth of distinct queues (as Taylor notes, 8 is a typical
value).  The use of a 16-bit variable to hold the value likely reflects
the underlying hardware stuffing two values into a single 32-bit register.

Which is not to say this shouldn't be fixed - it should.  But if the
tool flagged other results, they are far more likely to reflect actual
bugs that could impact a system in the real world!

Thor


Re: MP-safe /dev/console and /dev/constty

2022-10-02 Thread Thor Lancelot Simon
On Sat, Oct 01, 2022 at 07:59:23PM -0400, Mouse wrote:
>
> usual approach to such things.  Your suggestion of pushing it into a
> separate function (which presumably would just mean using return
> instead of break to terminate the code block) strikes me as worth
> considering in general but a bad idea in this case; there are too many
> things that would have to be passed down to the function in question.

Of course, GCC offers nested functions for exactly this, but...

Thor


Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Thor Lancelot Simon
On Sun, Apr 04, 2021 at 11:02:02PM +, Taylor R Campbell wrote:
> 
> Lots of SoCs have on-board RNGs these days; there are Intel and ARM
> CPU instructions (no ARMv8.5 hardware yet that I know of, but we're
> ready for its RNG!); some crypto decelerators like tpm(4), ubsec(4),
> and hifn(4) have RNGs; and there are some dedicated RNG devices like
> ualea(4).

Can we actually use the TPM RNG from in-kernel?  Whether we should is a
different, interesting question, given how it is typically implemented.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Thor Lancelot Simon
On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> At Mon, 5 Apr 2021 23:18:55 -0400, Thor Lancelot Simon  wrote:
> 
> > But what you're missing is that neither does what you
> > think.  When rndctl -L runs after the system comes up multiuser, all
> > entropy samples that have been added (which are in the per-cpu pools)
> > are propagated to the global pool.  Every stream RNG on the system then
> > rekeys itself - they are _not_ just using the entropy from the seed on
> > disk.  Even if nothing does so earlier, when rndctl -S runs as the system
> > shuts down, again all entropy samples that have been added (which, again,
> > are accumulating in the per-cpu pools) are propagated to the global pool;
> > all the stream RNGs rekey themselves again; then the seed is extracted.
> 
> That's all great, and more or less what I've assumed from all the
> previous discussion
> 
> Except it seems to be useless in practice without an initial seed,

Again there's really little I can do other than suggest you read the code.
You are certainly competent to do so, and the code does not do what you
keep claiming it does.  Read the code, all of it -- it's only a few hundred
lines -- and have a think.

When rndctl -L runs, or you perform a sufficiently long write to /dev/random,
all the per-CPU pools, which, counter to what you keep claiming, *do* accumulate
samples from all the same sources they used to, are coalesced into the global
pool.  When rndctl -S runs, all the per-CPU pools, which, counter to what you
keep claiming, *do* accumulate samples from all the same sources they used to,
are coalesced into the global pool.  If you'd like those samples coalesced
into the global pool more frequently, you can use the sysctl to do so.

Thor


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Thor Lancelot Simon
On Mon, Apr 05, 2021 at 02:13:31PM -0700, Greg A. Woods wrote:
> At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon  wrote:
> Subject: Re: regarding the changes to kernel entropy gathering
> >
> > On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> > >
> > > BTW, to me reusing the same entropy on every reboot seems less secure.
> >
> > Sure.  But that's not what the code actually does.
> >
> > Please, read the code in more depth (or in this case, breadth), then argue
> > about it.
> 
> Sorry, I was alluding to the idea of sticking the following in
> /etc/rc.local as the brain-dead way to work around the problem:
> 
>   echo -n "" > /dev/random
> 
> However I have not yet read and understood enough of the code to know
> if:
> 
>   dd if=/dev/urandom of=/dev/random bs=32 count=1

It's no better.  But what you're missing is that neither does what you
think.  When rndctl -L runs after the system comes up multiuser, all
entropy samples that have been added (which are in the per-cpu pools)
are propagated to the global pool.  Every stream RNG on the system then
rekeys itself - they are _not_ just using the entropy from the seed on
disk.  Even if nothing does so earlier, when rndctl -S runs as the system
shuts down, again all entropy samples that have been added (which, again,
are accumulating in the per-cpu pools) are propagated to the global pool;
all the stream RNGs rekey themselves again; then the seed is extracted.

It is neither the case that samples added with a 0 entropy estimate go
nowhere, nor that they do not add entropy to the seed file such that it
is _not_ "reusing the same entropy on every boot".

If you'd like to propagate samples from the per-CPU pool to the global
pool and force the stream generators to rekey more often, you can
sysctl -w kern.entropy.consolidate=1 from cron.



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Thor Lancelot Simon
On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> 
> BTW, to me reusing the same entropy on every reboot seems less secure.

Sure.  But that's not what the code actually does.

Please, read the code in more depth (or in this case, breadth), then argue
about it.



Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Thor Lancelot Simon
On Mon, Apr 05, 2021 at 09:30:16AM -0700, Greg A. Woods wrote:
> At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer  
> wrote:
> Subject: Re: regarding the changes to kernel entropy gathering
> >
> > If I understood it properly, there's no need for such a knob.
> > echo 0123456789abcdef0123456789abcdef > /dev/random
> >
> > will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> > collected from devices.
> 
> Well, no, not quite so much randomness.  Definitely pseudo though!
> 
> My patch on the other hand can at least inject some real randomness into
> the entropy pool, even if it is observable or influenceable by nefarious
> dudes who might be hiding out in my garage.

No.  You are confused.

All those inputs are *already* being injected into the entropy pool.  If you
don't understand that, you need to read the code more.

All echoing crap into /dev/random does is goose the system's entropy estimate
so it will give you the _output_ of the pool when it thought it shouldn't yet.

Thor


Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Thor Lancelot Simon
On Sun, Apr 04, 2021 at 01:08:20PM -0700, Greg A. Woods wrote:
> 
> I trust the randomness and in-observability and isolation of the
> behaviour of my system's fans far more than I would trust Intel's RDRAND
> or RDSEED instructions.

I do not.  However, I do differ with Taylor in that I believe that system
fans are a very good example of a case where there is a well understood
_physical_ basis -- turbulence -- for the existence of the entropy being
collected, and that we should count it, in a very conservative way.

If your system has an audio device, and you mute all the inputs and turn
the gain all the way up, then feed that in as entropy samples (we do not
currently have a device driver to do this, but you could from userspace)
I'd make the same argument, since if you don't get all-zeroes, you're then
sampling the amplifier noise.
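
Purely as an illustration of what I mean -- nothing like this exists in the
tree, /dev/audio is an arbitrary choice, and the mixer setup (muting the
inputs, raising the gain) is omitted -- the userland side could be as small
as:

    /*
     * Hedged sketch only: read raw samples from the audio device and
     * feed them into the pool by writing to /dev/random.  How much
     * entropy such writes should be credited is exactly the question
     * under discussion.
     */
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[512];
        ssize_t n;
        int audio, rnd;

        if ((audio = open("/dev/audio", O_RDONLY)) == -1)
            err(1, "/dev/audio");
        if ((rnd = open("/dev/random", O_WRONLY)) == -1)
            err(1, "/dev/random");

        for (;;) {
            if ((n = read(audio, buf, sizeof(buf))) <= 0)
                err(1, "read");
            if (write(rnd, buf, (size_t)n) != n)
                err(1, "write");
            sleep(1);    /* sample occasionally, not continuously */
        }
    }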

I also think there is a case to be made for skew sources since measuring
independently sourced clocks against one another is a basic technique in
many hardware RNGs; but I am not sure the one we have is good enough.

It's important to note that most hardware RNG designs *do not* come with
a real physics model alleging to _compute_ their underlying entropy; rather
they come with an explanation in physical terms of where the entropy comes
from, an explanation of how it is sampled (since this can impact the degree
to which samples are truly independent of one another), and _empirically
derived_ estimates of the minimum expected output entropy.  These are usually
obtained by taking the output data prior to any hashing or "whitening" stage
and running specialized statistical tests, compressing it, etc.

I would _personally_ support counting the entropy from environmental sources,
certainly fans but possibly also thermal sensors, if someone wanted to do the
work of characterizing it in a way which is as rigorous as what's done for
commercial RNGs (the Hifn 799x paper on this is a relatively simple worked
example) _and_ empirically general across a wide array of systems.  And I
think the audio thing is worth exploring.

There is also Peter Gutmann's view that what matters is not really what
mathematicians call "entropy" but, really, the work required for an adversary
to predict the output.  This line of thought can lead to very different
views of what bits should count, on a per-system basis; and I think it would
not be wrong to let the system administrator override this with rndctl,
source by source, as previously they could - but the problem remains that
nobody knows _when_ to count such samples and it is very hard to know you
are not overcounting.



Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-31 Thread Thor Lancelot Simon
On Wed, Mar 31, 2021 at 11:24:07AM +0200, Manuel Bouyer wrote:
> On Tue, Mar 30, 2021 at 10:42:53PM +, Taylor R Campbell wrote:
> > 
> > There are no virtual RNG devices on the system in question, according
> > to the quoted `rndctl -l' output.  Perhaps the VM host needs to be
> > taught to expose a virtio-rng device to the guest?
> 
> There is no such thing in Xen.

Is the CPU so old that it doesn't have RDRAND / RDSEED, or is Xen perhaps
masking these CPU features from the guest?

Thor


Re: Set USB device ownership based on vedorid/productid

2021-02-15 Thread Thor Lancelot Simon
On Mon, Feb 15, 2021 at 12:15:28PM -0500, Mouse wrote:
> > Does NetBSD provide any framework that allows USB device
> > ownership/permissions to be autmatically set on USB
> > VendorId/DeviceId?
> 
> As far as I know it doesn't; a quick look at 8.0's manpages didn't show
> me anything.  (9.1 is not easy for me to check right now; today I'm on
> the wrong job for that.)

I don't think so, because these IDs aren't exposed to the right layer
of the autoconf machinery.  They're used internally in the driver match
routines.  Hoisting them up to the config file seems like a big task and
in some ways a regression.

But I wonder if, for the "generic" drivers like ugen, they could be
added as optional locators, passed down into ugen's match/attach so
a specific ugen instance could be pinned down to a specific vendor and
device ID.  Then you'd get, let's say, ugen9 whenever you had vid 0x5150
did 0x1337, and you could adjust the permissions on /dev/ugen9 accordingly
in /dev and away you'd go.
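
Just to make the shape of the idea concrete -- none of this exists, the
UGENCF_* locator names are hypothetical, and the real ugen match logic is
more involved -- the match routine might look something like:

    /*
     * Hypothetical sketch only.  Assumes config(5) grew optional
     * "vendor" and "product" locators for ugen, e.g. a kernel config
     * line like "ugen9 at uhub? vendor 0x5150 product 0x1337", with
     * UGENCF_VENDOR/UGENCF_PRODUCT and their *_DEFAULT wildcards as
     * the auto-generated locator names.
     */
    static int
    ugen_match_pinned(device_t parent, cfdata_t cf, void *aux)
    {
        struct usb_attach_arg *uaa = aux;

        if (cf->cf_loc[UGENCF_VENDOR] != UGENCF_VENDOR_DEFAULT &&
            cf->cf_loc[UGENCF_VENDOR] != uaa->uaa_vendor)
            return UMATCH_NONE;
        if (cf->cf_loc[UGENCF_PRODUCT] != UGENCF_PRODUCT_DEFAULT &&
            cf->cf_loc[UGENCF_PRODUCT] != uaa->uaa_product)
            return UMATCH_NONE;

        /* A pinned instance should win over the plain generic match. */
        return UMATCH_VENDOR_PRODUCT;
    }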

I can see how this would work for static config but I am not really
up to date enough on the modern world of NetBSD kernel modules and boot
time configuration to understand how or even whether it could work well
there.

Thor



Re: partial failures in write(2) (and read(2))

2021-02-10 Thread Thor Lancelot Simon
On Fri, Feb 05, 2021 at 08:10:06PM -0500, Mouse wrote:
> > It is possible for write() calls to fail partway through, after
> > already having written some data.
> 
> It is.  As you note later, it's also possible for read().
> 
> The rightest thing to do, it seems to me, would be to return the error
> indication along with how much was successfully written (or read).  But
> that, of course, requires a completely new API, which I gather is more
> intrusive than you want to get into here.

I think it could be done with a signal in combination with the existing
API.

Thor


Re: /dev/random issue

2020-10-02 Thread Thor Lancelot Simon
On Thu, Oct 01, 2020 at 06:11:20PM +0200, Martin Husemann wrote:
> On Thu, Oct 01, 2020 at 05:57:12PM +0200, Manuel Bouyer wrote:
> > Source Bits Type  Flags
> > /dev/random   0 ???  estimate, collect, v
> [..]
> > seed  0 ???  estimate, collect, v
> 
> No random number generator and you did not seed the machine.

I still firmly believe that the fan sensor, at least, should be
counting bits by default -- there is an obvious, random physical
process (turbulence) involved.

That's not likely to get you enough bits to move forward,
though.

> 
> On another machine with working random number generator (nearly all modernish
> amd64 machines have that) do:
> 
>   dd if=/dev/random of=/tmp/file bs=32 count=1
> 
> then scp the file over and dd it into /dev/random:
> 
>   dd if=/tmp/file of=/dev/random bs=32 count=1
> 
> This will be preserved accross reboots, so it is a one-time only fix.
> 
> Martin

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: "wireguard" implementation improperly merged and needs revert

2020-08-23 Thread Thor Lancelot Simon
On Sat, Aug 22, 2020 at 08:35:39PM +0200, Jason A. Donenfeld wrote:
> 
> In its current form, there are implementation flaws and violations
> that I do not consider acceptable, and deploying this kind of thing is
> highly irresponsible and harmful to your users.

Can you please explain what these (heck, what even _one_ of these) are?

As far as I know the code was written following the published documentation.
That's how Internet protocol development works, is it not?

Thor


Re: /dev/crypto missing

2020-07-27 Thread Thor Lancelot Simon
On Tue, Jul 28, 2020 at 01:35:53AM +, Taylor R Campbell wrote:
> 
> /dev/crypto is totally obsolete as it exists today.  Really the only
> reason it continues to exist is to test opencrypto drivers from
> userland before using them in the kernel.

This is not really the case.  The OpenSSL project has *finally* made
the changes to their core TLS state machine required to take advantage
of asynchronous crypto via device driver in a performant way.  It would
now be possible, with a better /dev/crypto ENGINE in OpenSSL, to actually
get a pretty good performance bump from hardware acceleration on a number
of platforms.

Unfortunately, roughly contemporaneously with so doing, they also managed to
rewrite their own /dev/crypto engine to a weird variant of the Linux /dev/crypto
API, ignoring the significant enhancements we added in NetBSD about 15
years ago (multiple request submission/retrieval and asynchronous
operation).  This is particularly frustrating to me since, back then, we
(Coyote Point and NBMK) sent them patches for both parts of the puzzle... 

Anyhow, it's no longer the case that OpenSSL structurally _couldn't_ use
/dev/crypto efficiently.  But it'd take a second rewrite of their new
devcrypto ENGINE to make it do so.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: AES leaks, cgd ciphers, and vector units in the kernel

2020-06-27 Thread Thor Lancelot Simon
This is a *huge* effort.  Thank you.

On Sun, Jun 28, 2020 at 03:27:56AM +, Taylor R Campbell wrote:
> > Date: Mon, 22 Jun 2020 23:43:20 +
> > From: Taylor R Campbell 
> > 
> > There is some more room for improvement -- SSSE3 provides PSHUFB which
> > can sequentially speed up parts of AES, and is supported by a good
> > number of amd64 CPUs starting around 14 years ago that lack AES-NI --
> > but there are diminishing returns for increasing implementation and
> > maintenance effort, so I'd like to focus on making an impact on
> > systems that matter.  (That includes non-x86 CPUs -- e.g., we could
> > probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would
> > like to focus on systems where there is demand.)
> 
> I drafted derivatives of Mike Hamburg's vpaes code using Intel SSSE3
> and using ARM NEON / aarch64 SIMD.  In principle the ARM NEON code
> should work on armv7, but I have only compile-tested it there, and
> there are a few kinks to be worked out before it can be used in the
> kernel on armv7.
> 
> I pushed it to the riastradh-kernelcrypto topic on hg src-draft, and I
> updated the userland aestest utility if you want to get a rough idea
> of the performance without updating your kernel (see previous message
> for usage instructions):
> 
> https://www.NetBSD.org/~riastradh/tmp/20200627/aestest.tgz
> 
> The summary of the patch set now is (kernel only -- no userland
> changes):
> 
> - every architecture gets constant-time AES, with BearSSL's aes_ct
>   32-bit bitsliced implementation -- there is no more vulnerable AES
>   code in the NetBSD kernel, although there is a substantial
>   performance hit on many platforms
> 
> - every architecture gets new cgd(4) support for Adiantum, which is
>   generally as fast as or faster than AES-CBC and AES-XTS were before
>   and provides better security (and has lots of room to be sped up;
>   any speedups would also be applicable to other purposes too, like
>   Wireguard)
> 
> - most high-end x86 of the past decade gets much much faster AES with
>   AES-NI CPU support (no 32-bit yet)
> 
> - almost all x86 of the past decade gets faster or much faster AES
>   with a vpaes-style SSSE3-based implementation (32-bit included)
> 
> - most x86 of the past two decades, including all amd64, mitigates the
>   performance hit with a bitsliced SSE2-based implementation (32-bit
>   included)
> 
> - VIA gets much faster AES with VIA ACE (for all users in the kernel,
>   including cgd, not just those that use opencrypto as we had before
>   with the via_padlock.c driver)
> 
> - almost all aarch64 (except rpi) gets much much faster AES with
>   ARMv8.0-AES CPU support
> 
> - 64-bit rpi (and, with a little more work, armv7 with NEON) mitigates
>   the performance hit -- and may get faster -- with a vpaes-style
>   NEON-based implementation
> 
> Some other CPUs like modern POWER have AES CPU instructions these days
> too.  The vpaes approach could probably be adapted to PowerPC Altivec,
> and maybe some other vector units I'm not as familiar with (MIPS SIMD
> Architecture, MSA?).  BearSSL's aes_ct64 64-bit bitsliced
> implementation might be worth adopting for 64-bit CPUs without a
> vector unit, if anyone cares -- maybe alpha or mips64.  But I think
> I'm at the limit of what I'm willing to do for fun with the hardware I
> have easy access to.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: makesyscalls

2020-06-09 Thread Thor Lancelot Simon
Python is essentially uncrosscompilable and its maintainers have repeatedly
rudely rejected efforts to make it so.

If that weren't the case, and the way installed Python modules were "managed"
(I use the term liberally) were made sane, I'd think it were a fine thing to
use in base.  But it is the case, and that won't be made sane, and so I think
it belongs nowhere near NetBSD.  Could you translate your prototype into a
different language, one that's less basically hostile to our build system
and its goals and benefits?



Re: KAUTH_SYSTEM_UNENCRYPTED_SWAP

2020-05-18 Thread Thor Lancelot Simon
On Mon, May 18, 2020 at 09:08:14PM +0100, Alexander Nasonov wrote:
> matthew green wrote:
> > what's the use-case for disabling encrypted swap later?
> 
> It might be too slow on some machines.
> 
> > i'd argue we should avoid kauth for this and simply disable
> > it always as i've been unable to think of any use case that
> > is the only solution.
> 
> Always encrypted swap would be even better but ... slow machines.

Compared to the time required to put the pages out to disk?

Could try chacha8.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Initial entropy with no HWRNG

2020-05-12 Thread Thor Lancelot Simon
On Tue, May 12, 2020 at 10:00:20AM +0300, Andreas Gustafsson wrote:
>
> Adding more sources could mean
> reintroducing some timing based sources after careful analysis, but
> also things like having the installer install an initial random seed
> on the target machine (and if the installer itself lacks entropy,
> asking the poor user to pound on the keyboard until it does).

As Peter Gutmann has noted several times in the past, for most use cases
you don't have to have input you know are tied to physical random processes;
you just have to have input you know it's uneconomical for your adversary
to predict or recover.

This is why I fed the first N packets off the network into the RNG; why I
added sampling of kernel printf output (and more importantly, its *timing*);
etc.  But the problem with this kind of stuff is that there really are use
cases where an adversary _can_ know these things, so it is very hard to
support an argument that _in the general case_ they should be used to
satisfy some criterion that N unpredictable, unrecoverable (note I do
*not* say "random" here!) bits have been fed into the machinery.  The data
I fed in from the VM system are not quite the same, but in a somewhat similar
situation.

That said, I also added a number of sources which we *do* know are tied to
real random physical processes: the environmental sensors such as temperature,
fan speed, and voltage, where beyond the sampling noise you've got thermal
processes on both micro and macro scales, turbulence, etc; and the "skew"
source type which, in theory, represents skew between multiple oscillators
in the system, one of the hybrid analog-digital RNG designs with a long
pedigree (though as implemented in the example "callout" source, less-so).

Finally, there's a source type I *didn't* take advantage of because I was
advised doing so would cause substantial power consumption: amplifier noise
available by setting muted audio inputs to max gain (we can also use the
sample arrival rate here as a skew source).

I believe we can and should legitimately record entropy when we add input
of these kinds.  But there are three problems with all this.

*Problems are marked out with numbers, thoughts towards solutions or
 mitigations with letters.*

1) It's hard to understand how many bits of entropy to assign to a sample from
   one of these sources.  How much of the change in fan speed is caused by
   system load as a factor (and thus highly correlated with CPU temperature),
   and how much by turbulence, which we believe is random?  How much of the
   signal measured from amplifier noise on a muted input is caused by the
   bus clock (and clocks derived from it, etc.) and how much is genuine
   thermal noise from the amplifier?  And so forth.

   The delta estimator _was_ good for these things, particularly for things
   like fans or thermistors (where the macroscopic, non-random physical
   processes _are_ expected to have continuous behavior), because it could
   tell you when to very conservatively add 1 bit.  If you believe that at
   least 1 bit of each 32-bit value from the input really is attributable to
   entropy.  I also prototyped an lzf-based entropy estimator, but it never
   really seemed worth the trouble -- it is, though, consistent with how
   the published analysis of physical sources often estimates minimum
   entropy.

   A) This is a longwinded way of saying I firmly believe we should count
  input from these kinds of sources towards our "full entropy" threshold
  but need to think harder about how.

2) Sources of the kind I'm talking about here seldom contribute _much_
   entropy - with the old estimator, perhaps 1 bit per change - so if you
   need to get 256 bits from them you may be waiting quite some time (the
   audio-amp sources might be different, which is another reason despite
   their issues, they are appealing).
  
3) Older or smaller systems don't have any of this stuff onboard so it does
   them no good: no fan speed sensors (or no drivers for them), no temp
   sensors, no access to power rail voltages, certainly no audio, etc.

B) One thing we *could* do to help out such systems would be to actually run
   a service to bootstrap them with entropy ourselves, from the installer,
   across the network.  Should a user trust such a service?  I will argue
   "yes".  Why?

B1) Because they already got the binaries or the sources from us; we could
simply tamper those to do the wrong thing instead.

Counterargument: it's impossible to distinguish the output of a
cryptographically-strong stream cipher keyed with something known
to us from real random data, so it's harder to _tell_ if we subverted
you.

Counter-counter-argument: When's the last time you looked?  Users
who _do_ really inspect the sources and binaries they get from us
can always not use our entropy server, or run their own.

B2) Because we have already arranged to mix in a whole 

Re: Rump makes the kernel problematically brittle

2020-04-02 Thread Thor Lancelot Simon
On Thu, Apr 02, 2020 at 04:16:35PM -0400, Greg Troxel wrote:
> The other side of the coin to "rump is fragile" is "an operating system
> without rump-style tests that can be run automatically is suscpetible to
> hard-to-detect failures from changes, and is therefore fragile".  There
> have been many instances (usually on current-users, I think) of reports
> of newly-failing tests cases, leading to rapid removal of
> newly-introduced defects.

I have to say I have always found rump a major impediment to kernel
development.  I chalk this up to one problem with rump, and one problem
only, but it is a problem so serious to this day I feel core should not
have allowed rump to be committed to HEAD without it being definitively
resolved.

The problem is that rump duplicates the entire kernel configuration and
build framework -- poorly.  Rump builds more like v7 did than like modern
BSD, and this means that any code added to the kernel typically has to
be fitted into not one build system but two.  This just plain sucks.  It
is shoddy.  It is definitely the case that rump is a huge technical
accomplishment that enables us to do really cool things - with testing,
with userspace network stacks, and more.  But it is tremendously unfortunate
that it was checked into our tree before it was ready, and has remained
that way ever since.

I'd love to see a GSoC project to actually make rump build like the
kernel...but it may be too much work.

Thor

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Thor Lancelot Simon
On Tue, Jan 21, 2020 at 11:12:06PM +0100, Mateusz Guzik wrote:
> 
> This does not happen with rb trees and would not happen with the hash
> table if there was bucket locking instead of per-cpu.
> 
> I would stick to hash tables since they are easier to scale (both with
> and without locks).
> 
> For instance if 2 threads look up "/bin" and "/usr" and there is no
> hash collision, they lock *separate* buckets and suffer less in terms
> of cacheline bouncing. In comparison with rb trees they will take the
> same lock. Of course this similarly does not scale if they are looking
> up the same path.

I think there's probably a generic bucket-locked hashtable implementation
in one of the code contributions Coyote Point made long ago.  That said,
we're talking about a screenful of code, so it's not like it'd be hard to
rewrite, either.  Do we want such a thing?  Or do we want to press on
towards more modern "that didn't work, try again" approaches like
pserialize or the COW scheme you describe?

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Proposal: removing urio(4), Rio 500 MP3 player (1999), and Rio-related packages

2020-01-03 Thread Thor Lancelot Simon
On Thu, Jan 02, 2020 at 08:36:51PM +0100, Maxime Villard wrote:
> 
>  - uscanner, which was brought up by other people for an unrelated reason.
>It was removed from FreeBSD in 2009, from OpenBSD in 2013, and disabled
>in NetBSD in 2016. It has been superseded by ugen+SANE.

I would like to suggest that the use of "generic" USB/SCSI/etc. devices
that allow sending arbitrary commands from userland is one of the least
safe design patterns in modern operating systems.  Not all security
issues are accidental - some work as designed, and I think this is one
such.

So it's a bit of a shame to see uscanner or any other target-specific
driver go, with an inherently unsafe generic target driver as replacement,
though perhaps in this case it's necessary.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: racy acccess in kern_runq.c

2019-12-06 Thread Thor Lancelot Simon
On Fri, Dec 06, 2019 at 09:00:33AM +, Andrew Doran wrote:
> 
> Please don't commit this.  These accesses are racy by design.  There is no
> safety issue and we do not want to disturb the activity of other CPUs in
> this code path by locking them.  We also don't want to use atomics either. 

I'm sorry, at this point I can't help myself:

/* You are not expected to understand this. */

Thor


Re: Fonts for console/fb for various locales: a proposal

2019-09-29 Thread Thor Lancelot Simon
On Sat, Sep 28, 2019 at 02:39:19PM +0200, tlaro...@polynum.com wrote:
> 
> 4. Rasterizing (c). This is the whole purpose of METAFONT. METAFONT is a
> rasterizer.

Rasterization of vector fonts by privileged code has been a major source
of security holes in other operating systems.  Does the very limited
benefit really justify the risk?

Thor


Re: fexecve

2019-09-08 Thread Thor Lancelot Simon
On Sun, Sep 08, 2019 at 06:27:11PM -, Christos Zoulas wrote:
> In article <20190908180303.ga6...@panix.com>,
> Thor Lancelot Simon   wrote:
> >On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
> >> 
> >> Here's a simple fexecve(2) implementation. Comments?
> >
> >I think this is dangerous in systems which use chroot into filesystems
> >mounted noexec (or nosuid) and file-descriptor passing into the constrained
> >environment to feed data.  Now new executables (and even setuid ones) can
> >be fed in, too.
> >
> >What can we do about that?
> 
> - We can completely dissallow fexecve in chrooted environments.
> 
> or
> 
> - We can check the permissions of the mountpoint of the current working
>   directory in addition to checking the mountpoint of the executable's
>   vnode.

I'd like to figure out a way to make this _optional_ in chrooted environments
because I think in a system designed to use it, it actually could provide a
security enhancement.  At the same time, I'm worried about the effect on
systems designed as sketched out above but without this feature in mind.

But I'm having trouble thinking through how it'd work.  A flag of course and
a test, but on what -- the receive side of the socket when the chroot's
performed, perhaps?

Or, maybe:

1) Find a way to take the properties of the listen socket from which the
   received-on socket was cloned into account; so if I chroot-then-listen
   and I don't have a writable, executable filesystem in which to create
   my listening socket, I can't receive an executable fd that way

2) At chroot time, block executable fd passing on any socket that hasn't
   been deliberately marked as "can receive executables" with fcntl

Maybe those two in combination (neither looks easy, from my memory of the
relevant code, particularly #1) would work?

Thor


Re: fexecve

2019-09-08 Thread Thor Lancelot Simon
On Sun, Sep 08, 2019 at 01:23:46PM -0400, Christos Zoulas wrote:
> 
> Here's a simple fexecve(2) implementation. Comments?

I think this is dangerous in systems which use chroot into filesystems
mounted noexec (or nosuid) and file-descriptor passing into the constrained
environment to feed data.  Now new executables (and even setuid ones) can
be fed in, too.

What can we do about that?

Thor


Re: /dev/random is hot garbage

2019-07-23 Thread Thor Lancelot Simon
On Mon, Jul 22, 2019 at 07:11:34PM +0200, Kamil Rytarowski wrote:
> 
> It looks like we need a paravirt random driver for xen that could solve
> the rust / random(6) problem.

Or just run on a CPU that has RDRAND / RDSEED available.  Our package
builders are old; I'd chip in a few bucks to replace them, if that
helped.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Interface description support

2019-06-24 Thread Thor Lancelot Simon
On Mon, Jun 24, 2019 at 08:08:24AM -0400, Mouse wrote:
> 
> I think I would prefer that attempting to set a description that
> equals an interface name is rejected, and, when attaching an interface
> after boot (ie, after interface descriptions can have been set), when
> choosing an interface disambiguator number, putative names are tested
> for conflicts against not only other interfaces' names but against
> their descriptions as well.  Together, these would mean that there is
> no overlap between the set of interface names and the set of interface
> descriptions.

I like.

Thor


Re: of_getnode_byname: infinite loop

2019-04-27 Thread Thor Lancelot Simon
On Sat, Apr 27, 2019 at 07:52:07AM +0200, yarl-bau...@mailoo.org wrote:
> I would like to insist since I got no answer.

I think you forgot to attach the proposed patch.

Thor


Re: Regarding the ULTRIX and OSF1 compats

2019-03-23 Thread Thor Lancelot Simon
On Sun, Mar 17, 2019 at 05:41:18AM +0800, Paul Goyette wrote:
> On Sat, 16 Mar 2019, Maxime Villard wrote:
> 
> > Regarding COMPAT_OSF1: I'm not totally sure, but it seems that Alpha's
> > COMPAT_LINUX uses COMPAT_OSF1 as dependency (even if there is no proper
> > dependency in the module), because there are osf1_* calls. Some more
> > compat mess to untangle, it seems...
> > 
> > In all cases, it's only a few functions that are just wrappers, so it
> > probably shouldn't be too complicated to solve.
> 
> It's a total of 15 functions (I generated this list by building an alpha
> GENERIC kernel with COMPAT_OSF1 removed):
> 
> osf1_sys_wait4
> osf1_sys_mount

[...]

There's a funny historical reason for this -- Linux on Alpha originally
used OSF/1 syscall numbering and args for its own native syscalls.

No, our implementation need not actually reflect that -- but it's an
interesting historical detail perhaps we should write down somewhere.



Re: svr4, again

2019-03-11 Thread Thor Lancelot Simon
I'd rather not - on the rare occasions when I boot up my pmax, I do actually
have a couple of Ultrix binaries I run.  If I couldn't, about 5 years of
my academic work would be lost to me (because Digital's packaged applications
couldn't export to any portable format except print-analogues like
Postscript).

What needs to be done to Ultrix compatibility to keep it alive?  I will
admit to not looking at it in at least a decade, but it was pretty darned
small, wasn't it?

On Sat, Mar 09, 2019 at 10:56:34AM -0500, Christos Zoulas wrote:
> Let's retire them.
> 
> christos
> 
> > On Mar 9, 2019, at 5:28 AM, Maxime Villard  wrote:
> > 
> > Re-reading this thread - which was initially about SVR4 but which diverged 
> > in
> > all directions -, I see there were talks about retiring COMPAT_ULTRIX and
> > COMPAT_OSF1, because these were of questionable utility, in addition to 
> > being
> > clear dead wood (in terms of use case, commits in these areas, and ability 
> > to
> > test changes).
> > 
> > Does anyone have anything to say?

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: scsipi: physio split the request

2019-01-01 Thread Thor Lancelot Simon
On Tue, Jan 01, 2019 at 02:48:05PM -0500, Thor Lancelot Simon wrote:
> 
> The work remaining to be done on the branch, as I see it, is:

[...]

I missed one.  I got fed up dealing with the way arguments are
passed through the mount syscalls (especially for ufs) so the
work of letting mount(8) pass a maxxfer value through (either
initially or on an update mount) is not done.  That's definitely
a thing that should be done before this hits the tree.

Thor


Re: scsipi: physio split the request

2019-01-01 Thread Thor Lancelot Simon
On Thu, Dec 27, 2018 at 09:07:41AM +, Emmanuel Dreyfus wrote:
> Hello
> 
> A few years ago I made a failed attempt at running LTFS on a LTO 6 drive. 
> I resumed the effort, and once I got the LTFS code ported, running 
> a command like mkltfs fails with kernel console saying:
> st0(mpii0:0:2:0): physio split the request.. cannot proceed
> 
> This is netbsd-current from yesterday.

You really need tls-maxphys.  It won't be a ton of fun to rebase it
on newer NetBSD-current but it can't be more than a day's work (IIRC
where I left it we were pre device/softc cleanup, and that'll be some
nuisance to address if so).

tls-maxphys propagates the maximum supported transfer size down the
system's actual discovered bus topology at boot time; any node in the
tree can enforce its own restrictions as it sees fit, and nodes like
RAIDframe that effectively demux I/O can compute and declare their
own supported maximum.
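
To give a feel for just the propagation step, here's a hedged sketch -- not
the branch code itself, and the names are made up for the example:

    /*
     * Each node starts from the limit its parent handed down and can
     * only clamp it further; a demuxing driver like RAIDframe takes
     * the minimum over all of its components.
     */
    #include <sys/param.h>    /* MIN() */

    static unsigned
    node_maxphys(unsigned parent_limit, unsigned own_limit)
    {
        /* A node may shrink the inherited limit, never grow it. */
        return MIN(parent_limit, own_limit);
    }

    static unsigned
    raid_maxphys(const unsigned *component_limit, int ncomponents)
    {
        unsigned limit = ~0u;

        for (int i = 0; i < ncomponents; i++)
            limit = MIN(limit, component_limit[i]);
        return limit;
    }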

The work remaining to be done on the branch, as I see it, is:

1) *Some* backpressure mechanism *must* be implemented to prevent
   the filesystems from greedily attempting maximum size I/Os at
   all times, because with a new, much larger maximum in many
   cases, this will lead to much worse unfairness than we now see
   (and some threads doing I/O will much more obviously starve
   others).

   IIRC we've already got something effective for either read or
   write but not the other but it's been a while, so I could be wrong.

2) There's an ugly case with RAIDframe if a component is replaced with one
   that supports a smaller maxphys.  The filesystems need to be notified
   so they can change their own internal max xfer size.  I think I wrote the
   code to deal with this but it's untested.  Wants a look.

3) A number of device drivers -- particularly things in the LSI family -- will
   need to learn about newer DMA descriptor formats supported by their
   hardware in order to support transfers of reasonable size for things like
   tape drives (mpt and possibly mfi*, for example, are currently limited to
   192K because our driver only supports a very old descriptor format; this
   should be a relatively simple fix based on reading newer open-source code
   for these devices as a reference).

I believe that should be all that's needed.  I would estimate it at 5 days
of work, or perhaps a month of evenings/weekends.  I don't have that time
available now and won't in the foreseeable future, but perhaps someone reading
this does.

And of course some of you are much quicker at this stuff than I am (thorpej,
I'm looking at you ;-)).

Most of what the branch does is useful *even if* we remove the stupid VA/PA
mapping business for I/O, I think.  Because it's mostly config sugar to let
the clients know how big an I/O they can ask for at runtime, and that will
be needed regardless.

Thor


Re: Importing libraries for the kernel

2018-12-18 Thread Thor Lancelot Simon
On Fri, Dec 14, 2018 at 03:19:45PM +0100, Joerg Sonnenberger wrote:
> On Thu, Dec 13, 2018 at 11:07:23PM +0900, Ryota Ozaki wrote:
> > On Thu, Dec 13, 2018 at 6:30 AM Joerg Sonnenberger  wrote:
> > >
> > > On Thu, Dec 13, 2018 at 12:58:21AM +0900, Ryota Ozaki wrote:
> > > > Before that, I want to ask about how to import cryptography
> > > > libraries needed tor the implementation.  The libraries are
> > > > libb2[1] and libsodium[2]: the former is for blake2s and
> > > > the latter is for curve25519 and [x]chacha20-poly1305.
> > >
> > > I don't really have a problem with Blake2s, but I have serious concerns
> > > for doing asymmetric cryptography in the kernel. In fact, it is one of
> > > the IMHO very questionable design decisions behind WireGuard and
> > > something I don't want to see repeated in NetBSD.
> > 
> > Can you clarify the concerns?
> 
> Asymmetrical cryptography is slow and complex. On many architectures,
> the kernel will only be able to use slower non-SIMD implementations. ECC

We have an existing facility for doing slow cryptographic operations
asynchronously in the kernel.  Except where architectures have really fast,
instruction-style hardware support for asymmetric crypto operations (and
perhaps even there depending on operand size) this stuff should be done in
a thread interfacing to the rest of the kernel through the existing
opencrypto framework.

That said, I believe we should have asymmetric crypto operations in the
kernel for executable and security policy signing.  In fact, I believe it
strongly enough to have implemented it (interfaced via opencrypto as
described above) twice -- but I don't own either implementation and thus
can't contribute them.

They did _not_ cause measurable performance problems of any kind, and
though it is theoretically possible to do this sort of thing via a
tightly-protected userspace helper process, I prototyped that too and
it gets very ugly, very fast -- the in-kernel way with a thread is, I
believe, better.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: noatime mounts inhibiting atime updates via utime()

2018-12-05 Thread Thor Lancelot Simon
On Tue, Dec 04, 2018 at 08:23:20PM -0800, Jason Thorpe wrote:
> 
> This is especially important with solid state drives.  Those inode updates 
> that write less-than-a-NAND-chunk worth fragment the drive's translation 
> tables, which eventually leads to write amplification and reduced lifespan of 
> the drive.

Yes - this originally came to my attention with LFS, where at one
point in the distant past, the atime updates were triggering
full-segment writes for every file *read*!

I think LFS may still have a local workaround for this.  Perhaps it
should be generalized.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: pci_intr_alloc() vs pci_intr_establish() - retry type?

2018-11-29 Thread Thor Lancelot Simon
On Wed, Nov 28, 2018 at 12:34:42PM -0500, Michael wrote:
> 
> Or, for that matter, if it can use one IRQ or - say - 8 MSI.

Right, this is pretty common, isn't it?

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: nvmm

2018-10-09 Thread Thor Lancelot Simon
On Mon, Oct 08, 2018 at 12:02:14PM +0200, Maxime Villard wrote:
> Here is support for hardware-accelerated virtualization on x86 [1]. A generic
> MI driver, to which MD backends can be plugged. Most of it has already been
> discussed in private.
> 
> I intend to commit it soon, in:
> 
>   module/src/nvmm.c  -> sys/dev/nvmm/nvmm.c
>   module/src/nvmm.h  -> sys/dev/nvmm/nvmm.h
>   module/src/nvmm_internal.h -> sys/dev/nvmm/nvmm_internal.h
>   module/src/nvmm_ioctl.h-> sys/dev/nvmm/nvmm_ioctl.h
>   module/src/arch/nvmm_x86.h -> sys/dev/nvmm/nvmm_x86.h
>   module/src/arch/nvmm_x86_svm.c -> sys/dev/nvmm/nvmm_x86_svm.c
>   module/src/arch/nvmm_x86_svmfunc.S -> sys/dev/nvmm/nvmm_x86_svmfunc.S
>   module/Makefile-> sys/modules/nvmm/Makefile
>   module/nvmm.ioconf -> sys/modules/nvmm/nvmm.ioconf
>   libnvmm/   -> lib/libnvmm/
>   toyvirt/   -> share/examples/nvmm/toyvirt/
>   smallkern/ -> share/examples/nvmm/smallkern/

Wow.  Thank you!

Thor


Re: Missing compat_43 stuff for netbsd32?

2018-09-11 Thread Thor Lancelot Simon
On Tue, Sep 11, 2018 at 03:35:24PM +, Eduardo Horvath wrote:
> 
> It's probably only useful for running ancient SunOS 4.x binaries, maybe 
> Ultrix, Irix or OSF-1 depending on how closely they followed BSD 4.3.

Actually, I think amd64, sparc64, and mips64 are the only platforms where
it could even be possible to encounter netbsd32 executables that required
system calls that had the "o" names in 4.3BSD.

On amd64, because i386 architecture SunOS 4 executables exist and I am not
sure the SunOS 4 kernel did actually pick up all the new syscalls from
4.3BSD.  Whether such executables would run at all though, I'm not sure;
there is probably other COMPAT_SUNOS code needed that may not work on i386.

On sparc64 because there were releases of SunOS 3 for 32-bit SPARC and
SunOS 3 *definitely* didn't have 4.3BSD's new system calls (thus didn't
have the old ones renamed to o*).

On mips64 *if* EXEC_AOUT or EXEC_PECOFF work, because:
* Ultrix basically was 4.2BSD.
* MIPS themselves ported 4.3BSD to the R3000

The first (amd64) and last (mips64) seem pretty much theoretical but it
does seem like there may be 32-bit sparc binaries out there which work
on 32-bit NetBSD but not 64-bit NetBSD.  Whether anyone cares, dunno.

There can be a lot of value to being able to run really old executables,
but you need the right customer in the right state of utter desperation...

Thor


Re: Too many PMC implementations

2018-08-23 Thread Thor Lancelot Simon
On Thu, Aug 23, 2018 at 10:17:29AM -0700, Jason Thorpe wrote:
> 
> > On Aug 23, 2018, at 8:47 AM, Anders Magnusson  wrote:
> 
> > I have used it not long ago for vax.  Maybe I did have to do some tweaks, 
> > do not remember,
> > but I really want to be able to use kernel profiling on vax.
> > 
> > So, I really oppose removing it and leaving vax without any kernel 
> > profiling choice.
> 
> How hard would it be to add support for dtrace on Vax?

Without FBT, probably pretty easy.  But of course FBT is the only plausible
replacement for a statistical profiler that DTrace offers.

The basic requirement for FBT is a dynamic patcher (to really do it right);
though some __predict_false branches can be inserted at the head of every
function and a global used instead.
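
A hedged illustration of that fallback (the names are made up for the
example; this is not how any existing FBT provider is actually written):

    #include <sys/cdefs.h>    /* __predict_false */
    #include <stdio.h>

    /* hypothetical global switch and probe hook, for illustration only */
    static int fbt_probe_enabled;

    static void
    fbt_probe_fire(const char *fn)
    {
        printf("probe: %s\n", fn);
    }

    static int
    some_function(int arg)
    {
        if (__predict_false(fbt_probe_enabled))
            fbt_probe_fire(__func__);

        /* ...the function's normal body... */
        return arg;
    }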

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Too many PMC implementations

2018-08-23 Thread Thor Lancelot Simon
On Thu, Aug 23, 2018 at 06:25:56PM +0200, Kamil Rytarowski wrote:
> 
> As useful I mean the number of commits to the src/ tree. If nothing
> landed, probably nothing was useful. When were the most recent patches
> from gprof or similar?

This is a plainly bogus criterion.  After we integrated DTrace, there
were several periods of a year or more during which there were no
"patches from DTrace or similar".  I know, and it was pretty frustrating
to me too, since I paid a considerable amount of money to have it ported
and maintained, and I had to justify that to my boss.

Should it have, then, been ripped out?  It's certainly a lot more complex
than gprof.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Too many PMC implementations

2018-08-23 Thread Thor Lancelot Simon
On Thu, Aug 23, 2018 at 05:09:35PM +0200, Kamil Rytarowski wrote:
> 
> Observing that all the useful profiling is already done with DTrace, we
> can remove complexity from the kernel with negligible cost.

I'm not sure what to make of this.  I'm trying to come up with a way to
make the above statement true, and I'm having some difficulty.

You can't possibly mean "Observing that (unproven premise), therefore
(conclusion)", so I'll discard that interpretation.

Do you perhaps mean "*If* we were to observe that all useful profiling
were done with DTrace, *then* we could remove complexity from the
kernel with negligible cost"?

Because Ragge and others have been pointing out that in that case,
the premise "all useful profiling is done with DTrace" does not appear
to be true.  Profiling kernel code on VAX may not be useful *to you*
but that does not imply it is "not useful" simpliciter.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: 8.0 performance issue when running build.sh?

2018-08-09 Thread Thor Lancelot Simon
On Thu, Aug 09, 2018 at 10:10:07AM +0200, Martin Husemann wrote:
> With the patch applied:
> 
> Elapsed time: 1564.93 seconds.
> 
> -- Kernel lock spin
> 
> Total%  Count   Time/ms  Lock   Caller
> -- --- - -- --
> 100.002054 14.18 kernel_lock
>  47.43 846  6.72 kernel_lockfileassoc_file_delete+20
>  23.73 188  3.36 kernel_lockintr_biglock_wrapper+16
>  16.01 203  2.27 kernel_lockscsipi_adapter_request+63
>   5.29 662  0.75 kernel_lockVOP_POLL+93
>   5.29  95  0.75 kernel_lockbiodone2+81
>   0.91  15  0.13 kernel_locksleepq_block+1c5
>   0.60  21  0.08 kernel_lockfrag6_fasttimo+1a
>   0.29   9  0.04 kernel_lockip_slowtimo+1a
>   0.27   2  0.04 kernel_lockVFS_SYNC+65
>   0.07   2  0.01 kernel_lockcallout_softclock+42c
>   0.06   3  0.01 kernel_locknd6_timer_work+49
>   0.05   4  0.01 kernel_lockfrag6_slowtimo+1f
>   0.01   4  0.00 kernel_lockkevent1+698
> 
> so .. no need to worry about kernel_lock for this load any more.

Actually, I wonder if we could kill off the time spent by fileassoc.  Is
it still used only by veriexec?  We can easily option that out of the build
box kernels.

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: hashtables

2018-07-24 Thread Thor Lancelot Simon
On Tue, Jul 24, 2018 at 10:14:57PM +, co...@sdf.org wrote:
> 
> I don't have any opinions about hashtables - I just know that
> implementing one would take me really long and would probably still be
> buggy in the end, and I didn't find a similar API that claims to be a
> hashtable.
> 
> Is there some code I could use? I'm totally OK with not a hashtable too.

The basic approach used in our kernel is to declare an array of HEADs
for one of our queue.h datastructures (you usually only have to walk
in one direction, and hardly ever even that, so it's usually a LIST).

Each LIST is a "bucket".  The hash picks out the "bucket" (the LIST_HEAD)
and then you just insert at the head (if adding) or walk down the LIST
until you find the entry you're looking for (if searching).

As a hash function, if you've got things of a convenient size, you can
just use the % operator as your hash function (use an array of buckets
that's of a convenient prime size).  If what you have to hash is larger
there's a hash(9) API.

There are examples all over the kernel; the multicast and ifaddr hashes
in netinet/in.c are fairly simple, though they've had some locking added.
Maybe someone can suggest one even simpler.

This extends fairly naturally to other use cases -- you can put a
mutex at the head of each chain and make a datastructure that allows
parallel access to different buckets; you can sacrifice evenness of
the hash and use a power-of-2 hash size and shift/mask for the lookup,
which is faster most of the time but has a much worse worst case.

This kind of implementation doesn't lend itself to being expanded
so it's collision-free, but most else you might like to do is not too
hard.
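
For concreteness, a minimal sketch of the pattern -- the names here are
invented for illustration, not taken from any particular kernel file:

#include <sys/queue.h>
#include <stddef.h>

#define NBUCKETS 127                    /* a convenient prime */

struct entry {
    LIST_ENTRY(entry) e_list;           /* the chain linkage */
    unsigned int e_key;
    void *e_data;
};

/* One LIST_HEAD per bucket; zero-initialized is an empty LIST. */
static LIST_HEAD(bucket, entry) table[NBUCKETS];

static void
example_insert(struct entry *e)
{
    /* % is the entire hash function here. */
    LIST_INSERT_HEAD(&table[e->e_key % NBUCKETS], e, e_list);
}

static struct entry *
example_lookup(unsigned int key)
{
    struct entry *e;

    /* Walk only the chain the key hashes to. */
    LIST_FOREACH(e, &table[key % NBUCKETS], e_list)
        if (e->e_key == key)
            return e;
    return NULL;
}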

Thor


Re: hashtables

2018-07-24 Thread Thor Lancelot Simon
On Tue, Jul 24, 2018 at 07:01:36PM +, co...@sdf.org wrote:
> hi netbsd,
> 
> if I were to state I totally need hashtables, what already existing API
> would you tell me to use instead?

The kernel has lots of hash tables in it.  If a hash table is the right
datastructure for your purpose, why do something else?

Thor


Re: 8.0 performance issue when running build.sh?

2018-07-13 Thread Thor Lancelot Simon
On Fri, Jul 13, 2018 at 09:37:26AM +0200, Martin Husemann wrote:
> 
> Do you happen to know why 
> 
>   vmstat -e | fgrep "bus_dma bounces"
> 
> shows a > 500 rate e.g. on b45? I never see a single bounce on any of my
> amd64 machines. The build slaves seem to do only a few of them though.

Only the master does much if any disk (actually SSD) I/O.  It must be
either the mpt driver or the scsi subsystem.  At a _guess_, mpt?  I
don't have time to look right now.

This is annoying, but seems unlikely to be the cause of the performance
regression; I don't think the master runs builds any more, does it?  So
it shouldn't be doing much and should have tons of CPU and memory
bandwidth available to move that data.

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: 8.0 performance issue when running build.sh?

2018-07-12 Thread Thor Lancelot Simon
On Thu, Jul 12, 2018 at 07:39:10PM +0200, Martin Husemann wrote:
> On Thu, Jul 12, 2018 at 12:30:39PM -0400, Thor Lancelot Simon wrote:
> > Are you running the builds from tmpfs to tmpfs, like the build cluster
> > does?
> 
> No, I don't have enough ram to test it that way.

So if 8.0 has a serious tmpfs regression... we don't yet know.

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: 8.0 performance issue when running build.sh?

2018-07-12 Thread Thor Lancelot Simon
On Thu, Jul 12, 2018 at 05:29:57PM +0200, Martin Husemann wrote:
> On Tue, Jul 10, 2018 at 11:01:01AM +0200, Martin Husemann wrote:
> > So stay tuned, maybe only Intel to blame ;-)
> 
> Nope, that wasn't it.
> 
> I failed to reproduce *any* slowdown on an AMD CPU locally, maya@ kindly
> repeated the same test on an affected Intel CPU and also could see no
> slowdown.

Are you running the builds from tmpfs to tmpfs, like the build cluster
does?

Thor


Re: Removing bitrotted sys/dev/pci/n8 (NetOctave NSP2000)

2018-07-10 Thread Thor Lancelot Simon
On Tue, Jul 10, 2018 at 06:56:53PM +, co...@sdf.org wrote:
> Hi,
> 
> The code in sys/dev/pci/n8 has bitrotted - it still makes references to
> LKM_ system things, so it is unlikely it builds.
> This has been the case since netbsd-6.

I still have the hardware, but I seriously doubt anyone's using it.  We
imported this driver mostly as a testbed for hardware crypto improvements
(it was not really performance-competitive any more even at the time; but
we had a good relationship with the IC designer and extensive documentation
on their SDK and OpenSSL modifications).

The hardware is no longer made and was, as far as I know, used under NetBSD
only within the engineering organization at Coyote Point, which doesn't
exist any more.

I think the clock's up on this one; take it out please.

Thor


Re: CVS commit: src/sys/arch/x86/x86

2018-07-09 Thread Thor Lancelot Simon
On Mon, Jul 09, 2018 at 12:24:15PM +0200, Kamil Rytarowski wrote:
> 
> The C11 standard could indeed use consistent wording. In one place
> "correctly aligned" in other alignment "restrictions" and
> "requirements". None of these terms is marked as a keyword or term and I
> read them in the context of the document as the same phenomenon (I
> haven't found a different interpretation of this in the wild).

Right, but, architecturally, x86 doesn't have these "restrictions" or
"requirements".  Not for correctness, not with the overwhelming majority
of integer instructions.  Only (sometimes) for performance.

Which makes relying on the standard's language about unaligned access
incorrect.  There's no undefined behavior here because there can't be;
strict alignment is not a feature of the target architecture.

That means that a compiler which emits broken code if it encounters such
a pointer is the broken thing -- *not the source code* -- and whanging
around x86 MD code in our tree to accommodate such a compiler would be
broken too.
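
To make the disagreement concrete, the kind of construct at issue looks
roughly like this (a contrived example, not code from our tree):

#include <stdint.h>
#include <string.h>

/*
 * On x86 this load works even when p is not 4-byte aligned, but a
 * strict reading of C11 calls it undefined behavior, so kubsan
 * complains about it.
 */
uint32_t
load32_direct(const void *p)
{
    return *(const uint32_t *)p;
}

/*
 * The rewrite the sanitizer wants; on strict-alignment machines it is
 * also the only correct spelling, and compilers targeting x86 normally
 * turn it back into a single load anyway.
 */
uint32_t
load32_memcpy(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}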

I understand we have some code that's on the borderline between being MI
and MD, or that's "MD but for several different architectures".  ACPICA
is an obvious example.  And that such code may legitimately be compiled for
targets where unaligned access is not architecturally valid, and compilers
are free to do odd things with code that tries to do it.  But the right
thing to do, I think, is to acknowledge that such code is MI; not to, by
misguided policy, insist that _all_ "MD" code should be written as if it
were MI.

Because if you head down that road, we're going to be forced to
write a bunch more stuff in assembler that we'd figured out over the decades
how to write in a performant and readable way in C; and I don't think anyone
benefits from having more asm in our kernel.

Thor


Re: CVS commit: src/sys/arch

2018-07-08 Thread Thor Lancelot Simon
On Sun, Jul 08, 2018 at 04:49:51PM +0200, Kamil Rytarowski wrote:
> On 08.07.2018 12:01, Martin Husemann wrote:
> > 
> > This is unecessary churn for no good reason, please stop it.
> > 
> > But worse are the other changes you are doing where kubsan insists on
> > natural alignement for {u,}int_{16,32,64} types, which is plain wrong,
> > these CPUs do not require that alignment (and it is not even clear
> > if kubsan propagates the alignment of structures correctly).
> > 
> > Martin
> > 
> 
> I've started a thread on it in tech-kern@. Please move the discussion there.

OK, let's talk about it here.  I would suggest that making changes to
MD code to conform to alignment constraints not actually present on the
"M" in question is not the right thing to do.  I'd further suggest that
there are in fact cases where it can break things or harm performance
(the trick conditionally used in some device drivers of intentionally
misaligning structures so their fields accessed by DMA meet DMA alignment
constraints comes to mind).

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: new errno ?

2018-07-07 Thread Thor Lancelot Simon
The EDOM and ERANGE errno values are not really used outside
floating-point code, and are...conceptually appropriate...to
many other kinds of problems.

Thor

On Fri, Jul 06, 2018 at 03:59:12PM -0700, Jason Thorpe wrote:
> 
> 
> > On Jul 6, 2018, at 2:49 PM, Eitan Adler  wrote:
> > 
> > For those interested in some of the history:
> > https://lists.freebsd.org/pipermail/freebsd-hackers/2003-May/000791.html 
> > <https://lists.freebsd.org/pipermail/freebsd-hackers/2003-May/000791.html>
> 
> ...and the subsequent thread went just as I expected it might.  Sigh.
> 
> Anyway... in what situations is this absurd error code used in the 802.11 
> code?  EFAULT seems wrong because it means something very specific.  
> Actually, that brings me to a bigger point... rather than having a generic 
> error code for "lulz I could have panic'd here, heh", why not simply return 
> an error code appropriate for the situation that would have otherwise 
> resulted in calling panic()?  There are many to choose from :-)
> 
> -- thorpej
> 

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: Revisiting uvm_loan() for 'direct' write pipes

2018-05-26 Thread Thor Lancelot Simon
On Fri, May 25, 2018 at 10:01:15PM +0200, Jaromír Doleček wrote:
> 2018-05-21 21:49 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:
> > It turned out uvm_loan() incurs most of the overhead. I'm still on my
> > way to figure what it is exactly which makes it so much slower than
> > uiomove().
> 
> I've now pinned the problem down to the pmap_page_protect(...,
> VM_PROT_READ), that code does page table manipulations and triggers
> synchronous IPIs. So basically the same problem as the UBC code in
> uvm_bio.c.

There's always going to be some critical size beneath which the cost of
the MMU manipulations (or, these days, the interprocessor communication
to cause other CPUs to do _their_ MMU manipulations) outweighs the benefit
of avoiding copies.  This problem's been known all the way as far back as
Mach on the VAX, where they discovered that for typical message sizes
to/from the microkernel, mapping instead of copying was definitely a lose.

In this case, could we do better by gathering the IPIs so the cost is
amortized over many pages?

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: secmodel_securelevel(9) and machdep.svs.enabled

2018-04-26 Thread Thor Lancelot Simon
On Wed, Apr 25, 2018 at 09:43:18PM +0100, Alexander Nasonov wrote:
> 
> Thinking a bit more about this, I don't think my patch will prevent
> data leakage from the kernel because /dev/mem and /dev/kmem are
> readable at all securelevels. It can only prevent leakage in some

If that is true it's a serious regression.  Are you talking about the
pathological case where "options INSECURE" is set?

Thor


Re: meltdown

2018-01-05 Thread Thor Lancelot Simon
On Thu, Jan 04, 2018 at 04:58:30PM -0500, Mouse wrote:
> > As I understand it, on intel cpus and possibly more, we'll need to
> > unmap the kernel on userret, or else userland can read arbitrary
> > kernel memory.
> 
> "Possibly more"?  Anything that does speculative execution needs a good
> hard look, and that's damn near everything these days.

I wonder about just "these days".  The potential for this kind of problem
goes all the way back to STRETCH or the 6600, doesn't it?  If they had
memory permissions, which I frankly don't know.  And even in microprocessors
it's got to go back to... the end of the 1980s (R6000?), certainly the 1990s.

Though of course "fail early" is an obvious principle to security types,
given the cost of aborting work in progress I can easily see the
opposite being true for CPU designers (I'm not one, so I don't really
know).  Which idiom (check permissions, then speculate / speculate, then
check permissions) is more common?

Thor


Re: RFC: ipsec(4) pseudo interface

2017-12-20 Thread Thor Lancelot Simon
On Mon, Dec 18, 2017 at 06:49:44PM +0900, Kengo NAKAHARA wrote:
> Hi,
> 
> We implement ipsec(4) pseudo interface for route-based VPNs. This pseudo
> interface manages its security policy(SP) by itself, in particular, we do
> # ifconfig ipsec0 tunnel 10.0.0.1 10.0.0.2
> the SPs "10.0.0.1 -> 10.0.0.2"(out) and "10.0.0.2 -> 10.0.0.1"(in) are
> generated automatically and atomically. And then, when we do
> # ifconfig ipsec0 deletetunnel
> the SPs are destroyed automatically and atomically, too.

Do you have IKE daemon changes to use this?

Thor


Re: Merging ugen into the usb stack

2017-12-16 Thread Thor Lancelot Simon
On Fri, Dec 15, 2017 at 08:30:00PM +0100, Martin Husemann wrote:
>  - when libusb takes over controll (as Steffen described) a kernel driver
>that would have attached (i.e. when the skipping does not happen or
>the userland application is configured wrong) is detached, so no
>concurrent access between libusb and a kernel driver can ever happen

To be safe in general -- particularly for storage devices -- this probably
has to depend on securelevel.

Thor


Re: increase softint_bytes

2017-11-23 Thread Thor Lancelot Simon
On Thu, Nov 16, 2017 at 10:25:54PM +0100, Jaromír Doleček wrote:
> 
> If I count correctly, with current 8192 bytes the system supports some 100
> softints, which seems to be on quite low side - more advanced hardware and
> drivers usually use queue and softint per cpu, so it can quickly run out on
> system with e.g. 32 cores.

Indeed, the same driver (ixg) used to routinely run my FreeBSD systems
out of their equivalent to a softint -- 2 dual-port cards, 8-core CPU, and
*wham*.  I believe this was adjusted over there so that they allocate
based on the number of cores with a cap set by total RAM.  And a higher
default.

Thor


Re: kaslr: better rng

2017-11-14 Thread Thor Lancelot Simon
On Tue, Nov 14, 2017 at 02:25:00PM +0100, Maxime Villard wrote:
> On 11/11/2017 at 22:23, Taylor R Campbell wrote:
> > Can you just use the SHA1 in libkern (and the SHA3 that will with any
> > luck soon be in libkern), or are there constraints on the size of the
> > prekern that prevent you from doing so?
> 
> No, there are no constraints. I just didn't know we could use libkern. So you
> can forget about my prng.c, I'll use libkern's SHA512 until we have SHA3.
> 
> 
> On 12/11/2017 at 03:13, Thor Lancelot Simon wrote:
> > cpu_rng already has the code needed to do this -- best to use it, perhaps?
> 
> This would mean moving cpu_rng into libkern?

Maybe so.  I guess there is MD stuff in libkern already.  Only thing is,
looking at the code to remind myself what I did, it relies on our cpu_features
mechanism.  But if you look at the code, it's very, very simple, just a
few lines really to do the work -- in this very particular case perhaps you
would be justified in duplicating it.

Or -- it's tiny -- grab the Intel sample code examples from
https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
which include a tiny CPU feature prober and a little bit of glue around
RDRAND and RDSEED.  You can discard almost all the glue, or even just use
the Intel code (3-clause BSD licensed) as an example of how to probe the
feature bits.

The right thing to do, I would think, is to use RDSEED if you have it;
if you don't have it, or if it fails, use RDRAND.  If you don't have either,
I guess use the TSC to key your hash function.  If you don't have that,
the RTC clock... just a few inb()/outb() to read it, and it's better than
nothing.
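
Something along these lines, as a sketch only -- all the names are made
up, the CPUID probing is omitted, and have_rdseed/have_rdrand are assumed
to be filled in by it:

#include <stdint.h>

extern int have_rdseed, have_rdrand;    /* set from CPUID feature bits */

static int
rdseed64(uint64_t *v)
{
    uint8_t ok;

    /* CF is set iff the instruction returned a valid value. */
    __asm volatile("rdseed %0; setc %1" : "=r" (*v), "=qm" (ok) : : "cc");
    return ok;
}

static int
rdrand64(uint64_t *v)
{
    uint8_t ok;

    __asm volatile("rdrand %0; setc %1" : "=r" (*v), "=qm" (ok) : : "cc");
    return ok;
}

static uint64_t
early_random64(void)
{
    uint64_t v;
    uint32_t lo, hi;
    int i;

    for (i = 0; i < 16; i++)            /* RDSEED can transiently fail */
        if (have_rdseed && rdseed64(&v))
            return v;
    for (i = 0; i < 16; i++)
        if (have_rdrand && rdrand64(&v))
            return v;

    /* Last resort: the TSC -- use it to key a hash, not directly. */
    __asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}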

Here is something else you can use with only a fairly small amount of
MD code -- the processor temperature sensor on most Intel CPUs made since
around 2009.  As you can see from x86/x86/coretemp.c, it's just a couple
of MSR reads - the probe may be the hard part.

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: amd64: kernel aslr support

2017-11-14 Thread Thor Lancelot Simon
On Tue, Nov 14, 2017 at 04:04:39PM +, Christos Zoulas wrote:
> In article <5cee5471-dc6f-db16-8914-75ad5ad15...@m00nbsd.net>,
> Maxime Villard  <m...@m00nbsd.net> wrote:
> >
> >All of that leaves us with about the most advanced KASLR implementation
> >available out there. There are ways to improve it even more, but you'll have
> >to wait a few weeks for that.
> >
> >If you want to try it out you need to make sure you have the latest versions
> >of GENERIC_KASLR / prekern / bootloader. The instructions are still here [2],
> >and haven't changed.
> 
> Very nicely done!

Seriously!  Nice work!

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: kaslr: better rng

2017-11-11 Thread Thor Lancelot Simon
On Sat, Nov 11, 2017 at 09:23:39PM +, Taylor R Campbell wrote:
> 
> Speaking of which, you should read 256 bits out of rdseed, not 64, and
> fall back to rdrand if rdseed is not available.

cpu_rng already has the code needed to do this -- best to use it, perhaps?

Thor


Re: kaslr: better rng

2017-11-06 Thread Thor Lancelot Simon
On Mon, Nov 06, 2017 at 06:51:33PM +0100, Maxime Villard wrote:
> > 
> > What is the reason for using only part of the file, in any application?
> 
> I meant to say that the components don't take random values from the same
> area in the file, for them not to use the same random numbers twice.

That doesn't make sense to me.  Do you believe all modern keyed hash
functions are broken?

If not, why not use HMAC with a suitable hash (SHA512 is probably right
for now) and two different fixed keys, over the entire boot time seed
entropy, to derive two different seeds for the two RNGs?
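
I.e., roughly this, shown as a userland sketch with OpenSSL's HMAC();
the two label strings are arbitrary:

#include <stddef.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/*
 * Derive two independent 64-byte seeds from one pool of boot-time
 * entropy by keying HMAC-SHA512 with two different fixed labels.
 */
static void
derive_seeds(const unsigned char *pool, size_t poollen,
    unsigned char seed_a[64], unsigned char seed_b[64])
{
    unsigned int len;

    HMAC(EVP_sha512(), "seed-for-prekern", (int)strlen("seed-for-prekern"),
        pool, poollen, seed_a, &len);
    HMAC(EVP_sha512(), "seed-for-kernel", (int)strlen("seed-for-kernel"),
        pool, poollen, seed_b, &len);
}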

-- 
  Thor Lancelot Simon  t...@panix.com
 "The two most common variations translate as follows:
illegitimi non carborundum = the unlawful are not silicon carbide
illegitimis non carborundum = the unlawful don't have silicon carbide."


Re: kaslr: better rng

2017-11-06 Thread Thor Lancelot Simon
On Mon, Nov 06, 2017 at 07:30:35AM +0100, Maxime Villard wrote:
> I'm in a point where I need to have a better rng before continuing - and an
> rng that can be used in the bootloader, in the prekern and in the kernel
> (early).
> 
> I would like to use a system similar to the /var/db/entropy-file 
> implementation.
> That is to say, when running the system generates /var/db/random-file, which
> would contain at least 256bytes of random data. When booting the bootloader
> reads this file, can use some of its bytes to get random values. It then gives
> the file to the prekern which will use some other parts of it. The prekern
> finally gives the file to the kernel which can use the rest.

What is the reason for using only part of the file, in any application?

Thor


Re: New line discipline flag to prevent userland open/close of the device

2017-10-29 Thread Thor Lancelot Simon
On Sun, Oct 29, 2017 at 12:47:26PM +0100, Martin Husemann wrote:
> While analyzing PR port-sparc64/52622 I ran into the following issue.
> 
> On some sun hardware, the com0 device is used to connect an old style 
> keyboard,
> and com1 is used for mouse. On other sparc64 machines, com0/com1 are regular
> serial ports and keyboard/mouse is USB attached.

pmax does something similar (though it's a different serial controller,
basically a DZ11 on a chip) -- look at what it does?

Thor


Re: amd64: kernel aslr support

2017-10-05 Thread Thor Lancelot Simon
>  * The RNG is not really strong. Help in this area would be greatly
>appreciated.

This is tricky mostly because once you start probing for hardware
devices or even CPU features, you're going to find yourself wanting
more and more of the support you'd get from the "real kernel".

For example, to probe for RDRAND support on the CPU, you need a
whole pile of CPU feature decoding.  To probe for environmental
sensors or an audio device you may need to know a whole pile about
ACPI and/or PCI.  And so forth.

EFI has an RNG API, but I think it's usually just stubbed out and
besides, you can't rely on having EFI...

I think I'd suggest some combination of:

* Just enough CPU-feature support to find/use RDRAND
  (Intel's sample code is not that big and I think it's
   suitably-licensed)

* Hash the contents of the "CMOS RAM" and/or EFI boot variables

* Maybe poke around for an IPMI BMC (has environmental sensors),
  or a TPM (has a RNG) on the LPC bus

* Maybe poke around for on-die temperature/voltage sensors
  (will again require some CPU identification support).

* Rather than just using rdtsc once, consider using rdtsc to
  "time" multiple hardware oscillators against one another;
  at the very least, you've always got the hardware clock.

* Also, you can use rdtsc to time memory accesses.

For quick and dirty "entropy extraction", you can crunch as much of this
data as you're able to connect together using SHA512.
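
A very rough sketch of that last step, written against libc's <sha2.h>
SHA-512 interface for illustration (early boot code would need its own
copy of something equivalent):

#include <sha2.h>
#include <stdint.h>

static volatile uint8_t probe[4096];    /* something to poke at */

static inline uint64_t
rdtsc(void)
{
    uint32_t lo, hi;

    __asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

static void
gather_seed(uint8_t digest[SHA512_DIGEST_LENGTH])
{
    SHA512_CTX ctx;
    uint64_t t0, t1;
    int i;

    SHA512_Init(&ctx);
    for (i = 0; i < 1024; i++) {
        t0 = rdtsc();
        (void)probe[t0 % sizeof(probe)];        /* time a memory access */
        t1 = rdtsc();
        SHA512_Update(&ctx, (const uint8_t *)&t0, sizeof(t0));
        SHA512_Update(&ctx, (const uint8_t *)&t1, sizeof(t1));
    }
    /* ...also feed in CMOS RAM, EFI variables, sensor readings... */
    SHA512_Final(digest, &ctx);
}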

I know, little or none of this is easy.

Thor


Re: intel sgx support

2017-09-27 Thread Thor Lancelot Simon
I guess the question is what you'd use it for.  SGX is...considerably
less general than one might hope.  Though it does look well suited for
running -- let's say -- a video codec you want to keep Secret From The
World.

It's also not available on server processors, last I looked, which takes
out a whole class of use cases.

Thor

On Wed, Sep 27, 2017 at 09:54:44AM -0400, Stephen Herwig wrote:
> Has anyone looked at adding Intel SGX support to NetBSD?
> 
> Thanks,
> Stephen

-- 
  Thor Lancelot Simon  t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: how to tell if a process is 64-bit

2017-09-10 Thread Thor Lancelot Simon
On Sun, Sep 10, 2017 at 03:29:22PM +, paul.kon...@dell.com wrote:
> 
> MIPS has four ABIs, if you include "O64".  Whether a particular OS allows
> all four concurrently is another matter; it isn't clear that would make
> sense.  Mixing "O" and "N" ABIs is rather messy.
> 
> Would you call N32 a 64-bit ABI?  It has 64 bit registers, so if a value
> is passed to the kernel in a register it comes across as 64 bits.  But it
> has 32 bit addresses.

I wouldn't, because if an address is passed to the kernel, it comes across
as 32 bits.  But what _do_ we do on modern, 32-bit MIPS?  Are we still O32?
It does kind of look like it -- all our 32-bit MIPS ports' sets files seem
to be linked to ../../../shared/mipsel/ which must be O32 since it is also
used for the pmax sets.

-- 
  Thor Lancelot Simon  t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: how to tell if a process is 64-bit

2017-09-10 Thread Thor Lancelot Simon
On Fri, Sep 08, 2017 at 07:38:24AM -0400, Mouse wrote:
> > In a cross-platform process utility tool the question came up how to
> > decide if a process is 64-bit.
> 
> First, I have to ask: what does it mean to say that a particular
> process is - or isn't - 64-bit?

I think the only simple answer is "it is 64-bit in the relevant sense if
it uses the platform's 64-bit ABI for interaction with the kernel".

This actually raises a question for me about MIPS: do we have another
process flag to indicate O32 vs. N32, or can we simply not run O32
executables on 64-bit or N32 kernels (surely we don't use the O32 ABI
for all kernel interaction by 32-bit processes)?

Thor


Re: Proposal: Disable autoload of compat_xyz modules

2017-08-07 Thread Thor Lancelot Simon
On Wed, Aug 02, 2017 at 12:35:22PM +, Taylor R Campbell wrote:
> 
> I propose to disable the following modules by default, but leave the
> code so you can still modload them or include them in your custom
> kernel config if you want:
> 
> compat_osf1

This probably still has users on Alpha.

> compat_ultrix
> exec_ecoff

These definitely have users on pmax, unless the last pmax has finally died.

Thor


Re: wscons/wsmux question

2017-06-17 Thread Thor Lancelot Simon
On Sun, Jun 18, 2017 at 12:14:41PM +0800, Paul Goyette wrote:
> 
> Would anyone object if I were to remove all of the "#if NWSMUX > 0"
> conditionals, and simply require the wsmux code to be included (via
> files.wscons) whenever any child dev (wsdisplay, wskbd, wsmouse, or wsbell)
> is configured?

How much (more) bloat?

Thor


Re: RFC: localcount_hadref() or localcount_trydarin()

2017-06-12 Thread Thor Lancelot Simon
On Mon, Jun 12, 2017 at 12:51:29PM +, Taylor R Campbell wrote:
> > Date: Mon, 12 Jun 2017 10:53:52 +0900
> > From: Kengo NAKAHARA <k-nakah...@iij.ad.jp>
> > 
> > I want to avoid detaching the encryption device while it is used by IPsec.
> > That is, once someone creates Security Assocatation(SA) to call
> > crypto_newsession(), the encryption device related the SA must not be
> > detached until the SA is flushed(done crypto_freesession()) and the SA
> > is not used(done crypto_dispatch() and cryptointr()).
> 
> Why don't you just use a global reference count first?  Is the latency
> and scalability of crypto_newsession and crypto_freesession critical?

For many workloads, it will be, yes.  This pair of operations will occur:

* Once per SSL/TLS connection even if the connection is resumed,
  which is tens of thousands of times per second on a busy server,
  possibly even hundreds of thousands of times per second.

  This assumes someone has an SSL/TLS library that can efficiently
  use our kernel crypto, but there's at least one out there that I
  know of.  With modern instruction-based accelerators rather than
  the DMA-and-interrupts style this probably matters less.

* Once per Phase 2 IPsec association -- potentially tens of
  thousands per second in recovery from an outage -- this likely
  matters more to most users of our opencrypto today.

-- 
  Thor Lancelot Simon  t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: Restricting rdtsc [was: kernel aslr]

2017-04-04 Thread Thor Lancelot Simon
On Tue, Apr 04, 2017 at 05:39:35PM +0200, Maxime Villard wrote:
> sorry for the delay
> 
> On 31/03/2017 at 19:23, Andreas Gustafsson wrote:
> > It's ASLR that's broken, not rdtsc, and I strongly object to
> > restricting the latter just to that people can continue to gain
> > a false sense of security from the former.
> 
> For your information, side-channels are not only limited to aslr. It has
> been demonstrated too that cache latencies can be used to keylog a privileged
> process, and to steal cryptographic keys.

Time is a basic operating system service.  Lack of cheap precision time is
not an _advantage_ of NetBSD; it is a disadvantage.

As others have noted, our general intention has been to _reduce_ the cost
to an application of obtaining timestamps in general -- by providing a
commpage with a base value, and allowing libc to use the cycle counter
as a no-system-calls-required way to obtain an offset.  Other operating
systems do this and it is a real advantage for many applications.  If we
block userland access to the cycle counters, this is a nonstarter.

Yes, the ability of malicious code to measure the behavior of critical
system components and facilities is a problem, but I tend to believe the
solution has to be in the implementation of those components and facilities,
not in removing the ability of non-malicious code to make precision
measurements.

We may not have applications in base that use use rdtsc to get quick
timestamps, but they're sure out there.  OpenSSL's MD code used to
use it -- has that changed? -- and I've seen it in database applications,
language runtimes, and numerous other places.  I really don't think it
would be a good idea to cause it to not work in the general case.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-03 Thread Thor Lancelot Simon
On Tue, Apr 04, 2017 at 12:39:46AM +0200, Jaromír Doleček wrote:
> 
> Is there any reason we wouldn't want to set QAM=1 by default for
> sd(4)? Seems like pretty obvious performance improvement tweak.

Supposedly, there are some rather old drives -- mid-1990s or thereabouts --
that may keep some SIMPLE tags pending and *never* finish them unless the
host occasionally issues an ORDERED tag.  I don't know if any of them
still do, but some Linux HBA drivers used to forcibly set 1 in N tags
(for relatively large values of N) to ORDERED to avoid this.

I was pondering putting this setting into scsictl (it is sufficiently
SCSI-specific it seems like it doesn't belong in dkctl).

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-02 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 09:26:50AM +, Michael van Elst wrote:
> 
> >Setting WCE on SCSI drives is simply a bad idea.  It is
> >not necessary for performance and creates data integrity
> >issues.
> 
> I don't know details about data integrity issues although
> I'm sure there are some. But unfortunately WCE makes a difference
> on many SCSI drives nowadays. You either run with WCE or
> use a RAID controller with it's own stable storage (BBU/Flash)
> or live with a significant speed penalty for writes.

But the RAID controller is just another embedded computer, with its own
SCSI initiator.  If it's not setting WCE on the drive, how does it get better
performance than we could?  The answer must be one of:

A) Fewer barriers (ignoring cache flushes or clearing ORDERED tags earlier
   because it has stable "cache")
B) Larger commands (we could do this...sigh).
C) More tags in flight.

These are, in theory, things we could do too.

However -- I believe for the 20-30% of SAS drives you mention as shipping
with WCE set, it should be possible to obtain nearly identical performance
and more safety by setting the Queue Algorithm Modifier bit in the control
mode page to 1.  This allows the drive to arbitrarily reorder SIMPLE
writes so long as the precedence rules with HEAD and ORDERED commands
are respected.  I don't seem to have a drive like the ones you're describing
(all my SAS stuff is several years old at best, and nothing shipped with
WCE turned on as far as I can tell), but if you're able to try this, I'd love
to know what the result is.

Given enough tags in flight, the only difference between using SIMPLE tags
for writes with QAM=1 and running with WCE enabled is that the host should
be able to tell when the writes actually hit stable storage, which is kind
of a big deal...

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 06:46:24PM +0200, Michael van Elst wrote:
> On Sat, Apr 01, 2017 at 11:12:42AM -0400, Thor Lancelot Simon wrote:
> 
> > That said, very high-latency transports like iSCSI require a lot more
> > data than we can put into flight at once.  We just don't have enough
> > parallelism in our I/O subsystem (and most applications can't supply
> > enough).
> 
> We have tons of parallelism for writing and a small amount for reading.

Unless you've done even more than I noticed, allocation in the filesystems
is going to be a bottleneck -- concurrent access not having been foremost
in anyone's mind when FFS was designed.

XFS is full of tricks for this.  Unfortunately, despite a few early papers,
the source code pretty much is the documentation -- and parts of the code
that were effectively hamstrung by the lesser capabilities of the early
Linux kernel compared to end-of-the-road Irix have in some cases been
removed.

-- 
  Thor Lancelot Simon  t...@panix.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."  - H.L.A. Hart


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-04-01 Thread Thor Lancelot Simon
On Sat, Apr 01, 2017 at 08:54:40AM +, Michael van Elst wrote:
> t...@panix.com (Thor Lancelot Simon) writes:
> 
> >When SCSI tagged queueing is used properly, it is not necessary to set WCE
> >to get good write performance, and doing so is in fact harmful, since it
> >allows the drive to return ORDERED commands as complete before any of the
> >data for those or prior commands have actually been committed to stable
> >storage.
> 
> Do you think that real world disks agree? WCE is often necessary to
> get any decent performance and yes, data is not committed to stable

I don't agree.  What's sometimes necessary is to adjust the other mode
page bits that allow the drive to arbitrarily reorder SIMPLE commands,
but with an I/O subsystem that can put enough data in flight at once,
there's no performance reason to use WCE and considerable reliability
reason not to.

That said, very high-latency transports like iSCSI require a lot more
data than we can put into flight at once.  We just don't have enough
parallelism in our I/O subsystem (and most applications can't supply
enough).

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 07:16:25PM +0200, Jaromír Doleček wrote:
> > The problem is that it does not always use SIMPLE and ORDERED tags in a
> > way that would facilitate the use of ORDERED tags to enforce barriers.
> 
> Our scsipi layer actually never issues ORDERED tags right now as far
> as I can see, and there is currently no interface to get it set for an
> I/O.

It's not obvious, but in fact ORDERED gets set for writes
as a default, I believe -- in sd.c, I think?

This confused me for some time when I last looked at it.

> I lived under assumption that SIMPLE tagged commands could be and are
> reordered by the controller/drive at will already, without setting any
> other flags.

They might be -- there are well defined mode page bits
to control this, but I believe targets are free to use
whatever default they like.

> 
> > When SCSI tagged queueing is used properly, it is not necessary to set WCE
> > to get good write performance, and doing so is in fact harmful, since it
> > allows the drive to return ORDERED commands as complete before any of the
> > data for those or prior commands have actually been committed to stable
> > storage.
> 
> This was what I meant when I said "even ordered tags couldn't avoid
> the cache flushes". Using ORDERED tags doesn't provide on-media
> integrity when WCE is set.

Setting WCE on SCSI drives is simply a bad idea.  It is
not necessary for performance and creates data integrity
issues.

> Now, it might be the case that the on-media integrity is not the
> primary goal. Then flush is only a write barrier, not integrity
> measure. In that case yes, ORDERED does keep the semantics (e.g.
> earlier journal writes are written before later journal writes). It
> does make stuff much easier to code, too - simply mark I/O as ORDERED
> and fire, no need to explicitly wait for completion, and can drop e.g
> journal locks faster.
> 
> I do think that it's important to concentrate on case where WCE is on,
> since that is realistically what majority of systems run with.

I don't believe most SCSI drives are run with WCE on.

I agree FUA or its equivalent is needed for non-SCSI
drives.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-31 Thread Thor Lancelot Simon
On Fri, Mar 31, 2017 at 02:16:44PM +0200, Edgar Fuß wrote:
> Oh well.
> 
> TLS> If the answer is that you're running with WCE on in the mode pages, then
> TLS> don't do that:
> EF> I don't get that. If you turn off the write cache, you need neither cache 
> EF> flushes nor ordering, no?
> MB> You still need ordering. With tagged queuing, you have multiple commands
> MB> running at the same time (up to 256, maybe more fore newer scsi) and the 
> MB> drive is free to complete them in any order.  Unless one of them is an 
> MB> ORDERED command, in which case comamnds queued before have to complete 
> MB> before.
> 
> I guess we are talking past each other. I should have phrased that ``If you
> don't use any tagging and turn off the write cache, ...''.

But that doesn't make sense.  Why would our SCSI layer not use tagging?

The problem is that it does not always use SIMPLE and ORDERED tags in a
way that would facilitate the use of ORDERED tags to enforce barriers.

Also, that we may not know enough about the behavior of our filesystems
in the real world to be 100% sure it's safe to set the other mode page
bits that allow the drive to arbitrarily reorder SIMPLE commands (which
under some conditions is necessary to match the performance of running
with WCE set).

When SCSI tagged queueing is used properly, it is not necessary to set WCE
to get good write performance, and doing so is in fact harmful, since it
allows the drive to return ORDERED commands as complete before any of the
data for those or prior commands have actually been committed to stable
storage.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-30 Thread Thor Lancelot Simon
On Wed, Mar 29, 2017 at 11:53:55AM +0200, Edgar Fuß wrote:
> > It needs to do this [flush disc cache after committing journal] because 
> > it needs to make sure that journal data are saved before we save the 
> > journal commit block.
> So the point is to force an order (data before commit block).
> 
> > Implicitly, the pre-commit flush also makes sure that all asynchronously 
> > written metadata updates are written to the media, before the commit makes
> > them impossible to replay.
> So the point is do force an order (metadata before journal).
> 
> > Even SCSI ORDERED tags wouldn't help to avoid the need for cache flushes.
> So why that if the point of the cache flushes is to ensure an order?

It doesn't make sense to me either.  ORDERED tags are required not to complete
until all previously submitted SIMPLE and ORDERED tags have been committed to
stable storage; and if that's not enough, you can, I believe, use a HEAD tag.
Why isn't that good enough?

If the answer is that you're running with WCE on in the mode pages, then
don't do that: use SIMPLE tags for all writes except when you intend a
barrier, and ORDERED when you do.  I must be missing something.

-- 
 Thor Lancelot Simon  t...@panix.com

Cry, the beloved country, for the unborn child that is the
inheritor of our fear.  -Alan Paton


Re: Restricting rdtsc [was: kernel aslr]

2017-03-28 Thread Thor Lancelot Simon
On Tue, Mar 28, 2017 at 04:58:58PM +0200, Maxime Villard wrote:
> Having read several papers on the exploitation of cache latency to defeat
> aslr (kernel or not), it appears that disabling the rdtsc instruction is a
> good mitigation on x86. However, some applications can legitimately use it,
> so I would rather suggest restricting it to root instead.

This will break a ton of stuff.  Code all over the place checks if it's
on x86 and uses rdtsc when it wants a quick timestamp.

Thor


Re: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL

2017-03-27 Thread Thor Lancelot Simon
On Tue, Mar 28, 2017 at 01:17:18AM +0200, Jaromír Doleček wrote:
> 2017-03-12 11:15 GMT+01:00 Edgar Fuß :
> > Some comments as I probably count as one of the larger WAPBL consumers (we
> > have ~150 employee's Home and Mail on NFS on FFS2+WAPBL on RAIDframe on 
> > SAS):
> 
> I've not changed the code in RF to pass the cache flags, so the patch
> doesn't actually enable FUA there. Mainly because disks come and go
> and I'm not aware of mechanism to make WAPBL aware of such changes. It

I ran into this issue with tls-maxphys and got so frustrated I was actually
considering simply panicking if a less-capable disk were used to replace a
more-capable one.

Just FYI.

Thor


Re: kernel aslr: someone interested?

2017-03-25 Thread Thor Lancelot Simon
On Sat, Mar 25, 2017 at 09:20:14AM +0100, Maxime Villard wrote:
> 
> Verily, 5-level page trees with higher entropy will be introduced by Intel
> soon, the instructions that leak kernel addresses can be made privileged
> (UMIP), cache issues are being fixed; and in short, I wouldn't be surprised
> if in five years other features appear that make ASLR even more interesting
> and faster than static code.

I would, since "static code" is clearly the base case for any address layout
scheme.  But once you're within a smidge, who cares?

Even now, with register renaming and every other trick in the book and
*without* ASLR, PIC is still measurably slower than non-PIC in userland.
Not much, but try it and see (on modern CPUs you'll need the performance
counters, I suspect -- thanks for fixing those).  But nobody seems to
notice enough to build non-PIC, so...

Obviously on VAC platforms ASLR is unacceptably slow.  But there aren't
many of those left, and we don't build every platform with the same
toolchain options anyway (of course).

Everyone else that matters has begun to move towards ASLR as the default
(many, perhaps most, are there for userspace code already) and it _does_
matter -- if Max's vault7 example doesn't suffice, have a look at those
ShadowBrokers firewall exploits that were released a few months back.
Each and every one of those comes in several different flavors tweaked
for the kernel layout of different versions of the same code on the same
platform -- where there are instructions for use (fun to read) they
basically say to stop attacking and get expert help if you run into a
target where the -- *static* -- kernel layout is unexpected.  ASLR
increases the work factor for that stuff considerably (though there
are obvious approaches if you can zap the early boot code to wire down
the "randomization" so it isn't, etc).  Yes, there are workloads where
we wouldn't want it, but I submit there are also many important
cases where we should have it.

Thor


Re: 10Gb and 40Gb equipment available

2017-02-16 Thread Thor Lancelot Simon
On Thu, Feb 16, 2017 at 10:42:21AM -0500, Thor Lancelot Simon wrote:
> NetBSD has received a donation of 10Gb Ethernet switches (Arista 7124S or SX
> and 7050).
> 
> We will use some of these switches in our own infrastructure but are
> offering others for long-term use by developers interested in using them to
> work on NetBSD-related projects.
> 
> An immediately obvious project would be porting of additional 10Gb card
> drivers, such as those for Broadcom, Solarflare, or Mellanox cards.  I
> can supply cards and cabling for this purpose.

I should note that another obvious project would be support for link-layer
encryption (MACsec) in our kernel.  I haven't used it on these switches
but according to the datasheet, it is supported.

The Linux kernel and many Windows drivers support this and it is
becoming an increasingly common protocol for protecting LAN and WAN
traffic (particularly in light of the recent slew of vulnerabilities in
IKE implementations and consequent nervousness about IPsec on WAN
links).

This should be a fairly simple task starting with the existing code for
encryption on wireless networks.  Strictly speaking, since this could be
tested card-to-card there is no need for a switch with MACsec support to
do this work, but we would be more than happy to supply one of these
switches to an interested and capable developer as a reference peer for
the protocol(s) (encryption and key negotiation).


-- 
 Thor Lancelot Simon  t...@panix.com

Cry, the beloved country, for the unborn child that is the
inheritor of our fear.  -Alan Paton


10Gb and 40Gb equipment available

2017-02-16 Thread Thor Lancelot Simon
NetBSD has received a donation of 10Gb Ethernet switches (Arista 7124S or SX
and 7050).

We will use some of these switches in our own infrastructure but are
offering others for long-term use by developers interested in using them to
work on NetBSD-related projects.

An immediately obvious project would be porting of additional 10Gb card
drivers, such as those for Broadcom, Solarflare, or Mellanox cards.  I
can supply cards and cabling for this purpose.

These are low-latency switches with a fairly rich L2 and L3 feature
set (though lacking newer features such as VXLAN encapsulation).  They run
Linux as the control plane OS and user-provided code on the control plane
is expressly supported.  A few features perhaps of note to NetBSD developers
include MACsec support, port mirroring over GRE, rapid, multiple, and
per-VLAN spanning-tree, equal cost multipath routing, and multichassis
link aggregation ("MLAG").

The 7050 switches have 40Gb ports.  I have a *limited* quantity of 40Gb
adapters and cabling available which we can provide to developers who
demonstrate a serious interest in working on 40Gb card drivers and
related stack features (e.g. large receive).  The adapters are Solarflare
and Mellanox, and suitable-licensed FreeBSD drivers are available as a
starting point.

The switch software is EOL by Arista and has some open security issues
which mean that the control plane should *not* be exposed to untrusted
networks.  This should not be a problem for development work.

Please let me know before the end of February if you want hardware from
this donation, and why.

-- 
 Thor Lancelot Simon  t...@panix.com

Cry, the beloved country, for the unborn child that is the
inheritor of our fear.  -Alan Paton


Re: workqueue for pr_input

2017-01-23 Thread Thor Lancelot Simon
On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
> 
> The demerit is that that mechanism adds non-trivial
> overhead; RTT of ping increases by 30 usec.

I don't see why overhead for control protocols like ICMP matters.  I
think if you look at longstanding commercial designs for networking
equipment -- software and hardware alike -- for literally decades they
have taken the approach of optimizing the fast path (forwarding, data
protocol performance) while allowing the exception/error paths (ICMP etc.)
to be very slow.

Thor


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Thor Lancelot Simon
On Mon, Jan 02, 2017 at 06:08:04PM +, David Holland wrote:
> On Mon, Jan 02, 2017 at 01:01:34PM -0500, Thor Lancelot Simon wrote:
>  > On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote:
>  > > (from a while back)
>  > > 
>  > > However, I'm missing something. The I/O queue depths that you need to
>  > > get peak write performance from SSDs are larger than 31, and the test
>  > > labs appear to have been able to do this with SATA-attached SSDs...
>  > > what are/were they doing?
>  > 
>  > Aggressive prefetching, extreme efforts to reduce command latency at
>  > the drive end of the SATA link (and higher link speeds), plus much
>  > larger request sizes than we can issue.
> 
> Yes, but I mean testing with queue depths > 31, like ~100, which I'm
> sure I remember seeing. But maybe I'm wrong... obviously I should go
> rake up some links, maybe later.

The tests could have been run with RAID controllers that present a
SCSI interface to the host.  These often support very deep queues both
for the virtual targets and at the adapter (channel) itself, at which
point it's all about minimizing latency again on the controller's side
of the interaction, where it really _is_ SATA with a limited queue
depth.

If you want a large number of SATA targets in one box you are likely
using a RAID controller even if you're just using it in JBOD mode.  That
makes every SATA target look like a SCSI target.

Thor


Re: Plan: journalling fixes for WAPBL

2017-01-02 Thread Thor Lancelot Simon
On Mon, Jan 02, 2017 at 05:31:23PM +, David Holland wrote:
> (from a while back)
> 
> However, I'm missing something. The I/O queue depths that you need to
> get peak write performance from SSDs are larger than 31, and the test
> labs appear to have been able to do this with SATA-attached SSDs...
> what are/were they doing?

Aggressive prefetching, extreme efforts to reduce command latency at
the drive end of the SATA link (and higher link speeds), plus much
larger request sizes than we can issue.

-- 
 Thor Lancelot Simon  t...@panix.com

Ring the bells that still can ring.


Re: History of disklabels

2016-12-31 Thread Thor Lancelot Simon
On Wed, Dec 28, 2016 at 08:14:36PM +0100, Edgar Fuß wrote:
> > You must be misremembering something.
> Looks like it.
> 
> So disklabels are a BSD thing, not a DEC thing?

Ultrix has a disklabel format, and I've moved drives from VAX to MIPS
Ultrix systems, so the answer really is "it depends".  I have also booted
Ultrix kernels on systems installed with plain 4.2BSD, however, and I didn't
lose access to my drives, so I assume (new-enough) Ultrix had both the DEC
disklabel support and the old compiled-in partitioning as a fallback.

For non-boot drives, you can use disktab, right?  In that case I can't
recall how the partition boundary information actually gets to the kernel.

Do our VAX disk drivers for Unibus/Massbus hardware retain the compiled
in partitioning as a fallback?  Without it, it'd be tough to even read
old Unix filesystems without modifying the disk contents (to write either
an Ultrix or 4.4BSD label into place).

-- 
 Thor Lancelot Simon  t...@panix.com

Ring the bells that still can ring.


Re: Possible buffer cache race?

2016-10-23 Thread Thor Lancelot Simon
On Sun, Oct 23, 2016 at 06:27:09PM +0200, Jaromír Doleček wrote:
> 
> I see interesting thing - periodically, all of the tar processes get
> blocked sleeping on either tstile, biolock or pager_map. All the tar
> processes block. When I just wait they stay blocked. When I call
> sync(8), all of them unblock and continue running, until again they
> all hit the same condition later.  When I keep calling sync,
> eventually all processes finish.
> 
> Most often they block on biolock, then somewhat less frequently
> tstile; pager_map is more rare. It's usually mix of these - most
> processes block on biolock, some tstile and zero/one/two on pager_map.

I have a hunch this might be related to the metadata cache growing and
shrinking (I assume by "buffer cache" you mean the traditional buffer
cache, now used only for metadata -- not the page cache?).

It could also be related to the need to push out dirty buffers in order
to make room for new metadata.  If this happens because the clock ticks
over, it's supposed to be basically asynchronous.  But if it happens
because you need another metadata buffer _right now_ to schedule a
directory read or a write... it is effectively synchronous.

> Any idea where I should try to start poking?

You could just disable the code that tries to dynamically allocate and
free metadata cache buffers, replacing it with a fixed size allocation
at system startup -- try a very large one and see what happens.  If that
looks better, it's a hint that the problem is related to the allocations
used to dynamically grow the metadata cache.  But if it doesn't look
better, the problem may simply be related to buffer recycling to make
room for new metadata for new operations.  In that case, make the fixed
buffer cache size much smaller so recycling's going on all the time,
figure out how to instrument it, and debug...

-- 
  Thor Lancelot Simon  t...@panix.com

"The dirtiest word in art is the C-word.  I can't even say 'craft'
 without feeling dirty."-Chuck Close


Re: vioif vs if_vio

2016-09-24 Thread Thor Lancelot Simon
On Sat, Sep 24, 2016 at 02:02:16PM +0800, Paul Goyette wrote:
> Shouldn't the vioif(4) device be more properly named if_vio(4), to be
> consistent with other network interfaces?

I think the code was imported with the same filenames as its original
source, to ease merging of updates.

> With its current name, it could never successfully exist as an auto-loaded
> kernel module, since the auto-load code assumes the if_ prefix!

Sounds like a bug in the auto-load code.

Thor


Re: FUA and TCQ (was: Plan: journalling fixes for WAPBL)

2016-09-24 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 01:02:26PM +, paul.kon...@dell.com wrote:
> 
> > On Sep 23, 2016, at 5:49 AM, Edgar Fuß  wrote:
> > 
> >> The whole point of tagged queueing is to let you *not* set [the write 
> >> cache] bit in the mode pages and still get good performance.
> > I don't get that. My understanding was that TCQ allowed the drive to 
> > re-order 
> > commands within the bounds described by the tags. With the write cache 
> > disabled, all write commands must hit stable storage before being reported 
> > completed. So what's the point of tagging with cacheing disabled?
> 
> I'm not sure.  But I have the impression that in the real world tagging is 
> rarely, if ever, used.

I'm not sure what you mean.  Do you mean that tagging is rarely, if ever,
used _to establish write barriers_, or do you mean that tagging is rarely,
if ever used, period?

If the latter, you're way, way wrong.

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 07:45:00PM +0200, Manuel Bouyer wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +, Eric Haszlakiewicz wrote:
> > On September 23, 2016 10:51:30 AM EDT, Warner Losh  wrote:
> > >All NCQ gives you is the ability to schedule multiple requests and
> > >to get notification of their completion (perhaps out of order). There's
> > >no coherency features at all in NCQ.
> > 
> > This seems like the key thing needed to avoid FUA: to implement fsync() you 
> > just wait for notifications of completion to be received, and once you have 
> > those for all requests pending when fsync was called, or started as part of 
> > the fsync, then you're done.
> 
> *if you have the write cache disabled*

*Running with the write cache enabled is a bad idea*



Re: FUA and TCQ

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 09:38:08AM -0400, Greg Troxel wrote:
> 
> Johnny Billquist  writes:
> 
> > With rotating rust, the order of operations can make a huge difference
> > in speed. With SSDs you don't have those seek times to begin with, so
> > I would expect the gains to be marginal.
> 
> For reordering, I agree with you, but the SSD speeds are so high that
> pipeling is probably necessary to keep the SSD from stalling due to not
> having enough data to write.  So this could help move from 300 MB/s
> (that I am seeing) to 550 MB/s.

The iSCSI case is illustrative, too.  Now you can have a "SCSI bus" with
a huge bandwidth delay product.  It doesn't matter how quickly the target
says it finished one command (which is all enabling the write-cache can get
you) if you are working in lockstep such that the initiator cannot send
more commands until it receives the target's ack.

This is why on iSCSI you really do see hundreds of tags in flight at
once.  You can pump up the request size, but that causes fairness
problems.  Keeping many commands active at the same time helps much more.
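
The arithmetic behind "hundreds" is just the bandwidth-delay product
divided by the request size -- example numbers below, not measurements:

/* Back-of-the-envelope only. */
#include <stdio.h>

int
main(void)
{
	double bw = 550e6;	/* bytes/s the target can sink */
	double rtt = 20e-3;	/* seconds of round trip on the iSCSI path */
	double reqsz = 64e3;	/* bytes per command */

	/* commands that must be in flight to keep the pipe full */
	printf("tags needed: %.0f\n", bw * rtt / reqsz);	/* ~172 here */
	return 0;
}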

Now think about that SSD again.  The SSD's write latency is so low that
_relative to the time it takes the host to issue a new command_ you
have the same problem.  It's clear that enabling the write cache can't
really help, or at least can't help much: you need to have many commands
pending at the same time.

Our storage stack's inability to use tags with SATA targets is a huge
gating factor for performance with real workloads (the residual use of
the kernel lock at and below the bufq layer is another).  Starting de
novo with NVMe, where it's perverse and structurally difficult to not
support multiple commands in flight simultaneously, will help some, but
SATA SSDs are going to be around for a long time still and it'd be
great if this limitation went away.

That said, I am not going to fix it myself so all I can do is sit here
and pontificate -- which is worth about what you paid for it, and no
more.

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-23 Thread Thor Lancelot Simon
On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> > > AFAIK ordered tags only guarantees that the write will happen in order,
> > > but not that the writes are actually done to stable storage.
> > 
> > The target's not allowed to report the command complete unless the data
> > are on stable storage, except if you have write cache enable set in the
> > relevant mode page.
> > 
> > If you run SCSI drives like that, you're playing with fire.  Expect to get
> > burned.  The whole point of tagged queueing is to let you *not* set that
> > bit in the mode pages and still get good performance.
> 
> Now I remember that I did indeed disable disk write cache when I had
> scsi disks in production. It's been a while though.
> 
> But anyway, from what I remember you still need the disk cache flush
> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

I think that's true only if you're running with write cache enabled; but
the difference is that most ATA disks ship with it turned on by default.

With an aggressive implementation of tag management on the host side,
there should be no performance benefit from unconditionally enabling
the write cache -- all the available cache should be used to stage
writes for pending tags.  Sometimes it works.

-- 
  Thor Lancelot Simon                          t...@panix.com

"The dirtiest word in art is the C-word.  I can't even say 'craft'
 without feeling dirty."-Chuck Close


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Thor Lancelot Simon
On Thu, Sep 22, 2016 at 04:06:55PM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote:
> > On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> > > 
> > > 3.2 use FUA (Force Unit Access) for commit record write
> > > This avoids need to issue even the second DIOCCACHESYNC, as flushing
> > > the disk cache is not really all that useful, I like the thread over
> > > at:
> > > http://yarchive.net/comp/linux/drive_caches.html
> > > Slightly less controversially, this would allow the rest of the
> > > journal records to be written asynchronously, leaving them to execute
> > > even after commit if so desired. It may be useful to have this
> > > behaviour optional. I lean towards skipping the disk cache flush as
> > > default behaviour however, if we implement write barrier for the
> > > commit record (see below).
> WAPBL would need to deal with drives without FUA, i.e. fall back to cache 
> > > flush.
> > 
> > I have never understood this business about needing FUA to implement
> > barriers.  AFAICT, for any SCSI or SCSI-like disk device, all that is
> > actually needed is to do standard writes with simple tags, and barrier
> > writes with ordered tags.  What am I missing?
> 
> AFAIK ordered tags only guarantees that the write will happen in order,
> but not that the writes are actually done to stable storage.

The target's not allowed to report the command complete unless the data
are on stable storage, except if you have write cache enable set in the
relevant mode page.

If you run SCSI drives like that, you're playing with fire.  Expect to get
burned.  The whole point of tagged queueing is to let you *not* set that
bit in the mode pages and still get good performance.
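
(For concreteness, the bit I mean is WCE in the Caching mode page, 08h.
A minimal decode, assuming you've already fetched the page with MODE
SENSE through whatever passthrough you like:)

#include <stdbool.h>
#include <stdint.h>

static bool
wce_enabled(const uint8_t *pg)		/* pg points at the mode page itself */
{
	return (pg[0] & 0x3f) == 0x08 &&	/* page code: Caching */
	    (pg[2] & 0x04) != 0;		/* byte 2, bit 2: WCE */
}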

Thor


Re: Plan: journalling fixes for WAPBL

2016-09-22 Thread Thor Lancelot Simon
On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> 
> 3.2 use FUA (Force Unit Access) for commit record write
> This avoids need to issue even the second DIOCCACHESYNC, as flushing
> the disk cache is not really all that useful, I like the thread over
> at:
> http://yarchive.net/comp/linux/drive_caches.html
> Slightly less controversially, this would allow the rest of the
> journal records to be written asynchronously, leaving them to execute
> even after commit if so desired. It may be useful to have this
> behaviour optional. I lean towards skipping the disk cache flush as
> default behaviour however, if we implement write barrier for the
> commit record (see below).
> WAPBL would need to deal with drives without FUA, i.e. fall back to cache 
> flush.

I have never understood this business about needing FUA to implement
barriers.  AFAICT, for any SCSI or SCSI-like disk device, all that is
actually needed is to do standard writes with simple tags, and barrier
writes with ordered tags.  What am I missing?

I must have proposed adding a B_BARRIER or B_ORDERED flag at least five times
over the years.  There are always objections...
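
The driver end of it is about this simple.  Sketch only -- B_BARRIER is
the proposed flag, not an existing one, and the tag names are the generic
SCSI ones rather than whatever a given HBA layer spells them:

static void
disk_start_one(struct buf *bp, struct disk_xfer *xs)
{
	if (bp->b_flags & B_BARRIER)
		xs->tag = ORDERED_QUEUE_TAG;	/* target completes all earlier
						 * simple-tagged commands first */
	else
		xs->tag = SIMPLE_QUEUE_TAG;	/* target may reorder freely */

	/* ...build the WRITE CDB and queue the transfer as usual... */
}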

Thor

