Re: Anonymous vnodes?

2023-06-27 Thread Joerg Sonnenberger
On Mon, Jun 26, 2023 at 06:13:17PM -0400, Theodore Preduta wrote:
> Is it possible to create a vnode for a regular file in a file system
> without linking the vnode to any directory, so that it disappears when
> all open file descriptors to it are closed?  (As far as I can tell, this
> isn't possible with any of the vn_* or VOP_* functions?)

Linux has O_TMPFILE for this, but we don't support this extension so
far.
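
A sketch of the Linux extension, for reference (this does not work on
NetBSD today):

	/* Needs <fcntl.h> and <unistd.h>; O_TMPFILE is a Linux extension. */
	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
	/* fd now refers to a regular file with no directory entry; the
	 * storage is released once the last descriptor is closed.  linkat(2)
	 * can still give it a name later if desired. */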

Joerg


Re: Per-descriptor state

2023-05-04 Thread Joerg Sonnenberger
On Thu, May 04, 2023 at 09:58:49AM +0100, Robert Swindells wrote:
> 
> David Holland  wrote:
> >On Sun, Apr 30, 2023 at 09:44:49AM -0400, Mouse wrote:
> > > > Close-on-fork is apparently either coming or already here, not sure
> > > > which, but it's also per-descriptor.
> > > 
> > > I should probably add that here, then, though use cases will likely be
> > > rare.  I can think of only one program I wrote where it'd be useful; I
> > > created a "close these fds post-fork" data structure internally.
> >
> >I can't think of any at all; to begin with it's limited to forks that
> >don't exec, and unless just using it for convenience as you're
> >probably suggesting, it only applies when also using threads, and if
> >one's using threads why is one also using forks? So it seems like it's
> >limited to badly designed libraries that want to fork behind the
> >caller's back instead of setting up their forks at initialization
> >time. Or something.
> 
> Or it is needed for a little used application called Firefox.

For a sandbox, something like closefrom is actually much preferable, as
you don't know what else has opened file descriptors. I really question
the sanity of close-on-fork...
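
A minimal sketch of that approach, assuming a closefrom(3) that takes
the lowest descriptor to close:

	#include <unistd.h>

	/* Drop everything above stdin/stdout/stderr before entering the
	 * sandbox; no bookkeeping of individual descriptors needed. */
	closefrom(STDERR_FILENO + 1);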

Joerg


Re: kernel goes dark on boot

2023-04-04 Thread Joerg Sonnenberger
Am Tue, Apr 04, 2023 at 08:30:51AM + schrieb Emmanuel Dreyfus:
> Debugging what happens after that is more tricky. This is assembly
> code, I am not sure I can printf from there. I try to make the
> machine reboot at startprog64() start to see if I just reach that 
> place.

Look for beep_on_reset in the ACPI wake code. If your machine has an
emulated PC speaker, that might be the easiest option:

movb	$0xc0, %al		/* PIT channel 2 divisor, low byte */
outb	%al, $0x42
movb	$0x04, %al		/* divisor high byte: count 0x04c0, roughly 1 kHz */
outb	%al, $0x42
inb	$0x61, %al
orb	$0x3, %al		/* enable speaker output and the timer 2 gate */
outb	%al, $0x61

Joerg


Re: #pragma once

2022-10-16 Thread Joerg Sonnenberger
Am Sat, Oct 15, 2022 at 07:21:35PM + schrieb Taylor R Campbell:
> How reliable/consistent is toolchain support for it?  Is it worth
> adopting or is the benefit negligible over continuing to use
> traditional #include guards?  Likely problems with adopting it?

Does it gain anything besides being a tiny bit smaller? I don't think it
is faster, and trading three lines of boilerplate per header file for a
non-portable construct isn't a worthy gain.
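
For comparison, the boilerplate versus the non-portable one-liner
(the header name is just an example):

	/* Traditional include guard: portable, three lines. */
	#ifndef SYS_FOO_H_
	#define SYS_FOO_H_
	/* ... declarations ... */
	#endif /* SYS_FOO_H_ */

	/* Non-standard alternative: one line, relies on toolchain support. */
	#pragma once
	/* ... declarations ... */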

Joerg


Re: Emulating missing linux syscalls

2022-04-18 Thread Joerg Sonnenberger
Am Tue, Apr 19, 2022 at 02:39:44AM +0530 schrieb Piyush Sachdeva:
> On Sat, Apr 16, 2022 at 2:06 AM Joerg Sonnenberger  wrote:
> >
> > Am Wed, Apr 13, 2022 at 09:51:31PM - schrieb Christos Zoulas:
> > > In article , Joerg Sonnenberger   
> > > wrote:
> > > >Am Tue, Apr 12, 2022 at 04:56:05PM - schrieb Christos Zoulas:
> > >
> > > >splice(2) as a concept is much older than the current Linux 
> > > >implementation.
> > > >There is no reason why zero-copying for sockets should require a
> > > >different system call for zero-copying from/to pipes. There are valid
> > > >reasons for other combinations, too. Consider /bin/cp for example.
> > >
> > > You don't need two system calls because the kernel knows the type of
> > > the file descriptors and can dispatch to different implementations.
> > > One of the questions is do you provide the means to pass an additional
> > > header/trailer to the output data like FreeBSD does for its sendfile(2)
> > > implementation?
> > >
> > > int
> > > splice(int infd, off_t *inoff, int outfd, off_t *outoff, size_t len,
> > > const struct {
> > >   struct iov *head;
> > >   size_t headcnt;
> > >   struct iov *tail;
> > >   size_t tailcnt;
> > > } *ht, int flags);
> >
> > There are essentially two use cases here:
> > (1) I want a simple interface to transfer data from one fd to another
> > without extra copies.
> >
> > (2) I want to avoid copies AND I want to avoid system calls.
> >
> > For the former:
> > int splice(int dstfd, int srcfd, off_t *len);
> >
> > is more than good enough. "Transfer up to [*len] octets from srcfd to
> > dstfd, updating [*len] with the actually transferred amount and returning
> > the first error if any."
> >
> > For the second category, an interface more like the posix_spawn
> > interface (but without all the extra allocations) would be useful.
> >
> 
> Therefore, having the above const struct *ht to support
> mmap() will be a good option I guess.

It covers a very limited subset of the desired options. Basically, what
you want in this case is something like:

int splicev(int dstfd, struct spliceop ops[], size_t *lenops, off_t
*outoff);

where spliceops is used to specify the supported operations:
- read from a fd with possible seek
- read from memory
- seek output
and maybe other operations I can't think of right now. lenops provides
the number of operations on input and the number of remaining operations
on return; outoff is the remaining output offset in the current block. Some
variant of this might be possible.
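
To make that concrete, a purely hypothetical sketch of such an
operation descriptor:

	struct spliceop {
		int	so_op;		/* e.g. READ_FD, READ_MEM, SEEK_OUT
					 * (hypothetical operation codes) */
		int	so_fd;		/* source descriptor for READ_FD */
		off_t	so_off;		/* input offset or output seek target */
		const void *so_buf;	/* source buffer for READ_MEM */
		size_t	so_len;		/* length to transfer */
	};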

Joerg


Re: Emulating missing linux syscalls

2022-04-15 Thread Joerg Sonnenberger
Am Wed, Apr 13, 2022 at 09:51:31PM - schrieb Christos Zoulas:
> In article , Joerg Sonnenberger   
> wrote:
> >Am Tue, Apr 12, 2022 at 04:56:05PM - schrieb Christos Zoulas:
> 
> >splice(2) as a concept is much older than the current Linux implementation.
> >There is no reason why zero-copying for sockets should require a
> >different system call for zero-copying from/to pipes. There are valid
> >reasons for other combinations, too. Consider /bin/cp for example.
> 
> You don't need two system calls because the kernel knows the type of
> the file descriptors and can dispatch to different implementations.
> One of the questions is do you provide the means to pass an additional
> header/trailer to the output data like FreeBSD does for its sendfile(2)
> implementation?
> 
> int
> splice(int infd, off_t *inoff, int outfd, off_t *outoff, size_t len, 
> const struct {
>   struct iov *head;
>   size_t headcnt;
>   struct iov *tail;
>   size_t tailcnt;
> } *ht, int flags);

There are essentially two use cases here:
(1) I want a simple interface to transfer data from one fd to another
without extra copies.

(2) I want to avoid copies AND I want to avoid system calls.

For the former:
int splice(int dstfd, int srcfd, off_t *len);

is more than good enough. "Transfer up to [*len] octets from srcfd to
dstfd, updating [*len] with the actually transferred amount and returning
the first error if any."
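
A usage sketch against that proposed interface (assuming a 0/-1 return
convention with errno):

	/* Copy srcfd to dstfd until EOF or the first error. */
	for (;;) {
		off_t len = 1024 * 1024;	/* transfer up to 1 MiB per call */
		if (splice(dstfd, srcfd, &len) == -1)
			err(1, "splice");	/* needs <err.h> */
		if (len == 0)			/* nothing left to transfer */
			break;
	}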

For the second category, an interface more like the posix_spawn
interface (but without all the extra allocations) would be useful.

> >I was saying that the Linux system call can be implemented without a
> >kernel backend, because I don't consider zero copy a necessary part of
> >the interface contract. It's a perfectly valid, if a bit slower,
> >implementation to allocate a kernel buffer and do I/O via that.
> 
> Of course, but how do you make an existing binary use it? LD_PRELOAD
> a binary to override the symbol in the linux glibc? By that logic you
> don't need an in kernel linux emulation, you can do it all in userland :-)

You still provide the system call as front end, but internally implement
it on top of regular read/write to a temporary buffer.
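
Conceptually something like the following loop, just with the
kernel-internal I/O primitives instead of read/write (a sketch of the
principle, not of the actual compat code):

	/* Emulate the "zero-copy" transfer with a bounce buffer. */
	char buf[64 * 1024];
	size_t left = len;

	while (left > 0) {
		size_t chunk = left < sizeof(buf) ? left : sizeof(buf);
		ssize_t n = read(srcfd, buf, chunk);
		if (n <= 0)
			break;			/* EOF or error */
		if (write(dstfd, buf, (size_t)n) != n)
			break;			/* short write or error */
		left -= (size_t)n;
	}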

Joerg


Re: Emulating missing linux syscalls

2022-04-13 Thread Joerg Sonnenberger
Am Tue, Apr 12, 2022 at 04:56:05PM - schrieb Christos Zoulas:
> In article , Joerg Sonnenberger   
> wrote:
> >Am Tue, Apr 12, 2022 at 12:29:21PM - schrieb Christos Zoulas:
> >> In article
> >,
> >> Piyush Sachdeva   wrote:
> >> >-=-=-=-=-=-
> >> >
> >> >Dear Stephen Borrill,
> >> >My name is Piyush, and I was looking into the
> >> >'Emulating missing Linux syscalls' project hoping to contribute
> >> >to this year's GSoC.
> >> >
> >> >I wanted to be sure of a few basic things before I go ahead:
> >> >- linux binaries are found in- src/sys/compat/linux
> >> >- particular implementation in - src/sys/compat/linux/common
> >> >- a few architecture-specific implementations in-
> >> >  src/sys/compat/linux/arch/.
> >> >- The src/sys/compat/linux/arch//linux_syscalls.c file
> >> >   lists of system calls, and states if a particular syscall is present or
> >> >not.
> >> >
> >> >I was planning to work on the 'sendfile()' syscall, which I believe
> >> >is unimplemented for amd64 and a few other architectures as well.
> >> >
> >> >Considering the above points, I was hoping you could point me in
> >> >the right direction for this project. Hope to hear from you soon.
> >> 
> >> I would look into porting the FreeBSD implementation of sendfile to NetBSD.
> >
> >sendfile(2) for Linux compat can be emulated in the kernel without a
> >native backend. That said, a real splice(2) or even splicev(2) would be
> >really nice to have. But that's a different project and, arguably, a
> >potentially more generally useful one, too.
> 
> 
> Yes, splice is more general (as opposed to send a file to a socket), but I
> think splice has limitations too (one of the fds needs to be a pipe).
> Is that true only for linux?

splice(2) as a concept is much older than the current Linux implementation.
There is no reason why zero-copying for sockets should require a
different system call for zero-copying from/to pipes. There are valid
reasons for other combinations, too. Consider /bin/cp for example.

I was saying that the Linux system call can be implemented without a
kernel backend, because I don't consider zero copy a necessary part of
the interface contract. It's a perfectly valid, if a bit slower,
implementation to allocate a kernel buffer and do I/O via that.

Joerg


Re: Emulating missing linux syscalls

2022-04-12 Thread Joerg Sonnenberger
Am Tue, Apr 12, 2022 at 12:29:21PM - schrieb Christos Zoulas:
> In article 
> ,
> Piyush Sachdeva   wrote:
> >-=-=-=-=-=-
> >
> >Dear Stephen Borrill,
> >My name is Piyush, and I was looking into the
> >'Emulating missing Linux syscalls' project hoping to contribute
> >to this year's GSoC.
> >
> >I wanted to be sure of a few basic things before I go ahead:
> >- linux binaries are found in- src/sys/compat/linux
> >- particular implementation in - src/sys/compat/linux/common
> >- a few architecture-specific implementations in-
> >  src/sys/compat/linux/arch/.
> >- The src/sys/compat/linux/arch//linux_syscalls.c file
> >   lists of system calls, and states if a particular syscall is present or
> >not.
> >
> >I was planning to work on the 'sendfile()' syscall, which I believe
> >is unimplemented for amd64 and a few other architectures as well.
> >
> >Considering the above points, I was hoping you could point me in
> >the right direction for this project. Hope to hear from you soon.
> 
> I would look into porting the FreeBSD implementation of sendfile to NetBSD.

sendfile(2) for Linux compat can be emulated in the kernel without a
native backend. That said, a real splice(2) or even splicev(2) would be
really nice to have. But that's a different project and, arguably, a
potentially more generally useful one, too.

Joerg


Re: ETOOMANYZLIBS

2022-03-24 Thread Joerg Sonnenberger
Am Thu, Mar 24, 2022 at 10:13:32PM +0100 schrieb Thomas Klausner:
> riastradh pointed out that this probably needs to be applied to
> 
> src/sys/net/zlib.c
> 
> as well, but that code seems to be from an older zlib version and the
> patch doesn't apply cleanly.
> 
> Can it be changed to use common/dist/zlib instead?

At least in the past, it was special and switching to a modern
implementation was non-trivial. I don't think the problem applies
to it either.

Joerg


Re: valgrind

2022-03-22 Thread Joerg Sonnenberger
Am Tue, Mar 22, 2022 at 09:01:19PM + schrieb RVP:
> On Tue, 22 Mar 2022, Joerg Sonnenberger wrote:
> 
> > Am Mon, Mar 21, 2022 at 11:09:41PM + schrieb RVP:
> > > Sanitizers are OK, but, they don't seem to work in some cases:
> > 
> > Neither case is a memory leak. They are both reachable memory
> > allocations.
> > 
> 
> Yah: Reachable is what valgrind reports too, but, it still flags
> it as "lost". My point being that I'm glad we have sanitizers, but,
> you still need valgrind for some common use-cases.

Yeah, and it regularly results in "bug reports" from the majority of
people who don't understand the difference. That said, I think the
draconian mode of gperftools's heap checker can provide the check you
want here. Just be aware that it is pure noise most of the time.

Joerg


Re: valgrind

2022-03-21 Thread Joerg Sonnenberger
Am Mon, Mar 21, 2022 at 11:09:41PM + schrieb RVP:
> Sanitizers are OK, but, they don't seem to work in some cases:

Neither case is a memory leak. They are both reachable memory
allocations.
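
The classic case looks like this: the block is still referenced by a
global at exit, so it is reachable, not lost:

	#include <stdlib.h>

	static char *cache;		/* still points at the block at exit */

	int
	main(void)
	{
		cache = malloc(4096);	/* never freed, but always reachable */
		return 0;
	}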

Joerg


Re: membar_enter semantics

2022-02-13 Thread Joerg Sonnenberger
Am Mon, Feb 14, 2022 at 02:45:46AM + schrieb David Holland:
> On Mon, Feb 14, 2022 at 03:12:29AM +0100, Joerg Sonnenberger wrote:
>  > Am Mon, Feb 14, 2022 at 02:01:13AM + schrieb David Holland:
>  > > In this case I would argue that the names should be membar_load_any()
>  > > and membar_any_store().
>  > 
>  > Kind of like with the BUSDMA_* flags, it is not clear from that name in
>  > which direction they work either. As in: is it a barrier that stops the
>  > next load? Is it a barrier that ensures that a store is visible?
> 
> Given that English is left-to-right, and that memory barriers are
> about ordering memory operations, it seems a lot clearer than "enter".

I don't think that argument works with the way barriers around read and
write operations are normally used. A read barrier is normally "ensure
that a read doesn't move before this point", whereas a write barrier is
normally "ensure that a write operation doesn't move beyond this point".
Note the opposite temporal direction. Not sure if there are sensible use
cases of the inverted directions, i.e. if we care about CPUs that can
reorder reads relative to later writes.
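
The usual publish/consume pattern shows the direction; a sketch with
membar_producer/membar_consumer from <sys/atomic.h> (data, ready, etc.
are placeholders):

	/* Producer: make the data visible before the flag. */
	data = compute();
	membar_producer();	/* earlier stores complete before the next store */
	ready = 1;

	/* Consumer: don't consume the data before seeing the flag. */
	while (ready == 0)
		continue;
	membar_consumer();	/* later loads are not satisfied before the flag load */
	use(data);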

Joerg


Re: membar_enter semantics

2022-02-13 Thread Joerg Sonnenberger
Am Mon, Feb 14, 2022 at 02:01:13AM + schrieb David Holland:
> In this case I would argue that the names should be membar_load_any()
> and membar_any_store().

Kind of like with the BUSDMA_* flags, it is not clear from that name in
which direction they work either. As in: is it a barrier that stops the
next load? Is it a barrier that ensures that a store is visible?

Joerg


Re: timecounters

2021-11-14 Thread Joerg Sonnenberger
On Sat, Nov 13, 2021 at 02:25:22AM +, Emmanuel Dreyfus wrote:
> x86 TSC: cycle count from CPU register. Very quick to read, but unreliable if 
> CPU frequency changes because of power saving. Also each CPU has its own
> value (how do we cope with that?)

It's even more complicated. On older x86 CPUs, the TSCs of different
CPUs could run at different frequencies, but this was fixed around the
Netburst era or so. That said, they frequently have issues when used in
SMP systems. It's a high resolution timer.

> x86 i8254: Plain old Intel clock chip, slow but reliable and needs locking
> to read.

Slow and limited resolution. You really want to avoid it if at all
possible. To put "slow" into perspective, from memory it is an order
of magnitude slower than the PCI-based timers and two orders of
magnitude slower than the LAPIC and TSC.

> ACPI-Fast, ACPI-Safe: Needs multiple read to get a good value. 
> I have to read more of the paper to understand why. Also the
> difference between ACPI-Fast and ACPI-Safe is not yet clear to me.

You should only see the safe variation unless it is very old hardware.
This timecounter should be considered the safe default for x86, with a
reasonably good resolution (1 MHz or higher) and not too slow access. The
first generation had some issues with possible 64-bit read splits, but
those are not that relevant in practice.

> dummy: not documented. sys/kern/kern_tc.c says it is a fake timecounter
> for bootstrap, before a real one is used. .

Correct.

> clockinterrupt: not documented at all?  See sys/kern/kern_clock.c

This basically is a tick counter. Don't use it at all if you have any
actual timecounter.

> lapic: not quite documented in lapic(4)

This uses the counter register of the LAPIC, which is based on the CPU
bus frequency and coherent even on older x86 CPUs. It's slower than the
TSC but still much faster than e.g. the HPET. It works reasonably well
until we switch away from the periodic timer interrupt. Lower resolution
than the TSC, but typically much higher than the HPET.

> ichlpcib: loosely documented in ichlpcib(4)

This is the hardware also used by ACPI-Safe, but comes with a bit less
overhead.

> hpet: documented in hpet(4)

If TSC or LAPIC doesn't work, this is generally the best choice. It has
a decent resolution (in the MHz range) and often half the access time
of the PCI timecounters (ichlpcib).

> And that is only the beginning. grep tc_name sys/kern/arch sys/dev yields
> a lot of results. For instance, arm has  bcm2835_tmr, clpssoc, a9tmr, gtmr, 
> dc21285_fclk, gpt, digctl, mvsoctmr, dmtimer, saost_count, MCT, hstimer, 
> Timer 2, LOSC, timer2, timer3, tmr1_count
> 
> Does it make sense to document all of that?

I don't think there is generally a point in documenting all the MD ones.
X86 is somewhat special as there are often two choices with different
advantages and disadvantages. Most other platforms have a pretty clear
winner.
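
For reference, the active timecounter and the available ones can be
inspected from userland; a sketch assuming the usual kern.timecounter
sysctl nodes:

	#include <sys/param.h>
	#include <sys/sysctl.h>
	#include <stdio.h>

	int
	main(void)
	{
		char buf[128];
		size_t len = sizeof(buf);

		/* Which timecounter the kernel is currently using. */
		if (sysctlbyname("kern.timecounter.hardware", buf, &len,
		    NULL, 0) == 0)
			printf("timecounter: %s\n", buf);
		return 0;
	}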

Joerg


Re: Interrupt storms with wm(4)

2021-10-15 Thread Joerg Sonnenberger
On Fri, Oct 15, 2021 at 02:10:32AM +, Emmanuel Dreyfus wrote:
> The wm0 interface gets a lot of interrupts. On low usage, CPU
> is spent around 10% in interrupts. It can rise to more than 80%.
> See below what systat vm says when the machine is quiet. 
> ioapic0 pin 16 is the wm0 interface.

Is it the only device using that pin? If so, can you add an event
counter or so to wm_intr to check what percentage of calls to wm_intr
return handled == 0?
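
Roughly along these lines (a sketch; the counter name and placement are
made up):

	#include <sys/evcnt.h>

	static struct evcnt wm_spurious_ev;	/* hypothetical counter */

	/* once, at attach time */
	evcnt_attach_dynamic(&wm_spurious_ev, EVCNT_TYPE_INTR, NULL,
	    "wm0", "spurious intr");

	/* in wm_intr(), whenever handled == 0 */
	wm_spurious_ev.ev_count++;

The result then shows up in vmstat -e output.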

Joerg


Re: [PATCH] Move DRM-driver firmware from base to its own set, gpufw

2021-09-23 Thread Joerg Sonnenberger
On Thu, Sep 23, 2021 at 02:43:22PM +, coypu wrote:
> This set is only installed on amd64,i386,evbarm.

I wonder if we shouldn't allow re-triggering firmware load via either
sysctl or drvctl and just move them to outside root. Even the option to
load them from pkgsrc might be useful.

Joerg


Re: kern.maxlockf for byte range lock limit

2021-07-13 Thread Joerg Sonnenberger
On Tue, Jul 13, 2021 at 06:37:57AM +, Emmanuel Dreyfus wrote:
> On Tue, Jul 13, 2021 at 03:36:49AM +, David Holland wrote:
> > Well, that was the idea; make it some factor times the current open
> > file limit or something like that. Not sure why the existing limit is
> > apparently per-user rather than per-process or what that's supposed to
> > accomplish. These lock objects are not exactly large so it's not
> > necessary to be tightfisted with them.
> 
> Then we could just use maxfiles, coulnd't we? 

You should be able to have multiple locks per file, but the whole point
of the limit was that you shouldn't be able to trivially exhaust kernel
memory.

Joerg


Re: protect pmf from network drivers that don't provide if_stop

2021-07-01 Thread Joerg Sonnenberger
On Thu, Jul 01, 2021 at 06:47:08AM -0300, Jared McNeill wrote:
> Not really a fan of this as it doesn't protect other potential if_stop users
> (and "temporary fix" rarely is..). How about something like this instead?

I was more thinking along the lines of doing the assert in that place.
The lack of if_stop is almost always a porting bug.

Joerg


Re: maximum limit of files open with O_EXLOCK

2021-06-21 Thread Joerg Sonnenberger
On Mon, Jun 21, 2021 at 03:51:51AM +, David Holland wrote:
> On Sat, Jun 19, 2021 at 08:12:38AM +, nia wrote:
>  > The Zig developer found the kernel limit:
>  > https://nxr.netbsd.org/xref/src/sys/kern/vfs_lockf.c#116
>  > 
>  > but it doesn't seem to be adjustable through sysctl.
>  > I wonder if it should be.
> 
> I wonder if the logic should be changed to allow one "free" lock per
> open file, since those are already limited, and it's not like the
> amounts of memory involved are large.

The easiest would be to just include maxfiles as a factor.

Joerg


Re: 9.1: boot-time delay? [WORKAROUND FOUND]

2021-05-27 Thread Joerg Sonnenberger
On Fri, May 28, 2021 at 03:14:24AM +0700, Robert Elz wrote:
> Date:Thu, 27 May 2021 05:05:15 - (UTC)
> From:mlel...@serpens.de (Michael van Elst)
> Message-ID:  
> 
>   | mlel...@serpens.de (Michael van Elst) writes:
>   |
>   | >Either direction mstohz or hztoms should better always round up to
>   | >guarantee a minimal delay.
>   |
>   | And both should be replaced by hztous()/ustohz().
> 
> While changing ms to us is probably a good idea, when a change happens,
> the "hz" part should be changed too.
> 
> hz is (a unit of) a measure of frequency, ms (or us) is (a unit of) a
> measure of time (duration) - converting one to the other makes no sense.

"hz" in this context comes from HZ - it is used here as dimension of
1s/HZ. Just like "ms" here is used as "1s/1000". It's a pretty sensible
naming compared to much longer naming variants.

Joerg


Re: problem with USER_LDT in current 9.99.81

2021-04-28 Thread Joerg Sonnenberger
On Wed, Apr 28, 2021 at 10:54:22AM +0100, Dave Tyson wrote:
> I don't know whether altering the LDT size will have any implications for the 
> SVS stuff, currently 
> 
> #define MAX_USERLDT_SIZE PAGE_SIZE
> 
> appears in the above include files.
> 
> I changed the define to be:
> 
> #define MAX_USERLDT_SIZE PAGE_SIZE*2
> 
> and a recompiled kernel now allows wine to work correctly.

If we want to have a fixed size limit, it should be 64KB. That's the maximum
size an LDT can have and I see no reason why we would want to allow less. It's
rare enough that wasting a bit of space here shouldn't matter.
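
I.e. something like:

	/* 8192 descriptors of 8 bytes each, the architectural maximum. */
	#define MAX_USERLDT_SIZE	(64 * 1024)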

Joerg


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Joerg Sonnenberger
On Sun, Apr 04, 2021 at 11:47:10PM +0700, Robert Elz wrote:
> If not, what prevents someone from reading (copying) the file from the
> system while it is stopped (assessing the storage device via other methods)
> and then knowing exactly what the seed is going to be when the system boots?

That is discussed in the security model Taylor presented a long time
ago. In short: nothing. In most use cases, you are screwed at this point
anyway since various other cryptographic material like the host ssh key
is also lost. There is one special case here where this has to be taken
under consideration and that is cloning virtual machines. The short
answer is that you as system integrator are responsible for handling it
in an appropriate manner. Making sure that the VM sees enough entropic
activity before the entropy is accessed is one way to do that. The seed file
doesn't replace the entropy pool, so any entropy that actually did get
added during the boot process still remains.

> I think I'd prefer possibly insecure, but difficult to obtain from outside
> like disk drive interrupt timing low order bits than that.   Regardless of
> how unproven that method might be.

See above, that's still the case. No one said anything about not using
sources of potential entropy. All that changed is that we don't pretend
they provide entropy. As I mentioned elsewhere, a lot of the classic
entropy sources are surprisingly bad nowadays when someone can observe
the kernel, especially in a virtualized environment.

Joerg


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Joerg Sonnenberger
On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes  
> wrote:
> Subject: Re: regarding the changes to kernel entropy gathering
> >
> > > What about architectures that have nothing like RDRAND/RDSEED?  Are
> > > they, effectively, totally unsupported now?
> >
> > Nope, not entirely.  But they have to be seeded once.  If they
> > have storage which survives reboots, and entropy is saved and
> > restored on reboot, they will be ~fine.
> 
> BTW, to me reusing the same entropy on every reboot seems less secure.

Except that's not what the system is doing. It removes the seed file on
boot and creates a new one on shutdown.

> > Systems without persistent storage and also without RDRAND/RDSEED
> > will however be ... a more challenging problem.
> 
> Leaving things like that would be totally silly.
> 
> With my patch the old way of gathering entropy from devices works just
> fine as it always did, albeit with the second patch it does require a
> tiny bit of extra configuration.

You keep repeating yourself. It doesn't make your claims any less false.
At this point, can we please just stop this thread?

Joerg


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Joerg Sonnenberger
On Mon, Apr 05, 2021 at 12:07:49AM +0200, Havard Eidnes wrote:
> I am still of the fairly firm beleif that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

It's not without foundation. Remember the first hardware RNG on x86? It
got killed by a die shrink in the pre-release phase. The hardware still
said it was present. That's the reason why we don't use RDSEED directly
in place of the entropy pool and still merge other potential entropy
sources in.

Joerg


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Joerg Sonnenberger
On Sun, Apr 04, 2021 at 09:24:56PM +, RVP wrote:
> PS. Is there a way to get the bit-stream from the various in-kernel
> sources so that we can run them through these sort of tests? That
> way we can check--not intuit--how random the bit-streams they
> produce really are.

Part of the problem here is that most of the non-RNG data sources are
easily observable either from the local system (e.g. any malicious user)
or other VMs on the same machine (in case of a hypervisor) or local
machines on the same network (in case of network interrupts). That's the
real reason why their entropy is hard to estimate. It becomes even more
annoying with modern hardware features like interrupt moderation of
NICs. They can make the timing of interrupts highly predictable.

Joerg


Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Joerg Sonnenberger
On Sun, Apr 04, 2021 at 02:16:41PM -0700, Paul Goyette wrote:
> > Personally, I'm happy with anything that your average high school
> > student is unlikely to be able to crack in an hour.   I don't run
> > a bank, or a military installation, and I'm not the NSA.   If someone
> > is prepared to put in the effort required to break into my systems,
> > then let them, it isn't worth the cost to prevent that tiny chance.
> > That's the same way that my house has ordinary locks - I'm sure they
> > can be picked by someone who knows what they're doing, and better
> > security is available, at a price, but a nice happy medium is what
> > fits me best.
> 
> FWIW, I used to work for a company whose marketing motto was
> 
>   Good enough isn't!
> 
> But I definitely agree with you - what we used to have is "good
> enough" for the vast bulk of our users and potential users.
> 
> Perhaps sysinst(8) should ask
> 
>   Do you need a hyper-secure system?
> 
> If yes, then leave things as they are today.  But if you answer no,
> we should automatically copy enough pseudo-entropy bits to /dev/rnd
> to prevent future blocking.

For most architectures, sysinst does do exactly that. It assumes that
you don't just reset or reboot, but properly shut down the system.

Joerg


Re: ACPI related performance trouble

2021-02-25 Thread Joerg Sonnenberger
On Thu, Feb 25, 2021 at 11:46:21AM +0100, Emmanuel Dreyfus wrote:
> I just got two identical machines, let us call them glutamine and leucine. I
> run ffmpeg4 to transcode H264 video to webm, and leucine is about 12 times
> faster than glutamine.

Have you compared the machdep sysctl?

Joerg


Re: cmake hangs on kqueue

2021-01-12 Thread Joerg Sonnenberger
On Tue, Jan 12, 2021 at 01:20:39PM +0100, Martin Husemann wrote:
> On Tue, Jan 12, 2021 at 01:11:00PM +0100, Manuel Bouyer wrote:
> > I think I've seen some mails about a similar problem in the past few months
> > but I don't remember the details (and couldn't find a PR about it either).
> 
> That was supposed to be fixed by ticket #907, which got pulled up on
> May 13 2020.

The cmake hangs are due to lost entries in the condvar waiter list. I
can usually trigger it within a few hours. Extending the waiter array
helped reduce the chance of triggering it, but it's not a fix.

Joerg


Re: Temporary memory allocation from interrupt context

2020-11-11 Thread Joerg Sonnenberger
On Wed, Nov 11, 2020 at 10:44:45AM +0100, Martin Husemann wrote:
> Consider the following pseudo-code running in softint context:

Why do those items not have a link element inside, so that no additional
memory allocation is necessary?
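
I.e. embed the linkage in the item itself, e.g. with <sys/queue.h>, so
queueing needs no allocation at all (the item type here is made up):

	struct item {
		SIMPLEQ_ENTRY(item)	i_link;		/* embedded linkage */
		/* ... payload ... */
	};

	SIMPLEQ_HEAD(, item) pending = SIMPLEQ_HEAD_INITIALIZER(pending);

	/* softint context: queue an existing item, no allocation needed */
	void
	enqueue(struct item *it)
	{
		SIMPLEQ_INSERT_TAIL(&pending, it, i_link);
	}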

Joerg


Re: CVS commit: src/external/gpl3/gcc/dist/gcc/config/aarch64

2020-10-20 Thread Joerg Sonnenberger
On Wed, Oct 21, 2020 at 08:58:36AM +0900, Rin Okuyama wrote:
> I'm also one who feels hesitate to import Linux'ism into our basic
> components. However, for this problem in particular, I still think
> it is not a good choice to keep NetBSD support in driver-aarch64.c:
> 
> (a) Our sysctl(3)-based interface is not compliant to any standards,
> just like Linux's /proc/cpuinfo. But the latter is, unfortunately
> for us, the de facto standard.

It works properly in a chroot etc without needing new files. I would
call that a big plus.

Joerg


Re: style change: explicitly permit braces for single statements

2020-07-13 Thread Joerg Sonnenberger
On Mon, Jul 13, 2020 at 09:18:18AM -0400, Ted Lemon wrote:
> On Jul 13, 2020, at 9:13 AM, Mouse  wrote:
> > .  I find the braces pure visual clutter in the latter.
> 
> What really bugs me is when my code winds up with a security fail because I 
> wasn’t careful.

If only we had compilers that could warn us if the indentation doesn't
match the semantics of the language. Oh wait.
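
For example, -Wmisleading-indentation flags exactly the pattern the
braces are meant to protect against:

	if (error)
		cleanup();
		return error;	/* always executes, despite the indentation */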

Joerg


Re: stat(2) performance

2020-06-15 Thread Joerg Sonnenberger
On Mon, Jun 15, 2020 at 09:31:48PM +0100, Robert Swindells wrote:
> 
> Doing a 'cvs update' feels really slow
> 
> Running it under ktrace(1) shows it doing a stat(2) for every metadata
> file in the tree. The machine sounds like it is hitting the disk for
> every one. Is there any kind of cache for the attribute information
> that stat needs ?
> 
> This is on a ffs filesystem mounted log,noatime.

Raise kern.maxvnodes?

Joerg


Re: TSC improvement

2020-06-09 Thread Joerg Sonnenberger
On Tue, Jun 09, 2020 at 05:16:27PM +, Taylor R Campbell wrote:
> It's great to see improvements to our calibration of the TSC (and I
> tend to agree that cpu_counter should be serializing, so that, e.g.,
> cpu_counter(); ...; cpu_counter() reliably measures time taken in the
> ellipsis).

I'm pretty sure we want to have both variants.
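
I.e. both a cheap read and an ordered read have their uses; an x86-64
sketch, where the function names are made up and lfence is one common
way to keep the read behind earlier instructions:

	#include <stdint.h>

	static inline uint64_t
	rdtsc_fast(void)		/* may be reordered by the CPU */
	{
		uint32_t lo, hi;
		__asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	static inline uint64_t
	rdtsc_ordered(void)		/* lfence orders the read after earlier loads */
	{
		uint32_t lo, hi;
		__asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}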

Joerg


Re: Avoid UB in pslist.h (NULL + 0)

2020-03-21 Thread Joerg Sonnenberger
On Sun, Mar 22, 2020 at 12:50:16AM +, Taylor R Campbell wrote:
> (b) Change how we invoke ubsan and the compiler by passing
> -fno-delete-null-pointer-checks to clang.  joerg objected to this
> but I don't recall the details off the top of my head; joerg, can
> you expand on your argument against this, and which alternative
> you would prefer?

I objected to using it *in general*. The committed change adds it
*always*, not just when using sanitizers.

Joerg


Re: NULL pointer arithmetic issues

2020-03-09 Thread Joerg Sonnenberger
On Mon, Mar 09, 2020 at 09:50:50AM -0400, Aaron Ballman wrote:
> On Sun, Mar 8, 2020 at 2:30 PM Taylor R Campbell
> > I ask because in principle a conformant implementation could compile
> > the NetBSD kernel into a useless blob that does nothing -- we rely on
> > all sorts of behaviour relative to a real physical machine that is not
> > defined by the letter of the standard, like inline asm, or converting
> > integers from the VM system's virtual address allocator into pointers
> > to objects.  But such an implementation would not be useful.
> 
> Whether an optimizer elects to use this form of UB to make
> optimization decisions is a matter of QoI. My personal feeling is that
> I don't trust this sort of optimization -- it takes code the
> programmer wrote and makes it behave in a fundamentally different
> manner. I'm in support of UBSan diagnosing these constructs because it
> is UB and an optimizer is definitely allowed to optimize based on it
> but I wouldn't be in support of an optimizer that aggressively
> optimizes on this.

I consider it something even worse. Just like the case of passing
NULL pointers to memcpy and friends with zero as size, this
interpretation / restriction in the standard is actively harmful to some
code for the sake of potential optimisation opportunities in other code.
It seems to be a poor choice at that. I.e. it requires adding
conditional branches for something that behaves sanely everywhere but
maybe the DS9k.
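
I.e. strictly conforming code ends up with guards like this even though
every real implementation handles the zero-length case just fine:

	/* The branch exists only to dodge the UB rule for memcpy(NULL, NULL, 0). */
	if (n != 0)
		memcpy(dst, src, n);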

Joerg


Re: NULL pointer arithmetic issues

2020-03-08 Thread Joerg Sonnenberger
On Sun, Mar 08, 2020 at 03:33:57PM +0100, Kamil Rytarowski wrote:
> There was also a request to make a proof that memcpy(NULL,NULL,0) is UB
> and can be miscompiled.
> 
> Here is a reproducer:
> 
> http://netbsd.org/~kamil/memcpy-ub.c
> 
> 131 kamil@rugged /tmp $ gcc -O0 memcpy.c
> 
> 132 kamil@rugged /tmp $ ./a.out
> 
> 1
> 
> 133 kamil@rugged /tmp $ gcc -O2 memcpy.c
> 134 kamil@rugged /tmp $ ./a.out
> 0
> 
> A fallback for freestanding environment is to use
> -fno-delete-null-pointer-check.

The correct fix is not to disable the null-pointer-check option but to
remove the broken automatic non-null arguments in GCC.

Joerg


Re: NULL pointer arithmetic issues

2020-03-08 Thread Joerg Sonnenberger
On Sun, Mar 08, 2020 at 03:30:02PM +0100, Kamil Rytarowski wrote:
> NULL+x is now miscompiled by Clang/LLVM after this commit:
> 
> https://reviews.llvm.org/rL369789
> 
> This broke various programs like:
> 
> "Performing base + offset pointer arithmetic is only allowed when base
> itself is not nullptr. In other words, the compiler is assumed to allow
> that base + offset is always non-null, which an upcoming compiler
> release will do in this case. The result is that CommandStream.cpp,
> which calls this in a loop until the result is nullptr, will never
> terminate (until it runs junk data and crashes)."

As you said, using a non-zero offset. No one here argued that using
non-zero offsets is or should be valid, since that would obviously create
a pointer outside the zero-sized object.

Joerg


Re: NULL pointer arithmetic issues

2020-02-24 Thread Joerg Sonnenberger
On Mon, Feb 24, 2020 at 11:42:01AM +0100, Kamil Rytarowski wrote:
> Forbidding NULL pointer arithmetic is not just for C purists trolls. It
> is now in C++ mainstream and already in C2x draft.

This is not true. NULL pointer arithmetic and nullptr arithmetic are
*very* different things. Do not conflate them.

Joerg


Re: NULL pointer arithmetic issues

2020-02-22 Thread Joerg Sonnenberger
On Sat, Feb 22, 2020 at 05:25:42PM +0100, Kamil Rytarowski wrote:
> When running the ATF tests under MKLIBCSANITIZER [1], there are many
> NULL pointer arithmetic issues .

Which flags are the sanitizers using? Because I wouldn't be surprised if
they just hit _PSLIST_VALIDATE_PTRS and friends.

Joerg


Re: Stripping down 8.0

2020-01-30 Thread Joerg Sonnenberger
On Wed, Jan 29, 2020 at 04:20:45PM -0500, Mouse wrote:
> > I'd say, just drop the acpiwmi driver, you don't include any driver
> > for acpiwmibus, so the acpiwmi driver is useless to you.
> 
> I'm not sure whether the right fix is to drop acpiwmi or add back
> acpiec or acpiecdt, so...what _is_ acpiwmi?

Besides WMI not being necessary, I would be somewhat careful with
dropping acpiec. Depending on the actual system, the ACPI DSDT tables
can reference the EC indirectly and use it for critical infrastructure.
On mobile systems the battery control is often implemented on the EC;
other examples are parts of the thermal management, etc.

Joerg


Re: Proposal: Remove filemon(4); switch make meta to ktrace

2020-01-13 Thread Joerg Sonnenberger
On Mon, Jan 13, 2020 at 06:09:20AM -0500, Mouse wrote:
> > - What instead?  The attached patch (patch set make-meta-v2.patch;
> >   combined diff make-meta-v2.diff) replaces make's use of
> >   /dev/filemon by ktrace, in meta mode.
> 
> How does this interact with someone ktracing a make run?  If it that
> breaks make, I think this is a very bad idea; if it breaks the user's
> ktrace, I think it is a moderately bad idea.  (I haven't had to ktrace
> make very often, but when I have there hasn't been much else that would
> be suitable.)

Don't use meta mode in that case. It is not the default.

Joerg


Re: Solving the last piece of the uvm_pageqlock problem

2019-12-24 Thread Joerg Sonnenberger
On Tue, Dec 24, 2019 at 10:08:01PM +, Andrew Doran wrote:
> This is a diff against a tree containing the allocator patch I posted
> previously:
> 
>   http://www.netbsd.org/~ad/2019/pdpol.diff

I wanted to give this a spin before travelling, but it doesn't survive
very long here. I get NULL pointer derefs in
uvmpdpol_pageactivate_locked, coming from uvmpdpool_pageintent_realize.
That's within seconds of the scan phase of a bulk build.

Joerg


Re: ptrace(2) interface for TLSBASE

2019-12-03 Thread Joerg Sonnenberger
On Tue, Dec 03, 2019 at 05:11:49PM +0100, Kamil Rytarowski wrote:
> TLSBASE is stored on a selection of ports in a dedicated mcontext entry,
> out of gpregs.

That's an implementation detail and IMO something we shouldn't leak
outside the arch. Just provide an accessor for l_private. There are then
two possible options:
(1) An MD register is used directly for the TLS base. In that case, the
register should be used and l_private is normally irrelevant.
(2) No MD register is used (directly). In that case l_private is
correct.

Joerg


Re: Some minor improvements to select/poll

2019-11-20 Thread Joerg Sonnenberger
On Wed, Nov 20, 2019 at 09:38:56PM +, Andrew Doran wrote:
> (1) Increase the maximum number of clusters from 32 to 64 for large systems.
> kcpuset_t could potentially be used here but that's an excursion I don't
> want to go on right now.  uint32_t -> uint64_t is very simple.

Careful about alignment here. There is hidden padding on ILP32 ABIs
with 64-bit alignment for uint64_t.
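
For example (exact layout depends on the ABI):

	struct cluster {		/* made-up name for illustration */
		uint32_t	nused;
		uint64_t	mask;	/* ABIs that align uint64_t to 8 bytes
					 * (e.g. ARM EABI) insert 4 bytes of
					 * hidden padding before this member
					 * even on ILP32; i386 aligns it to 4
					 * and has no padding. */
	};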

Joerg


Re: __{read,write}_once

2019-11-11 Thread Joerg Sonnenberger
On Mon, Nov 11, 2019 at 01:15:16PM -0500, Mouse wrote:
> > Uninterruptible means exactly that, there is a clear before and after
> > state and no interrupts can happen in between.
> 
> Is uninterruptible all you care about?  Or does it also need to be
> atomic with respect to other CPUs?  Eventually, of course, you'll want
> all those counter values on a single CPU - does that happen often
> enough to matter?
> 
> Also, does it actually need to be uninterruptible, or does it just need
> to be interrupt-tolerant?  I'm not as clear as I'd like on what the
> original desire here was, so I'm not sure but that we might be trying
> to solve a harder problem than necessary.

The update needs to be uninterruptible on the local CPU in the sense that
context switches and interrupts don't tear the R and W parts of the RMW cycle
apart. x86 ALU instructions with a memory operand as destination fit the
bill fine. It doesn't have to be atomic; no other CPU is supposed to
write to that cache line. It also generally doesn't matter whether they
see the old OR the new value, as long as it is either. We have full
control over alignment.
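
On x86-64 that boils down to a single, non-bus-locked RMW instruction,
e.g. (sketch):

	/* Per-CPU counter update: one instruction, no lock prefix.  Interrupts
	 * and context switches happen either before or after it, never between
	 * the load and the store. */
	static inline void
	percpu_counter_inc(uint64_t *ctr)
	{
		__asm volatile("incq %0" : "+m"(*ctr));
	}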

Joerg


Re: __{read,write}_once

2019-11-11 Thread Joerg Sonnenberger
On Mon, Nov 11, 2019 at 11:51:26AM -0500, Mouse wrote:
> >>> (2) Use uninterruptible memory operations in per CPU memory,
> >>> aggregate passively on demand.
> > Problem is that (2) is natively only present on CISC platforms in
> > general.  Most RISC platforms can't do RMW in one instruction.
> 
> (2) says "uninterruptible", not "one instruction", though I'm not sure
> how large the difference is in practice.  (Also, some CISC platforms
> provide atomic memory RMW operations only under annoying restrictions;
> for example, I think the VAX has only three sorts of RMW memory
> accesses that are atomic with respect to other processors: ADAWI
> (16-bit-aligned 16-bit add), BB{SS,CC}I (test-and-{set/clear} single
> bits), and {INS,REM}Q{H,T}I (queue insert/remove at head/tail).)

The point here is that we really don't want to have bus-locked
instructions for per-CPU counters. It would defeat the point of using
per-CPU counters in the first place to a large degree. Uninterruptible means
exactly that: there is a clear before and after state and no interrupts
can happen in between. I don't know how expensive masking interrupts is
on ARM and friends; it is quite expensive on x86. RMW
instructions are the other simple option for implementing them. (3)
would be the high effort version, doing RAS for kernel code as well.
Joerg


Re: __{read,write}_once

2019-11-11 Thread Joerg Sonnenberger
On Mon, Nov 11, 2019 at 02:39:26PM +0100, Maxime Villard wrote:
> Le 11/11/2019 à 13:51, Joerg Sonnenberger a écrit :
> > On Mon, Nov 11, 2019 at 11:02:47AM +0100, Maxime Villard wrote:
> > > Typically in sys/uvm/uvm_fault.c there are several lockless stat 
> > > increments
> > > like the following:
> > > 
> > >   /* Increment the counters.*/
> > >   uvmexp.fltanget++;
> > 
> > Wasn't the general consensus here to ideally have per-cpu counters here
> > that are aggregated passively?
> 
> In this specific case it's not going to work, because uvmexp is supposed to
> be read from a core dump, and the stats are expected to be found in the
> symbol...

That's just a reason to add code for doing the aggregation, not a reason
to avoid it.

> > I can think of three different options depending the platform:
> > 
> > (1) Use full atomic operations. Fine for true UP platforms and when the
> > overhead of per-CPU precise accounting is too high.
> > 
> > (2) Use uninterruptible memory operations in per CPU memory, aggregate
> > passively on demand.
> > 
> > (3) Emulate uninterruptible memory operations with kernel RAS, aggregate
> > passively on demand.
> 
> Generally I would do (2), but in this specific case I suggest (1).
> atomic_inc_uint is supposed to be implemented on each platform already, so
> it's easy to do.
> 
> (3) is a big headache for a very small reason, and it's not going to
> prevent inter-CPU races.

The problem is that (2) is natively only present on CISC platforms in
general; most RISC platforms can't do RMW in one instruction. Inter-CPU
races don't matter as long as only the counter of the current CPU is
modified. The main difference between (1) and (2) is the bus lock. (1)
versus (3) is primarily a question of whether the atomic operation can be
inlined; if it can't, (3) would still be much nicer.

Joerg


Re: __{read,write}_once

2019-11-11 Thread Joerg Sonnenberger
On Mon, Nov 11, 2019 at 11:02:47AM +0100, Maxime Villard wrote:
> Typically in sys/uvm/uvm_fault.c there are several lockless stat increments
> like the following:
> 
>   /* Increment the counters.*/
>   uvmexp.fltanget++;

Wasn't the general consensus here to ideally have per-CPU counters
that are aggregated passively? I can think of three different options
depending on the platform:

(1) Use full atomic operations. Fine for true UP platforms and when the
overhead of per-CPU precise accounting is too high.

(2) Use uninterruptible memory operations in per CPU memory, aggregate
passively on demand.

(3) Emulate uninterruptible memory operations with kernel RAS, aggregate
passively on demand.

Essentially, the only race condition we care about for the statistic
counters is via interrupts or scheduling. We can implement the
equivalent of x86's add with memory operand as destination using RAS, so
the only overhead would be the function call in that case.

Joerg


Re: alloca() in kernel code

2019-10-12 Thread Joerg Sonnenberger
On Sun, Oct 13, 2019 at 12:46:24AM +0200, Johnny Billquist wrote:
> On 2019-10-12 20:47, Joerg Sonnenberger wrote:
> > On Sat, Oct 12, 2019 at 08:13:25PM +0200, Johnny Billquist wrote:
> > > On 2019-10-12 19:01, Emmanuel Dreyfus wrote:
> > > > Mouse  wrote:
> > > > 
> > > > > I'm presumably missing something here, but what?
> > > > 
> > > > I suspect Maxime's concern is about uncontrolled stack-based variable
> > > > buffer, which could be used to crash the kernel.
> > > > 
> > > > But in my case, the data is coming from the bootloader. I cannot think
> > > > about a scenario where it makes sense to defend against an attack from
> > > > the bootloader. The kernel already has absolute trust in the bootloader.
> > > 
> > > On this one, I agree with Maxime.
> > > 
> > > Even if it comes from the bootloader, why would you want to use alloca()?
> > 
> > Because as Emmanuel wrote initially, dynamic allocations might not be
> > possible yet.
> 
> But if you use alloca(), you will have to check what size you'd like to
> allocate, and not allocate more than some maximum amount, I would assume. Or
> do you really think that it is ok to just let it try no matter what amount
> is decided you want to allocate?
> 
> And if you figure out an upper limit, then you might as well just define an
> array of that size in the function, and be done with it.

All nice and good, but it doesn't help with the original problem: how to
deal with dynamically sized data when there is no dynamic allocator.
Without context, it is impossible to know if "dynamic size" means a
reasonably sized string, a list of memory segments, etc. As such, it is
impossible to say if alloca (or just defining a fixed size array or
whatever) is reasonable or not.

Joerg


Re: alloca() in kernel code

2019-10-12 Thread Joerg Sonnenberger
On Sat, Oct 12, 2019 at 02:01:16AM +0200, Emmanuel Dreyfus wrote:
> I recently encountered a situation where I had to deal with variable
> length structure at a time where kernel dynamic allocator was not
> initialized. 

You can borrow pages directly if the data is potentially larger.

Joerg


Re: alloca() in kernel code

2019-10-12 Thread Joerg Sonnenberger
On Sat, Oct 12, 2019 at 08:13:25PM +0200, Johnny Billquist wrote:
> On 2019-10-12 19:01, Emmanuel Dreyfus wrote:
> > Mouse  wrote:
> > 
> > > I'm presumably missing something here, but what?
> > 
> > I suspect Maxime's concern is about uncontrolled stack-based variable
> > buffer, which could be used to crash the kernel.
> > 
> > But in my case, the data is coming from the bootloader. I cannot think
> > about a scenario where it makes sense to defend against an attack from
> > the bootloader. The kernel already has absolute trust in the bootloader.
> 
> On this one, I agree with Maxime.
> 
> Even if it comes from the bootloader, why would you want to use alloca()?

Because as Emmanuel wrote initially, dynamic allocations might not be
possible yet.

Joerg


Re: build.sh sets with xz (was Re: vfs cache timing changes)

2019-09-13 Thread Joerg Sonnenberger
On Fri, Sep 13, 2019 at 07:06:59AM +0200, Martin Husemann wrote:
> I am not sure whether the xz compiled in tools supports the "-T threads"
> option, but if it does, we can add "-T 0" to the default args and see how
> much that improves things. Jörg, do you know this?

It doesn't currently, since it is somewhat of a PITA to deal with
pthread support portably.

Joerg


Re: /dev/random is hot garbage

2019-07-22 Thread Joerg Sonnenberger
On Mon, Jul 22, 2019 at 04:36:41PM +, paul.kon...@dell.com wrote:
> 
> 
> > On Jul 22, 2019, at 10:52 AM, Joerg Sonnenberger  wrote:
> > 
> > On Sun, Jul 21, 2019 at 09:13:48PM +, paul.kon...@dell.com wrote:
> >> 
> >> 
> >>> On Jul 21, 2019, at 5:03 PM, Joerg Sonnenberger  wrote:
> >>> 
> >>> On Sun, Jul 21, 2019 at 08:50:30PM +, paul.kon...@dell.com wrote:
> >>>> /dev/urandom is equivalent to /dev/random if there is adequate entropy,
> >>>> but it will also deliver random numbers not suitable for cryptography 
> >>>> before that time.
> >>> 
> >>> This is somewhat misleading. The problem is that with an unknown entropy
> >>> state, the system cannot ensure that an attacker couldn't predict the
> >>> seed used for the /dev/urandom stream. That doesn't mean that the stream
> >>> itself is bad. It will still pass any statistical test etc.
> >> 
> >> That's exactly my point.  If you're interested in a statistically high
> >> quality pseudo-random bit stream, /dev/urandom is a great source.  But
> >> if you need a cryptographically strong random number, then you can't
> >> safely proceed with an unknown entropy state for the reason you stated,
> >> which translates into "you must use /dev/random".
> > 
> > That distinction makes no sense at all to me. /dev/urandom is *always* a
> > cryptographically strong RNG. The only difference here is that without
> > enough entropy during initialisation of the stream, you can brute force
> > the entropy state and see if you get a matching output stream based on
> > that seed.
> 
> I use a different definition of "cryptographically strong".  A bit string
> that's guessable is never, by any useful definition, "cryptographically
> strong" no matter what the properties of the string extender are.  The
> only useful definition for the term I can see is as a synonym for
> "suitable for security critical value in cryptographic algorithms".
> An unseeded /dev/urandom output is not such a value.

Again, that's not really a sensible definition. It's always possible to
guess the seed used by the /dev/urandom CPRNG. By definition. That
doesn't change the core properties though: there is no sensible way to
predict the output of the CPRNG without knowing the initial seed and offset.
There is no known correlation between variations of the seed. As in: the
only thing partial knowledge of the seed gives you is reducing the
probability of guessing the right seed. It's a similar situation to why
the concept of entropy exhaustion doesn't really make sense.

Joerg


Re: /dev/random is hot garbage

2019-07-22 Thread Joerg Sonnenberger
On Sun, Jul 21, 2019 at 09:13:48PM +, paul.kon...@dell.com wrote:
> 
> 
> > On Jul 21, 2019, at 5:03 PM, Joerg Sonnenberger  wrote:
> > 
> > On Sun, Jul 21, 2019 at 08:50:30PM +, paul.kon...@dell.com wrote:
> >> /dev/urandom is equivalent to /dev/random if there is adequate entropy,
> >> but it will also deliver random numbers not suitable for cryptography 
> >> before that time.
> > 
> > This is somewhat misleading. The problem is that with an unknown entropy
> > state, the system cannot ensure that an attacker couldn't predict the
> > seed used for the /dev/urandom stream. That doesn't mean that the stream
> > itself is bad. It will still pass any statistical test etc.
> 
> That's exactly my point.  If you're interested in a statistically high
> quality pseudo-random bit stream, /dev/urandom is a great source.  But
> if you need a cryptographically strong random number, then you can't
> safely proceed with an unknown entropy state for the reason you stated,
> which translates into "you must use /dev/random".

That distinction makes no sense at all to me. /dev/urandom is *always* a
cryptographically strong RNG. The only difference here is that without
enough entropy during initialisation of the stream, you can brute force
the entropy state and see if you get a matching output stream based on
that seed.

Joerg


Re: /dev/random is hot garbage

2019-07-21 Thread Joerg Sonnenberger
On Sun, Jul 21, 2019 at 08:50:30PM +, paul.kon...@dell.com wrote:
> /dev/urandom is equivalent to /dev/random if there is adequate entropy,
> but it will also deliver random numbers not suitable for cryptography before 
> that time.

This is somewhat misleading. The problem is that with an unknown entropy
state, the system cannot ensure that an attacker couldn't predict the
seed used for the /dev/urandom stream. That doesn't mean that the stream
itself is bad. It will still pass any statistical test etc.

Note that with the option of seeding the CPRNG at boot time, a lot of
the distinction is actually moot.

Joerg


Re: /dev/random is hot garbage

2019-07-21 Thread Joerg Sonnenberger
On Sun, Jul 21, 2019 at 07:20:08PM +, Taylor R Campbell wrote:
> This is _locally_ sensible for a library that may have many users
> beyond a compiler.

No, it can be sensible behavior to allow *optionally* checking. But it
is certainly not sensible default behavior for a library.

Joerg


Re: re-enabling debugging of 32 bit processes with 64 bit debugger

2019-06-30 Thread Joerg Sonnenberger
On Sat, Jun 29, 2019 at 08:03:59PM -, Christos Zoulas wrote:
> In article 
> ,
> Andrew Cagney   wrote:
> >
> >Having 32-bit and 64-bit debuggers isn't sufficient.  Specifically, it
> >can't handle an exec() call where the new executable has a different
> >ISA; and this imnsho is a must have.
> 
> It is really hard to make a 32 bit debugger work on a 64 bit system
> because of the tricks we play with the location of the shared
> libraries in rtld and the debugger needs to be aware of them.

I don't buy that at all. Exposing the translation to a debugger is
trivial and in most cases, it doesn't care about the lookup mechanism at
all since it just asks the dynamic linker for the path names of the
libraries anyway.

Joerg


Re: Enable functionality by default

2019-04-16 Thread Joerg Sonnenberger
On Tue, Apr 16, 2019 at 04:55:41PM +0100, Sevan Janiyan wrote:
> Not all of our network drivers support altq

altq has some significant impact on the network layer and I would expect
just enabling it to have a measurable impact on netio.

Joerg


Re: RFC: New userspace fetch/store API

2019-02-26 Thread Joerg Sonnenberger
On Tue, Feb 26, 2019 at 01:59:42PM +0100, Rhialto wrote:
> On Mon 25 Feb 2019 at 18:10:20 +, Eduardo Horvath wrote:
> > I'd do something like:
> > 
> > uint64_t ufetch_64(const uint64_t *uaddr, int *errp);
> > 
> > where *errp needs to be initialized to zero and is set on fault so you can 
> > do:
> > 
> > int err = 0;
> > long hisflags = ufetch_64(flag1p, ) | ufetch_64(flag2p, );
> > 
> > if (err) return EFAULT;
> > 
> > do_something(hisflags);
> 
> I like this, because it swaps the cost of the value that is always
> needed (which was expensive) versus the one that isn't expected often
> (the error case, was cheap).

Huh? The code always has to access err to work correctly. You don't save
anything.

Joerg


Re: Support for "pshared" POSIX semaphores

2019-02-04 Thread Joerg Sonnenberger
On Sat, Feb 02, 2019 at 02:11:47PM -0800, Jason Thorpe wrote:
> Ok, updated patch:
> 
>   
> https://patch-diff.githubusercontent.com/raw/thorpej/netbsd-src/pull/5.diff 
> 
> 
> I went an used a simple hash table with 32 buckets.  Seems good enough for 
> now.  Since the “pshared” IDs are random anyway, I didn’t bother with any 
> exotic hash function — just extract some of the random bits to use as the 
> bucket index.

This seems to allow attaching to random semaphores by just knowing the
right ID. This violates the definition of anonymous semaphores and I
wouldn't be surprised if it creates some nasty security issues...

Joerg


Re: RFC: vioif(4) multiqueue support

2018-12-25 Thread Joerg Sonnenberger
On Wed, Dec 26, 2018 at 02:37:02AM +, Taylor R Campbell wrote:
> > +static int
> > +vioif_alloc_queues(struct vioif_softc *sc)
> > +{
> > +   int nvq_pairs = sc->sc_max_nvq_pairs;
> > +   int nvqs = nvq_pairs * 2;
> > +   int i;
> > +
> > +   sc->sc_rxq = kmem_zalloc(sizeof(sc->sc_rxq[0]) * nvq_pairs,
> > +   KM_NOSLEEP);
> > +   if (sc->sc_rxq == NULL)
> > +   return -1;
> > +
> > +   sc->sc_txq = kmem_zalloc(sizeof(sc->sc_txq[0]) * nvq_pairs,
> > +   KM_NOSLEEP);
> > +   if (sc->sc_txq == NULL)
> > +   return -1;
> 
> Check to avoid arithmetic overflow here:
> 
>   if (nvq_pairs > INT_MAX/2 - 1 ||
>   nvq_pairs > SIZE_MAX/sizeof(sc->sc_rxq[0]))
>   return -1;
>   nvqs = nvq_pairs * 2;
>   if (...) nvqs++;
>   sc->sc_rxq = kmem_zalloc(sizeof(sc->sc_rxq[0]) * nvq_pairs, ...);
> 
> Same in all the other allocations like this.  (We should have a
> kmem_allocarray -- I have a draft somewhere.)

The limit should just be sanely enforced much, much earlier. The code
in vioif_attach already does that, so why bother.

> > diff --git a/sys/dev/pci/virtio_pci.c b/sys/dev/pci/virtio_pci.c
> > index 65c5222b774..bb972997be2 100644
> > --- a/sys/dev/pci/virtio_pci.c
> > +++ b/sys/dev/pci/virtio_pci.c
> > @@ -604,8 +677,14 @@ virtio_pci_setup_interrupts(struct virtio_softc *sc)
> > [...]
> > if (pci_intr_type(pc, psc->sc_ihp[0]) == PCI_INTR_TYPE_MSIX) {
> > -   psc->sc_ihs = kmem_alloc(sizeof(*psc->sc_ihs) * 2,
> > +   psc->sc_ihs = kmem_alloc(sizeof(*psc->sc_ihs) * nmsix,
> > KM_SLEEP);
> 
> Check for arithmetic overflow here.

No point here either.
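
(For reference, an overflow-checked array allocation of the kind Taylor
mentions could look roughly like the sketch below; this is a guess at the
shape, not the actual kmem_allocarray draft.)

/*
 * Hypothetical kmem_allocarray() sketch: allocate n objects of "size"
 * bytes each, failing cleanly instead of overflowing the multiplication.
 */
void *
kmem_allocarray(size_t n, size_t size, km_flag_t kmflags)
{
	if (n == 0 || size == 0 || n > SIZE_MAX / size)
		return NULL;		/* empty or overflowing request */
	return kmem_zalloc(n * size, kmflags);
}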

Joerg


Re: Importing libraries for the kernel

2018-12-14 Thread Joerg Sonnenberger
On Fri, Dec 14, 2018 at 01:00:25PM -0500, Mouse wrote:
> >>> [...] I have serious concerns for doing asymmetric cryptography in
> >>> the kernel [...]
> >> Can you clarify the concerns?
> > Asymmetrical cryptography is slow and complex.  [...]  The
> > implementation is non-trivial [...]
> 
> Didn't that ship sail long ago?  I recall seeing people talking about
> putting entire languages into the kernel, in some cases even including
> jitters.  Much as I dislike this, I find that far more "no way in hell
> is that going into _my_ machines' kernels!".

Few of these things require 10k+ cycle operations in one go.

> I also disagree that asymmetric crypto is necessarily all that complex.
> Some asymmetric crypto algorithms require nothing more complex than
> large-number arithmetic.  (Slow, yes, but not particularly complex.)

Correct and fast implementations of large number arithmetic are
complex, esp. if you also want to avoid the typical set of timing leaks.
This applies to operation sets used by RSA as well as those used by ECC.
Different classes of operations, but a mine field to get right.

Joerg


Re: Support for tv_sec=-1 (one second before the epoch) timestamps?

2018-12-14 Thread Joerg Sonnenberger
On Thu, Dec 13, 2018 at 02:37:06AM +0100, Kamil Rytarowski wrote:
> In real life it's often needed to store time_t pointing before the UNIX
> epoch.

Again, I quite disagree and believe that you are confusing two different
things. It makes perfect sense in certain applications to store time as
relative to the UNIX epoch. But that's not the same as time_t, which is a
specific type for a *system* interface. I strongly question the
sensibility of trying to put dates before 1970 in the context of time_t.

Joerg


Re: Importing libraries for the kernel

2018-12-14 Thread Joerg Sonnenberger
On Thu, Dec 13, 2018 at 11:07:23PM +0900, Ryota Ozaki wrote:
> On Thu, Dec 13, 2018 at 6:30 AM Joerg Sonnenberger  wrote:
> >
> > On Thu, Dec 13, 2018 at 12:58:21AM +0900, Ryota Ozaki wrote:
> > > Before that, I want to ask about how to import cryptography
> > > libraries needed tor the implementation.  The libraries are
> > > libb2[1] and libsodium[2]: the former is for blake2s and
> > > the latter is for curve25519 and [x]chacha20-poly1305.
> >
> > I don't really have a problem with Blake2s, but I have serious concerns
> > for doing asymmetric cryptography in the kernel. In fact, it is one of
> > the IMHO very questionable design decisions behind WireGuard and
> > something I don't want to see repeated in NetBSD.
> 
> Can you clarify the concerns?

Asymmetrical cryptography is slow and complex. On many architectures,
the kernel will only be able to use slower non-SIMD implementations. ECC
still easily requires 10k cycles per operation. The implementation is
non-trivial in terms of code size and historically riddled with tiny
tricky issues ranging from corner cases in the math to timing. I haven't
yet heard a really good argument why the key exchange must be part of
the kernel beyond the inability of the Linux community to coordinate
different projects.

Joerg


Re: Support for tv_sec=-1 (one second before the epoch) timestamps?

2018-12-12 Thread Joerg Sonnenberger
On Wed, Dec 12, 2018 at 08:46:33PM +0100, Michał Górny wrote:
> While researching libc++ test failures, I've discovered that NetBSD
> suffers from the same issue as FreeBSD -- that is, both the userspace
> tooling and the kernel have problems with (time_t)-1 timestamp,
> i.e. one second before the epoch.

I see no reason why that should be valid or, more generally, why any
negative value of time_t is required to be valid.

Joerg


Re: Importing libraries for the kernel

2018-12-12 Thread Joerg Sonnenberger
On Thu, Dec 13, 2018 at 12:58:21AM +0900, Ryota Ozaki wrote:
> Before that, I want to ask about how to import cryptography
> libraries needed tor the implementation.  The libraries are
> libb2[1] and libsodium[2]: the former is for blake2s and
> the latter is for curve25519 and [x]chacha20-poly1305.

I don't really have a problem with Blake2s, but I have serious concerns
for doing asymmetric cryptography in the kernel. In fact, it is one of
the IMHO very questionable design decisions behind WireGuard and
something I don't want to see repeated in NetBSD.

Joerg


Re: Too many PMC implementations

2018-08-17 Thread Joerg Sonnenberger
On Fri, Aug 17, 2018 at 04:20:30PM +0200, Maxime Villard wrote:
> Le 10/08/2018 à 11:40, Maxime Villard a écrit :
> > I saw the thread [Re: Sample based profiling] on tech-userlevel@, I'm not
> > subscribed to this list but I'm answering here because it's related to
> > tprof among other things.
> > 
> > I agree that it would be better to retire gprof in base, because there are
> > more powerful tools now, and also advanced hardware support (PMC, PEBS,
> > ProcessorTrace).
> > 
> > But in particular, it would be nice to retire the "kernel gprof". That is,
> > the MD/MI pieces that are surrounded by #ifdef GPROF. This kind of
> > profiling is weak, and misses many aspects of execution (branch prediction,
> > cache misses, heavy instructions, etc) that are offered by tprof.
> > 
> > I already dropped NENTRY() from x86, so KGPROF is officially not supported
> > there anymore. I think it has never worked on amd64.
> 
> So no one has any opinion on that? Because in this case I will remove it
> soon. (Talking about the kernel gprof.)

I'm quite reluctant to remove the only sample-based profiler we have
right now, especially since we don't have any infrastructure for
counter-based profilers either, AFAICT.

Joerg


Re: 8.0 performance issue when running build.sh?

2018-08-09 Thread Joerg Sonnenberger
On Fri, Aug 10, 2018 at 12:29:49AM +0200, Joerg Sonnenberger wrote:
> On Thu, Aug 09, 2018 at 08:14:57PM +0200, Jaromír Doleček wrote:
> > 2018-08-09 19:40 GMT+02:00 Thor Lancelot Simon :
> > > On Thu, Aug 09, 2018 at 10:10:07AM +0200, Martin Husemann wrote:
> > >> 100.002054 14.18 kernel_lock
> > >>  47.43 846  6.72 kernel_lockfileassoc_file_delete+20
> > >>  23.73 188  3.36 kernel_lockintr_biglock_wrapper+16
> > >>  16.01 203  2.27 kernel_lockscsipi_adapter_request+63
> > > Actually, I wonder if we could kill off the time spent by fileassoc.  Is
> > > it still used only by veriexec?  We can easily option that out of the 
> > > build
> > > box kernels.
> > 
> > Or even better, make it less heavy?
> > 
> > It's not really intuitive that you could improve filesystem
> > performance by removing this obscure component.
> 
> If it is not in use, fileassoc_file_delete already takes the short cut.

...and of course, the check seems to be just useless. So yes, it should
be possible to make it much less heavy.

Joerg


Re: 8.0 performance issue when running build.sh?

2018-08-09 Thread Joerg Sonnenberger
On Thu, Aug 09, 2018 at 08:14:57PM +0200, Jaromír Doleček wrote:
> 2018-08-09 19:40 GMT+02:00 Thor Lancelot Simon :
> > On Thu, Aug 09, 2018 at 10:10:07AM +0200, Martin Husemann wrote:
> >> 100.002054 14.18 kernel_lock
> >>  47.43 846  6.72 kernel_lockfileassoc_file_delete+20
> >>  23.73 188  3.36 kernel_lockintr_biglock_wrapper+16
> >>  16.01 203  2.27 kernel_lockscsipi_adapter_request+63
> > Actually, I wonder if we could kill off the time spent by fileassoc.  Is
> > it still used only by veriexec?  We can easily option that out of the build
> > box kernels.
> 
> Or even better, make it less heavy?
> 
> It's not really intuitive that you could improve filesystem
> performance by removing this obscure component.

If it is not in use, fileassoc_file_delete already takes the short cut.

Joerg


Re: hashtables

2018-07-25 Thread Joerg Sonnenberger
On Tue, Jul 24, 2018 at 05:03:56PM -0700, bch wrote:
> Does nbperf(1) suit any need?

I'm wondering the same. It seems to essentially switch between a fixed
set of hash tables.

Joerg


Re: Adding a boot flag for No ASLR

2018-07-24 Thread Joerg Sonnenberger
On Tue, Jul 24, 2018 at 02:08:53PM +0200, Kamil Rytarowski wrote:
> I propose to move the code disabling PaX ASLR from bootloader and kernel
> as proposed in the patch by Siddharth and introduce it directly into the
> sanitizer, We can alter the CheckASLR() routine specific to NetBSD, with
> the following pseudo-code:

You can't disable ASLR at this point. It is too late. paxctl as a hack is
good enough until the proper note processing and fixed VM space layout
are implemented.

Joerg


Re: Adding a boot flag for No ASLR

2018-07-24 Thread Joerg Sonnenberger
On Tue, Jul 24, 2018 at 06:44:52AM +0200, Martin Husemann wrote:
> On Mon, Jul 23, 2018 at 11:02:04PM +0200, Kamil Rytarowski wrote:
> > We need to maintain a function to translate certain ranges to
> > shadow/meta/origin/etc. We cannot map arbitrarily wide ranges to them.
> 
> Can we extend the pax note (or create a new one) and make the sanitizers
> link that in? Then make the kernel reserve some (random) VA spaces
> (details of what is needed in the note) and provide some way for the
> sanitizers to find that special VAs (like from the aux vector)?

Yes, all sanitized binaries should contain a note if they require
certain fixed mappings to be reserved. I don't think there is *any* need
to disable ASLR beyond that.

Joerg


Re: Adding a boot flag for No ASLR

2018-07-23 Thread Joerg Sonnenberger
On Mon, Jul 23, 2018 at 07:13:49PM +0200, Kamil Rytarowski wrote:
> We need to have stack, heap and code of a program in predictable (and
> quite narrow) ranges and thus ASLR disabled or less aggressive.

What for? Nothing in the sanitizer design should require that. The only
requirement should be that the shadow area can be mapped at a fixed
location.

Joerg


Re: Adding a boot flag for No ASLR

2018-07-23 Thread Joerg Sonnenberger
On Mon, Jul 23, 2018 at 06:24:09PM +0530, Siddharth Muralee wrote:
> >
> >
> > (1) An implementation detail of userland shouldn't be leaked into the
> > kernel boot (!) process.
> >
> 
> Okay. I think this makes sense(I am still pretty new to NetBSD) - Can you
> suggest some other location/config that can be used.

paxctl.

> > (2) There is no fundamental issue that makes the sanitizers incompatible
> > with ASLR. The only issue for asan and friends is the reservation of the
> > shadow buffer and that can and should be handled explicitly.
> >
> 
> We have implemented the ATF tests for ASan - The tests work only 50% or
> less when ASLR is on. To get perfect results I think ASLR needs to be off.
> I guess Kamil can provide more info on this.

I'm very aware of the current situation. Ultimately, stack randomisation
has the same issue. The way we set up the VM space of a new process is
suboptimal for a world that wants to randomize things. I.e. at the
moment, the VM commands (epp->ep_vmcmds) are executed in order and that
makes placing fixed-location objects difficult. What should happen is:
(1) Each VM object should grow an object group field. VM objects in the
same group are assigned a random location together. A special group
field value of 0 means no randomisation.
(2) Locations should be assigned first to fixed-position objects and
otherwise in descending order of size.
(3) The stack of the main thread should be reserved and integrated into
the VM object reservation just like the rest. The current stack
randomisation should be removed.

It should be noted that (2) needs to deal with impossible allocations,
so it should do one pass to size up each free range in the address space
that can fit the requested object, pick a random value, and then in a
second iteration find the correct range to split.
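
(A minimal sketch of that two-pass selection; every identifier below
except cprng_fast64() is made up purely for illustration, and alignment
handling is omitted:)

/*
 * Pass 1 sums up how much space is usable for an object of this size,
 * then a random offset into that total is translated back into a
 * concrete free range in pass 2, which gets split there.
 */
static vaddr_t
pick_random_va(struct vmspace *vm, vsize_t len)
{
	struct free_range *fr;
	vsize_t total = 0, off;

	for (fr = first_free_range(vm); fr != NULL; fr = next_free_range(fr))
		if (fr->fr_len >= len)
			total += fr->fr_len - len + 1;
	if (total == 0)
		return 0;	/* impossible allocation, caller must cope */

	off = cprng_fast64() % total;

	for (fr = first_free_range(vm); fr != NULL; fr = next_free_range(fr)) {
		if (fr->fr_len < len)
			continue;
		if (off < fr->fr_len - len + 1)
			return split_free_range(fr, fr->fr_start + off, len);
		off -= fr->fr_len - len + 1;
	}
	return 0;	/* not reached */
}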

Joerg


Re: ./build.sh -k feature request (was Re: GENERIC Kernel Build errors with -fsanitize=undefined option enabled)

2018-06-25 Thread Joerg Sonnenberger
On Mon, Jun 25, 2018 at 08:23:22PM +0200, Reinoud Zandijk wrote:
> On Sun, Jun 24, 2018 at 10:01:42PM +0200, Reinoud Zandijk wrote:
> > On Wed, May 30, 2018 at 07:11:19PM +0800, Harry Pantazis wrote:
> > > Continuing..
> > > 
> > > This first errors are located in
> > > src/sys/external/bsd/drm2/dist/drm/i915/intel_ddi.c and are specific to
> > > the switch statement concerning that the case flags are not reducing
> > > directly to integer constants.
> > 
> > I'd like to request a -k flag to ./build.sh that as with a normal make(1)
> > continues to build as much as possible. This will result in reporting all
> > errors in one go without needing the 1st to be resolved before the 2nd is
> > showing up!
> 
> Attached patch will do, objections against me comitting it? It allows all that
> is buildable to be build and the failing files to be compiled later when
> patched with the -u option.

I don't really like this. I would not be surprised if this broke things
in interesting ways for full world builds if you run into a problem, cvs
up, and try an update build again.

Joerg


Re: Looking for re(4) help

2018-06-22 Thread Joerg Sonnenberger
On Fri, Jun 22, 2018 at 01:38:28PM -0400, mo...@credil.org wrote:
> So far I've drawn a complete blank looking for hardware documentation;

Have you searched just for the 8168 datasheet? At least for
older revisions, you can easily find the register description.

Joerg


Re: Leaking kernel stack data in struct padding

2018-06-13 Thread Joerg Sonnenberger
On Wed, Jun 13, 2018 at 02:16:30PM +0300, Valery Ushakov wrote:
> but I wonder if this scrubbing should be moved into
> timespec_to_timespec50() - after all the most likley use of the compat
> struct is to write or copyout it in the compat code, so the same
> problem probably happens elsewhere.

Yes. It also gives the compiler the chance to eliminate unnecessary
writes.
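
(As a sketch of what that would look like; the struct layout and exact
conversion are assumed from memory, not copied from the tree:)

static inline void
timespec_to_timespec50(const struct timespec *ts, struct timespec50 *ts50)
{
	/* Zero the whole struct once, so padding can never leak. */
	memset(ts50, 0, sizeof(*ts50));
	ts50->tv_sec = (int32_t)ts->tv_sec;
	ts50->tv_nsec = ts->tv_nsec;
}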

Joerg


Re: Potential new syscall

2018-04-03 Thread Joerg Sonnenberger
On Tue, Apr 03, 2018 at 09:08:15AM +0700, Robert Elz wrote:
> Kamil - "just use fork" is a very common response, but no matter how
> fork gets implemented, vfork() when used correctly always performs
> better by huge margins.

But most of those cases are handled just as well by posix_spawn, which
doesn't have any of the thread-safety issues that vfork has.
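
(A minimal userland sketch of that pattern, with error handling kept
short:)

#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

extern char **environ;

int
main(void)
{
	char *argv[] = { "echo", "hello", NULL };
	pid_t pid;
	int status;

	/* Replaces the fork()/vfork() + exec() pair in one call. */
	if (posix_spawnp(&pid, "echo", NULL, NULL, argv, environ) != 0) {
		perror("posix_spawnp");
		return EXIT_FAILURE;
	}
	waitpid(pid, &status, 0);
	return EXIT_SUCCESS;
}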

Joerg


Re: mmap implementation advice needed.

2018-03-30 Thread Joerg Sonnenberger
On Fri, Mar 30, 2018 at 04:22:29PM -0400, Mouse wrote:
> And I (and ragge, I think it was) misspoke.  It doesn't quite require
> 128K of contiguous physical space.  It needs two 64K blocks of
> physically contiguous space, both within the block that maps system
> space.  (Nothing says that P0 PTEs have to be anywhere near P1 PTEs in
> system virtual space, but they do have to be within system space.)

...and the problem to be solved here is that the memory has become
fragmented enough that you can't find 64KB of contiguous pages?
If so, what about having a fixed set of emergency reservations and
copying the non-contiguous pmap content into that during context switch?

Joerg


Re: mmap implementation advice needed.

2018-03-30 Thread Joerg Sonnenberger
On Fri, Mar 30, 2018 at 01:10:37PM -0400, Mouse wrote:
> It takes 4 bytes of PTE to map 512 bytes of VA.  (The VAX uses the
> small, by today's standards, page size of 512 bytes.)  So 2G of
> userland space requires 16M of PTEs.  Those PTEs must be in system
> virtual space.  And that 16M of system virtual space requires 128K of
> PTEs to map, and _those_ PTEs require contiguous physical space.

Let me try to rephrase that:

The first-level page table on the VAX needs up to 128K (for each of the two
ranges?) as contiguous physical space.
The second-level page table needs 16M in some block size, but those don't
all need to be contiguous?
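
(For reference, the arithmetic behind those numbers, with 512-byte pages
and 4-byte PTEs:)

2 GB of user VA / 512 bytes per page   =  4M pages
4M PTEs * 4 bytes per PTE              = 16 MB of user PTEs
16 MB of PTEs / 512 bytes per page     = 32K pages
32K PTEs * 4 bytes per PTE             = 128 KB of system PTEs (2 x 64 KB)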

Joerg


Re: mmap implementation advice needed.

2018-03-30 Thread Joerg Sonnenberger
On Fri, Mar 30, 2018 at 11:33:48AM +0200, Anders Magnusson wrote:
> Notes about vax memory management if someone is wondering:
> - 2 areas (P0 and P1) of size 1G each, P0 grows from bottom, P1 grows from
> top (intended for stack).

AFAICT, VAX uses a max userland address of 2G, so what exactly is the
problem? That you can't allocate enough contiguous memory for the PTEs?

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 07:24:25PM +0200, Maxime Villard wrote:
> Le 25/03/2018 à 17:27, Joerg Sonnenberger a écrit :
> > The other question is whether we can't just use the direct map for this
> > on amd64 and similar platforms?
> 
> no, because nothing guarantees the physical pages are contiguous.

At least for ubc_zerorange, they certainly don't have to be. For
ubc_uiomove, it's a bit more work to split the IO vector, but certainly
feasible as well. Alternatively, propagate it down to uiomove to allow
using *two* IO vectors.

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 03:19:28PM -, Michael van Elst wrote:
> jo...@bec.de (Joerg Sonnenberger) writes:
> 
> >What about having a passive unmap as fourth option? I.e. when unmapping
> >in the transfer map, just add them to a FIFO. Process the FIFO on each
> >CPU when there is time or the transfer map runs out of space. Context
> >switching for example would be a natural place to do any such
> >invalidation.
> 
> The mapping is so temporary that it is almost only used on a single CPU.
> Basically it's:
> 
> win = ubc_alloc();
> if (!uiomove(win, ...))
>   memset(win, 0, ...);
> ubc_release(win, ...);
> 
> For this to be even visible on multiple CPUs, the thread would need
> to migrate to another CPU. Locking down the LWP on a single CPU
> might be the cheapest solution.

Yeah, that's what ephemeral mappings were supposed to be for. The other
question is whether we can't just use the direct map for this on amd64
and similar platforms?

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sat, Mar 24, 2018 at 08:42:34PM +0100, Jaromír Doleček wrote:
> The problem there is that FFS triggers a pathologic case. I/O transfer maps
> and then unmaps each block into kernel pmap, so that the data could be
> copied into user memory. This triggers TLB shootdown IPIs for each FS
> block, sent  to all CPUs which happen to be idle, or otherwise running on
> kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a
> lot of synchronization overhead.

What about having a passive unmap as fourth option? I.e. when unmapping
in the transfer map, just add them to a FIFO. Process the FIFO on each
CPU when there is time or the transfer map runs out of space. Context
switching for example would be a natural place to do any such
invalidation.

Joerg


Re: Let callout_reset return if it reschedules a pending callout

2018-02-28 Thread Joerg Sonnenberger
On Thu, Mar 01, 2018 at 01:58:29AM +0900, Ryota Ozaki wrote:
> On Wed, Feb 28, 2018 at 10:11 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> > On Wed, Feb 28, 2018 at 05:47:13PM +0900, Ryota Ozaki wrote:
> >> The feature is useful when you have a reference to an object that is
> >> passed to a callout. In this case you need to take care of a
> >> reference leak on callout_reset (and other APIs); it silently
> >> reschedules (IOW cancels) a pending callout and leaks a reference.
> >> Unfortunately callout_reset doesn't tell us the reschedule.
> >
> > Can we look at this part first before discussing any API changes?
> > Why can't you store the reference next to callout instance itself?
> 
> Oh, my explanation was quite confusing... I meant that the object
> has a reference counter and the counter is incremented to indicate
> that it is referred by a callout. obj_ref and obj_unref can be
> thought as obj->refcnt++ and obj->refcnt-- respectively.

Just consider the object reference owned by the callout struct. You
change the reference count when the callout is destroyed or if the
callout is reassigned, not because it is executed.
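
(Sketch of that ownership pattern; struct foo, foo_ref(), foo_unref() and
foo_timeout() are assumed, the callout(9) calls are the real interfaces:)

void
foo_callout_attach(struct foo *f)
{
	foo_ref(f);			/* reference owned by the callout */
	callout_init(&f->f_ch, CALLOUT_MPSAFE);
	callout_setfunc(&f->f_ch, foo_timeout, f);
}

void
foo_callout_arm(struct foo *f, int ticks)
{
	/* Rescheduling a pending callout doesn't touch the refcount. */
	callout_schedule(&f->f_ch, ticks);
}

void
foo_callout_detach(struct foo *f)
{
	callout_halt(&f->f_ch, NULL);
	callout_destroy(&f->f_ch);
	foo_unref(f);			/* drop the callout's reference */
}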

Joerg


Re: Let callout_reset return if it reschedules a pending callout

2018-02-28 Thread Joerg Sonnenberger
On Wed, Feb 28, 2018 at 05:47:13PM +0900, Ryota Ozaki wrote:
> The feature is useful when you have a reference to an object that is
> passed to a callout. In this case you need to take care of a
> reference leak on callout_reset (and other APIs); it silently
> reschedules (IOW cancels) a pending callout and leaks a reference.
> Unfortunately callout_reset doesn't tell us the reschedule.

Can we look at this part first before discussing any API changes?
Why can't you store the reference next to callout instance itself?

Joerg


Re: gcc: optimizations, and stack traces

2018-02-11 Thread Joerg Sonnenberger
On Sun, Feb 11, 2018 at 04:13:56PM +0700, Robert Elz wrote:
> Date:Sun, 11 Feb 2018 09:11:45 +0100
> From:Maxime Villard 
> Message-ID:  <2c83e9d9-f49c-479b-7a4c-1df581a2b...@m00nbsd.net>
> 
>   | So we have the same problem, and we need to find a way
>   | to tell GCC to always push the frame at the beginning of the functions.
> 
> Either that or the stack unwind code needs to become smarter - which
> would be a better solution, as it avoids dropping (the admittedly minor)
> benefit  obtained from deferring the frame pointer update (which to be a
> useful solution would need to be universal) and adds a (not insignificant)
> cost to the stack unwind code - but performance there usually does
> not matter.

Again, the logic for that already exists. -fomit-frame-pointer would not
be acceptable otherwise.

Joerg


Re: gcc: optimizations, and stack traces

2018-02-09 Thread Joerg Sonnenberger
On Fri, Feb 09, 2018 at 11:23:17AM +0100, Maxime Villard wrote:
> It implies that if a bug occurs _before_ these two instructions are executed,
> we have a %rbp that points to the _previous_ function, the one we got called
> from. And therefore, GDB does not display the current function (where the bug
> actually happened), but displays its caller.

This analysis is wrong. GDB will first of all look for frame annotation
data, i.e. .eh_frame or the corresponding .debug_frame. Only if it can't
find such annotation will it fall back to guessing from the function
itself. We default to building .eh_frame for all binaries, but I'm not
completely sure if GCC will create async unwind tables by default.

Joerg


Re: Bunch of bugs reported by Ilja van Sprundel

2018-01-29 Thread Joerg Sonnenberger
On Mon, Jan 29, 2018 at 10:16:05PM +0100, Kamil Rytarowski wrote:
> On 29.01.2018 22:01, Joerg Sonnenberger wrote:
> > On Mon, Jan 29, 2018 at 09:58:16PM +0100, Kamil Rytarowski wrote:
> >> Another point is to set a rule that ABI is stable between patch versions
> >> and binary packages (prebuilt software) still works as-is. I'm observing
> >> now users who abandon researching this OS just because a patch version
> >> of kerberos is not compatible with existing packages.
> > 
> > This is wrong. It's not a patch version, but a different minor release.
> > 
> > Joerg
> > 
> 
> Right, and we push broken packages to users:
> 
> lrwxrwxr-x   1 bouyernetbsd 3 Mar 15  2017 7.1 -> 7.0
> 
> http://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/amd64/
> 
> Assuming that we don't have resources / interest to build proper
> packages it might be better to revise this state and improve the
> situation. (My proposal is to abandon minor releases. Abandon CAs in
> favor of frequent patch releases.)

The correct fix is the same as it has been for years: focus the resources
for binary packages on the latest minor release. Killing minor releases
is not going to improve anything except making things go even more stale
by forcing more work for pullup-compatible solutions.

Joerg


Re: Bunch of bugs reported by Ilja van Sprundel

2018-01-29 Thread Joerg Sonnenberger
On Mon, Jan 29, 2018 at 09:58:16PM +0100, Kamil Rytarowski wrote:
> Another point is to set a rule that ABI is stable between patch versions
> and binary packages (prebuilt software) still works as-is. I'm observing
> now users who abandon researching this OS just because a patch version
> of kerberos is not compatible with existing packages.

This is wrong. It's not a patch version, but a different minor release.

Joerg


Re: Spectre

2018-01-18 Thread Joerg Sonnenberger
On Wed, Jan 17, 2018 at 09:38:27PM -0500, Mouse wrote:
> But, on the other hand, I can easily imagine a CPU designer looking at
> it and saying "What's the big deal if this code can read that location?
> It can get it anytime it wants with a simple load instruction anyway.",
> something I have trouble disagreeing with.

Consider something like BPF -- code executed in the kernel with an
enforced security model to prevent "undesirable" acceses. It will create
logic like:

void *p = ...;
if (!is_accessible(p))
  raise_error();
load(p);

Now imagine that the expression for p is intentionally pointing into
userland and depends on the speculative execution of something else.
Loading the pointer speculatively results in a visible side effect that
defeats in part the access check. In short, it can effectively invert
access control checks for verified code.
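
(The classic bounds-check flavour of the same gadget, purely as
illustration; the array names are made up:)

#include <stddef.h>
#include <stdint.h>

extern uint8_t array1[16];
extern size_t array1_size;
extern uint8_t probe[256 * 64];	/* one cache line per possible byte value */

uint8_t
gadget(size_t idx)
{
	/*
	 * Architecturally the check is respected, but a mispredicted
	 * branch can still run both loads speculatively; the second
	 * load leaves a cache footprint that depends on the secret.
	 */
	if (idx < array1_size)
		return probe[array1[idx] * 64];
	return 0;
}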

Joerg


Re: Proposal to obsolete SYS_pipe

2017-12-24 Thread Joerg Sonnenberger
On Sun, Dec 24, 2017 at 10:25:15PM +0100, Kamil Rytarowski wrote:
> It is a special syscall that returns two integers from one function
> call. Fanciness is not compatible with regular C syntax and it demands
> per-cpu assembly wrappers and rump-kernel workarounds. It's not easily
> usable with syscall(2).

So far I see no good reason for this move. I consider the very existence
of syscall(2) a bug, so any justification based on "it can't be used
with it" is hollow.

Joerg


Re: workqueue_drain

2017-12-20 Thread Joerg Sonnenberger
On Thu, Dec 21, 2017 at 05:32:58AM +0800, Paul Goyette wrote:
> I'm not totally convinced here.  It might be useful to wait for a
> particular work to be finished in order to allow it to be enqueued
> again (no work can be enqueued if already in the queue).  But I don't
> see how "remember the last work enqueued and wait for it to be done
> before destroying" is more versatile than "waiting for all to be done
> before destroying".  It certainly seems that the latter is a simpler
> approach.

Given that you have to make the workqueue externally non-accessible
first anyway, the former provides all the functionality of the latter.
If you don't make it inaccessible first, you always have a race
condition.

Joerg


Re: RFC: ipsec(4) pseudo interface

2017-12-18 Thread Joerg Sonnenberger
On Mon, Dec 18, 2017 at 06:49:44PM +0900, Kengo NAKAHARA wrote:
> (a) Add if_ipsec.4
> (b) move current ipsec.4(for ipsec protocol) to ipsec.9, and then
> add ipsec.4(for ipsec pseudo interface)
> (c) any other

I'd call it either ifipsec(4) or ipsecif(4).

Joerg


Re: Device probing and driver attach

2017-11-03 Thread Joerg Sonnenberger
On Fri, Nov 03, 2017 at 07:46:04PM +0100, Rocky Hotas wrote:
> - In principle, if one built a custom kernel including *only* the drivers
> needed by its current machine, would the boot time get significantly
> reduced?

Well, assuming any modernish bus with decent device enumeration
functionality, the answer is pretty much "it doesn't matter". Most of
the probe functions match things like bus/vendor IDs against a table.
There are very few PCI devices that are more involved, like checking the
PCI config space (e.g. rtk vs. re), and require many additional cycles. So
even with a thousand probe functions running at one microsecond each, we
are talking about < 1 millisecond per device. Given that the actual
hardware configuration often contains delay loops, probing simply doesn't
matter.
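
(Typical shape of such a match function; the "foo" driver and its
vendor/product IDs are placeholders:)

static int
foo_match(device_t parent, cfdata_t cf, void *aux)
{
	const struct pci_attach_args *pa = aux;

	/* A couple of register compares, essentially free. */
	if (PCI_VENDOR(pa->pa_id) == PCI_VENDOR_FOOCORP &&
	    PCI_PRODUCT(pa->pa_id) == PCI_PRODUCT_FOOCORP_FOO)
		return 1;
	return 0;	/* not ours, the next driver gets a look */
}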

> - When a BIOS does not perform this operation, is during the
> autoconfiguration that device BARs are written by the OS?

There are two different parts here:
(1) Probing the size of a PCI BAR. This is done by the OS for general
accounting.
(2) Reserving and/or assigning space for devices. This is generally
assumed on x86 and many other platforms to be handled by the
firmware, with some fallback code, e.g. for hot-plugged bridges.

Joerg


Re: Deadlock on fragmented memory?

2017-10-25 Thread Joerg Sonnenberger
On Wed, Oct 25, 2017 at 11:49:01AM +0700, Robert Elz wrote:
> Most applications would never bother, but the ones that do a lot of
> running of other processes (say, sh, and probably make) could make use
> of this and cut down on a lot of both data copying and page contention.

For sh and make it would also be nice to use posix_spawn for at least
the simple cases. That saves a lot of pmap manipulations...

Joerg


Re: max open files

2017-09-29 Thread Joerg Sonnenberger
On Tue, Sep 26, 2017 at 09:53:29AM +0200, Patrick Welche wrote:
> On Tue, Sep 26, 2017 at 06:33:01AM +, co...@sdf.org wrote:
> > this number is way too easy to hit just linking things in pkgsrc. can we
> > raise it? things fail hard when it is hit.
> > 
> > ive seen people say 'when your build dies, restart it with MAKE_JOBS=1
> > so it doesn't link in parallel'.
> 
> In the same vein, would a kernel with say "maxusers 256" make sense?

IMO maxusers should be removed and replaced by proper scaling of data
structures and limits based on RAM, nothing else.

Joerg


Re: RFC: vlan(4) use pkthdr instead of mtag

2017-09-20 Thread Joerg Sonnenberger
On Wed, Sep 20, 2017 at 02:05:38PM +0900, Shoichi YAMAGUCHI wrote:
> > cxgb_sge.c's t3_encap should not predicate the VLAN support.
> Sorry, I don't understand you. Does this remark means the functions
> related to VLAN like vlan_has_tag should use with "#if NVLAN > 0" ?

No, I mean it just shouldn't use __predict_false for the branch. That's
a stupid assumption to make. Does that make it clearer?

> Thank you the comment. I took a same mistake in VLAN_TAG_VALUE
> and fixed them.

In VLAN_TAG_VALUE it doesn't matter as it stores the entry in the field
reusing the remaining bits. In this case, the other bits of the vlan
field were used for other purposes.

Joerg


Re: RFC: vlan(4) use pkthdr instead of mtag

2017-09-19 Thread Joerg Sonnenberger
On Tue, Sep 19, 2017 at 02:33:12PM +0900, Shoichi YAMAGUCHI wrote:
> Thank you for the comments.
> 
> I updated the patch:
> https://gist.githubusercontent.com/s-ymgch228/6597cfc4b6f79c6c62fcdf25003acb55/raw/8237c14badb390355613794f5e3dee92431a89d2/vlan_mtag.patch
> https://gist.githubusercontent.com/s-ymgch228/6597cfc4b6f79c6c62fcdf25003acb55/raw/7169afbe597f3fb562cdc6afc4ef9caaa04d2542/vlan_mtag.patch.diff

Thanks for the update. Looking through it:

I'd still rename VLAN_TAG_VALUE to vlan_get_tag() for consistency with
the rest.

pq3etsec_tx_offload has a weird empty branch, but that's not new. 

cxgb_sge.c's t3_encap should not predicate the VLAN support.

if_ti.c's ti_rxeof is using the wrong mask now, if I interpret the
comment correctly?

Joerg


Re: performance issues during build.sh -j 40 kernel

2017-09-10 Thread Joerg Sonnenberger
On Sun, Sep 10, 2017 at 07:56:11PM +0200, Maxime Villard wrote:
> Le 10/09/2017 à 19:50, Joerg Sonnenberger a écrit :
> > On Sun, Sep 10, 2017 at 07:17:51PM +0200, Joerg Sonnenberger wrote:
> > > That's true, but changing this also has quite a significant downside on
> > > some workloads for second order effects. I don't think it is a good idea
> > > to change this right now, as it doesn't even fix the real problem.
> > 
> > Just to quantify this part, for a current release build on tmpfs, I see:
> > 
> > After:
> > 4267
> > 4280
> > 4261
> > 4247
> > 4300
> > 
> > Before:
> > 3915
> > 3951
> > 3991
> > 3961
> > 3968
> 
> That's the cacheline alignment on the uvm locks, right? In that case, what do
> you think are the "second order effects"?

Yes, it is adding the alignment in uvm_init.c. So an isolated build of
GENERIC on tmpfs gives:

https://www.netbsd.org/~joerg/lockstat-generic.txt

(that's without DIAGNOSTICS; hannken added a very heavy assert in genfs
recently that needs to be investigated separately). What I strongly
suspect is that the major factor for the lock contention in
uvm_fault_internal is still the uvm_fpageqlock contention. While a
change to the contention of that might be locally positive, it can just
as well increase the contention on the vmobjlock.

Joerg


  1   2   3   4   5   >