Re: PCI: disable I/O or mem before probing BAR size

2020-05-04 Thread David Young
On Mon, May 04, 2020 at 05:56:13PM -0400, Mouse wrote:
> On the other, if anything else could possibly be poking at the device
> while you're probing its mapping register to see how big it is, you've
> got much worse problems already.

If CPU 2 tries to read/write registers on device B while CPU 1 probes
device A's BAR for type/size, device A is enabled, and device A's base
address is momentarily undefined, then I can see devices A and B both
responding to the same transaction and causing a fault to occur.
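
An untested sketch of the sequence, as I understand the proposal (the
helper name is invented; the in-tree sizing lives in pci(9)'s
map-register code): turn off the device's decoders, do the usual
write-ones size probe, then restore both the BAR and the command
register.

#include <sys/types.h>
#include <dev/pci/pcireg.h>
#include <dev/pci/pcivar.h>

/*
 * Sketch only: disable I/O- and memory-space decoding while sizing a
 * BAR, then restore the original BAR contents and command register.
 */
static pcireg_t
pci_bar_size_sketch(pci_chipset_tag_t pc, pcitag_t tag, int bar)
{
        pcireg_t ocsr, obar, mask;

        ocsr = pci_conf_read(pc, tag, PCI_COMMAND_STATUS_REG);
        pci_conf_write(pc, tag, PCI_COMMAND_STATUS_REG,
            ocsr & ~(PCI_COMMAND_IO_ENABLE | PCI_COMMAND_MEM_ENABLE));

        obar = pci_conf_read(pc, tag, bar);
        pci_conf_write(pc, tag, bar, 0xffffffff);       /* size probe */
        mask = pci_conf_read(pc, tag, bar);
        pci_conf_write(pc, tag, bar, obar);             /* restore address */

        pci_conf_write(pc, tag, PCI_COMMAND_STATUS_REG, ocsr);
        return mask;
}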

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: PCI: disable I/O or mem before probing BAR size

2020-05-04 Thread David Young
On Mon, May 04, 2020 at 09:17:28PM +0200, Manuel Bouyer wrote:
> Does anyone see a problem with this ?

On the contrary, sounds like something we should have always done!

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Symbol debugging support for kernel modules in crash dumps

2020-05-01 Thread David Young
Fantastic! Thanks.

Dave

Spilling kerrectud by iPhone

> On May 1, 2020, at 6:34 PM, Christos Zoulas  wrote:
> 
> 
> Hi,
> 
> I just added symbol debugging support for modules in kernel dumps.
> Things are not perfect because of what I call "current thread
> confusion" in the kvm target, but as you see in the following
> session it works just fine if you follow the right steps. First of
> all you need a build from HEAD that has the capability to build
> .debug files for kernel modules.  Once that's done, you are all
> set; see how it works (comments prefixed by )
> 
> Enjoy,
> 
> christos
> 
> $ gdb /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb
> GNU gdb (GDB) 8.3
> Copyright (C) 2019 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64--netbsd".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>.
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb...
> (gdb) target kvm netbsd.22.core
> 0x80224375 in cpu_reboot (howto=howto@entry=260, 
>bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:718
> warning: Source file is more recent than executable.
> 718 if (s != IPL_NONE)
> 
>  Ok we got a stacktrace here, but we don't have a current thread...
>  So we set it...
> 
> (gdb) info thread
>  Id   Target Id Frame 
> * 2.1   0x80224375 in cpu_reboot (
>howto=howto@entry=260, bootstr=bootstr@entry=0x0)
>at ../../../../arch/amd64/amd64/machdep.c:718
> 
> No selected thread.  See `help thread'.
> (gdb) thread 2.1
> 
> [Switching to thread 2.1 ()]
> #0  0x80224375 in ?? ()
> 
>  Note that here we lost all symbol table access when we switched threads
>  let's load it again..
> 
> (gdb) add-symbol-file /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb
> add symbol table from file "/usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb"
> (y or n) y
> Reading symbols from /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb...
> 
>  OK, lets load our modules
> 
> (gdb) source /usr/src/sys/gdbscripts/modload 
> (gdb) modload
> add symbol table from file "/stand/amd64/9.99.59/modules/ping/ping.kmod" at
>.text_addr = 0x8266e000
>.data_addr = 0x8266b000
>.rodata_addr = 0x8266c000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/nfsserver/nfsserver.kmod" at
>.text_addr = 0x82a64000
>.data_addr = 0x82669000
>.rodata_addr = 0x8298e000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/npf_ext_log/npf_ext_log.kmod" at
>.text_addr = 0x82668000
>.data_addr = 0x82667000
>.rodata_addr = 0x82969000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/npf_alg_icmp/npf_alg_icmp.kmod" at
>.text_addr = 0x82666000
>.data_addr = 0x82665000
>.rodata_addr = 0x82952000
> add symbol table from file "/stand/amd64/9.99.59/modules/bpfjit/bpfjit.kmod" 
> at
>.text_addr = 0x82661000
>.data_addr = 0x0
>.rodata_addr = 0x828dd000
> add symbol table from file "/stand/amd64/9.99.59/modules/sljit/sljit.kmod" at
>.text_addr = 0x82945000
>.data_addr = 0x82664000
>.rodata_addr = 0x828f9000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/if_npflog/if_npflog.kmod" at
>.text_addr = 0x8266
>.data_addr = 0x8265f000
>.rodata_addr = 0x828ca000
> add symbol table from file "/stand/amd64/9.99.59/modules/npf/npf.kmod" at
>.text_addr = 0x82648000
>.data_addr = 0x82647000
>.rodata_addr = 0x826d6000
> add symbol table from file "/stand/amd64/9.99.59/modules/bpf/bpf.kmod" at
>.text_addr = 0x82622000
>.data_addr = 0x82621000
>.rodata_addr = 0x826a3000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/bpf_filter/bpf_filter.kmod" at
>.text_addr = 0x8263c000
>.data_addr = 0x0
>.rodata_addr = 0x82627000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/scsiverbose/scsiverbose.kmod" at
>.text_addr = 0x826a2000
>.data_addr = 0x82686000
>.rodata_addr = 0x82687000
> add symbol table from file 
> 

Re: racy access in kern_runq.c

2019-12-06 Thread David Young
On Fri, Dec 06, 2019 at 06:33:32PM +0100, Maxime Villard wrote:
> On 06/12/2019 at 17:53, Andrew Doran wrote:
> >Why atomic_swap_ptr() not atomic_store_relaxed()?  I don't see any bug that
> >it fixes.  Other than that it look OK to me.
> 
> Because I suggested it; my concern was that if not explicitly atomic, the
> cpu could make two writes internally (even though the compiler produces
> only one instruction), and in that case a page fault would have been possible
> because of garbage dereference.

atomic_store_relaxed() is not explicitly atomic?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: __{read,write}_once

2019-11-21 Thread David Young
On Thu, Nov 21, 2019 at 07:19:51PM +0100, Maxime Villard wrote:
> >On 18/11/2019 at 19:49, David Holland wrote:
> >On Sun, Nov 17, 2019 at 02:35:43PM +, Mindaugas Rasiukevicius wrote:
> >  > David Holland  wrote:
> >  > >  > I see the potential source of confusion, but just think about: what
> >  > >  > could "atomic" possibly mean for loads or stores?  A load is just a
> >  > >  > load, there is no RMW cycle, conceptually; inter-locking the 
> > operation
> >  > >  > does not make sense.  For anybody who attempts to reason about this,
> >  > >  > it should not be too confusing, plus there are man pages.
> >  > >
> >  > > ...that it's not torn.
> >  > >
> >  > > As far as names... there are increasingly many slightly different
> >  > > types of atomic and semiatomic operations.
> >  > >
> >  > > I think it would be helpful if someone came up with a comprehensive
> >  > > naming scheme for all of them (including ones we don't currently have
> >  > > that we're moderately likely to end up with later...)
> >  >
> >  > Yes, the meaning of "atomic" has different flavours and describes 
> > different
> >  > set of properties in different fields (operating systems vs databases vs
> >  > distributed systems vs ...) and, as we can see, even within the fields.
> >  >
> >  > Perhaps not ideal, but "atomic loads/stores" and "relaxed" are already 
> > the
> >  > dominant terms.
> >
> >Yes, but "relaxed" means something else... let me be clearer since I
> >wasn't before: I would expect e.g. atomic_inc_32_relaxed() to be
> >distinguished from atomic_inc_32() or maybe atomic_inc_32_ordered() by
> >whether or not multiple instances of it are globally ordered, not by
> >whether or not it's actually atomic relative to other cpus.
> >
> >Checking the prior messages indicates we aren't currently talking
> >about atomic_inc_32_relaxed() but only about atomic_load_32_relaxed()
> >and atomic_store_32_relaxed(), which would be used together to
> >generate a local counter. This is less misleading, but I'm still not
> >convinced it's a good choice of names given that we might reasonably
> >later on want to have atomic_inc_32_relaxed() and
> >atomic_inc_32_ordered() that differ as above.
> >
> >  > I think it is pointless to attempt to reinvent the wheel
> >  > here.  It is terminology used by C11 (and C++11) and accepted by various
> >  > technical literature and, at this point, by academia (if you look at the
> >  > recent papers on memory models -- it's pretty much settled).  These terms
> >  > are not too bad; it could be worse; and this bit is certainly not the 
> > worst
> >  > part of C11.  So, I would just move on.
> >
> >Is it settled? My experience with the academic literature has been
> >that everyone uses their own terminology and the same words are
> >routinely found to have subtly different meanings from one paper to
> >the next and occasionally even within the same paper. :-/  But I
> >haven't been paying much attention lately due to being preoccupied
> >with other things.
> 
> So in the end which name do we use? Are people really unhappy with _racy()?
> At least it has a general meaning, and does not imply atomicity or ordering.

_racy() doesn't really get at the intended meaning, and everything in
C11 is racy unless you arrange otherwise by using mutexes, atomics, etc.
The suffix has very little content.

Names such as load_/store_fully() or load_/store_entire() or
load_/store_completely() get to the actual semantics: at the program
step implied by the load_/store_entire() expression, the memory
constituting the named variable is loaded/stored in its entirety.  In
other words, the load cannot be drawn out over more than one local
program step.  Whether or not the load/store is performed in one step
with respect to interrupts or other threads is not defined.

I feel like load_entire() and store_entire() get to the heart of the
matter while being easy to speak, but _fully() or _completely() seem
fine.
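
For concreteness, a minimal sketch for the 32-bit case (names invented;
a volatile access is the conventional way to get the compiler to emit
exactly one full-width load or store, though the C standard itself does
not promise freedom from tearing):

#include <stdint.h>

/*
 * Hypothetical load_entire()/store_entire() for a 32-bit object.  One
 * full-width access per call; nothing is implied about atomicity or
 * ordering with respect to interrupts or other CPUs.
 */
static inline uint32_t
load_entire_32(const volatile uint32_t *p)
{
        return *p;
}

static inline void
store_entire_32(volatile uint32_t *p, uint32_t v)
{
        *p = v;
}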

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: __{read,write}_once

2019-11-06 Thread David Young
On Wed, Nov 06, 2019 at 06:57:07AM -0800, Jason Thorpe wrote:
> 
> 
> > On Nov 6, 2019, at 5:41 AM, Kamil Rytarowski  wrote:
> > 
> > On 06.11.2019 14:37, Jason Thorpe wrote:
> >> 
> >> 
> >>> On Nov 6, 2019, at 4:45 AM, Kamil Rytarowski  wrote:
> >>> 
> >>> I propose __write_relaxed() / __read_relaxed().
> >> 
> >> ...except that seems to imply the opposite of what these do.
> >> 
> >> -- thorpej
> >> 
> > 
> > Rationale?
> > 
> > This matches atomic_load_relaxed() / atomic_write_relaxed(), but we do
> > not deal with atomics here.
> 
> Fair enough.  To me, the names suggest "compiler is allowed to apply relaxed 
> constraints and tear the access if it wants" But apparently the common 
> meaning is "relax, bro, I know what I'm doing".  If that's the case, I can 
> roll with it.

After reading this conversation, I'm not sure of the semantics.

I *think* the intention is for __read_once()/__write_once() to
load/store the entire variable from/to memory precisely once.  They
provide no guarantees about atomicity of the load/store.  Should
something be said about ordering and visibility of stores?

If x is initialized to 0xf00dd00f, two threads start, and thread
1 performs __read_once(x) concurrently with thread 2 performing
__write_once(x, 0xfeedbeef), then what values can thread 1 read?

Do __read_once()/__write_once() have any semantics with respect to
interrupts?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Portability fix in kern_ntptime.c

2019-06-05 Thread David Young
On Thu, Jun 06, 2019 at 04:42:50AM +0700, Robert Elz wrote:
> Further, I'd never do it without a thorough review of the code,
> if you looked, you'd also see
> 
> freq = (ntv->freq * 1000LL) >> 16;
> 
> and
> 
> ntv->ppsfreq = L_GINT((pps_freq / 1000LL) << 16);
> 
> (and perhaps more) - if one of those is a shift of a negative number,
> the others potentially are as well (not that shifts of negative numbers
> bother me at all if the code author understood what is happening, which
> I suspect that they did here.)

Unfortunately, if "undefined behavior" (UB) is invoked, you simply
cannot claim to understand what is happening, because C11-compliant
compilers have a lot of leeway in what code they generate when behavior
is undefined.
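
For what it's worth, the intent of those expressions can be written
without shifting a possibly negative value at all (sketch only; whether
the rounding difference noted below matters depends on the fixed-point
format being implemented):

#include <stdint.h>

/*
 * Multiplying by a power of two is well defined for signed types as
 * long as the product fits, whereas `v << 16' is undefined behavior
 * when v is negative.
 */
static int64_t
scale_up_16(int64_t v)
{
        return v * 65536;
}

/*
 * Right-shifting a negative value is implementation-defined rather than
 * undefined, but if division is what is meant, this says so explicitly.
 * C division truncates toward zero while an arithmetic right shift
 * rounds toward negative infinity, so the two differ for negative v.
 */
static int64_t
scale_down_16(int64_t v)
{
        return v / 65536;
}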

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Strange PCI bug

2018-07-31 Thread David Young
On Tue, Jul 31, 2018 at 06:11:32PM +0200, Maxime Villard wrote:
> It is really strange. The fact that we receive exception 0x10 seems to
> indicate that there is a problem somewhere related to the pin/irq
> initialization, but I don't really see how the delay could fix that. Or
> maybe the printfs do something more than just adding delay, I don't
> really know.
> 
> Does that ring a bell to someone? Running out of ideas... Phew, I liked
> these machines...

Supposing it is actually a PCI exception:

Maybe the delay gives the target of the `outl` enough time to finish
some initialization.  Maybe the target doesn't respond to PCI
I/O-space writes until the initialization has finished, and the write
is retried until some retry limit is exceeded.

It's also possible that the PCI I/O-space access generated by the `outl`
is flushing a posted PCI memory-space access, and it's the memory-space
access that causes a PCI exception.

Can you look at the PCI error state on the bridges and devices in the
system?  I don't remember if `ddb` will show that, but it may not be
hard to add.  The error state may lead you quickly to the precise device
that's generating the exception.
 
Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: urtwn driver is spammy

2018-06-28 Thread David Young
On Thu, Jun 28, 2018 at 11:48:56AM -0500, David Young wrote:
> On Thu, Jun 28, 2018 at 12:47:06PM +0200, Radoslaw Kujawa wrote:
> > 
> > 
> > > On 28 Jun 2018, at 12:35, Benny Siegert  wrote:
> > > 
> > > urtwn0: could not send firmware command 5
> > 
> > This means it’s unable to set radio signal strength.
> 
> It's not clear *how* the firmware uses the RSSI for its rate adaptation,
> so I cannot say for sure, but it looks to me like the RSSI may be
> averaged over many stations when there is a base station.

Sorry, meant to say "too many stations."

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: urtwn driver is spammy

2018-06-28 Thread David Young
On Thu, Jun 28, 2018 at 12:47:06PM +0200, Radoslaw Kujawa wrote:
> 
> 
> > On 28 Jun 2018, at 12:35, Benny Siegert  wrote:
> > 
> > urtwn0: could not send firmware command 5
> 
> This means it’s unable to set radio signal strength.

It's not clear *how* the firmware uses the RSSI for its rate adaptation,
so I cannot say for sure, but it looks to me like the RSSI may be
averaged over many stations when there is a base station.

For the firmware to use the average RSSI, as computed, for its rate
adaptation is probably unhelpful/counterproductive in an adhoc network
(IBSS mode).

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: CVS commit: src/sys/dev/pci

2018-05-17 Thread David Young
On Thu, May 17, 2018 at 12:01:30PM -0500, Jonathan A. Kollasch wrote:
> On Wed, May 16, 2018 at 01:45:36PM -0700, Jason Thorpe wrote:
> > 
> > 
> > > On May 16, 2018, at 1:07 PM, Jonathan A. Kollasch <jakll...@kollasch.net> 
> > > wrote:
> > > 
> > > I'm a bit uneasy about it myself for that same reason.  However, we
> > > do not to my knowledge have the infrastructure available to do a
> > > complete validation of the resource assignment.  If we did, we'd be
> > > able to do hot attach of PCIe ExpressCard with just a little more work.
> > 
> > 
> > We used to have something like this to support CardBus way back in the day, 
> > but I will admit I wasn’t entirely happy with it at the time.
> 
> rbus?  That's still around, and it's still ugly and doesn't always work.

I have some patches where I had begun to unify CardBus and PCI I/O- &
memory-space management using vmem(9).  The patches have probably rotted
since I stopped working on them 5+ years ago, but it might be easier to
fix them than to start from scratch.  If you want to take this on, I
will try to scare up the patches.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: compat code function pointers

2018-03-18 Thread David Young
On Sun, Mar 18, 2018 at 09:00:06PM -0400, Christos Zoulas wrote:
> Paul suggested:
> 
>   src/sys/kern/kern_junk.c
>   src/sys/sys/kern_junk.h
> 
> I suggested:
> 
>   src/sys/kern/kern_compat.c
>   src/sys/sys/compat.h

I think I have used src/sys/kern/kern_stub.c for a similar purpose in
the past.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: setting DDB_COMMANDONENTER="bt" by default

2018-02-15 Thread David Young
On Thu, Feb 15, 2018 at 06:27:25PM +, David Brownlee wrote:
> Is there some useful variant where the panic message is shown again at the
> end of the stack trace, or the stack trace defaults to a very small number
> of entries by default?

I figure it is not a simple matter to program, but you could probably
print the stack trace "upside down" followed by the panic message?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Restricting rdtsc [was: kernel aslr]

2017-03-28 Thread David Young
On Tue, Mar 28, 2017 at 04:58:58PM +0200, Maxime Villard wrote:
> Having read several papers on the exploitation of cache latency to defeat
> aslr (kernel or not), it appears that disabling the rdtsc instruction is a
> good mitigation on x86. However, some applications can legitimately use it,
> so I would rather suggest restricting it to root instead.

I may not understand some of your premises.

Why do you single out the rdtsc instruction instead of other time
sources?

What do you mean by "legitimately" use rdtsc?  It seems to me that it
is legitimate for a user to use a high-resolution timer to profile some
code that's under development.  They may want to avoid running that code
with root privileges under most circumstances.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: USB device [was "USB slave"]

2017-03-09 Thread David Young
On Thu, Mar 09, 2017 at 03:14:14PM -0500, Terry Moore wrote:
> However, please, please, please: let us follow the USB standard
> nomenclature.

I guess this means that my proposal to call the "device" role the
"index," "cache," or "staging area," according to our current mood, is
dead on arrival? :-)

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Spinning down sata disk before poweroff

2016-06-18 Thread David Young
On Sat, Jun 18, 2016 at 10:51:25PM +0200, Manuel Bouyer wrote:
> On Sat, Jun 18, 2016 at 03:31:20PM -0500, David Young wrote:
> > BTW, what should we do during "manual" action, such as 'drvctl -d wd0'?
> > Seems like we should power off the drive then, too?  Otherwise, it is
> > set up for an abrupt power-loss, later.
> 
> But if you do this in order to do a rescan, you will power down/up
> the drive, which is not so good for enterprise-class drives (for drives
> designed for 24x7 SMART counts the number of stop/start cycles)

I don't understand.  Why detach in order to do a rescan?

BTW, I think that when I wrote "power off the drive," I should have
written "put it into standby."  I'm not sure of the formal meaning of
standby, but I reckon it is up to the drive (controller?) whether or not
to stop the spindle as well as park the heads.

Seen in a certain perspective, it's asymmetric that frequently the
BIOS powers a drive *up* before the bootloader runs, but the OS is
responsible to power a drive *down* during power off, or else the drive
may abruptly lose power.  Is there some way to hand responsibility for
the drive's power state back to the BIOS?  I guess that on x86, that
would be an ACPI BIOS method.  Don't we use an ACPI method to power down
the machine, after all?  ISTR we put the machine into "sleep state 5".

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Spinning down sata disk before poweroff

2016-06-18 Thread David Young
On Fri, Jun 17, 2016 at 09:58:07PM +0200, Manuel Bouyer wrote:
> On Fri, Jun 17, 2016 at 11:59:09AM -0500, David Young wrote:
> > A less intrusive change that's likely to work pretty well, I think, is
> > to introduce a new flag, DETACH_REBOOT or DETACH_STAY_POWERED, that's
> > passed to config_detach_all() by cpu_reboot() when the RB_* flags
> > indicate a reboot is happening.  Then, in the wd(4) detach routine, put
> > the device into standby mode if the flag is not set.
> 
> I'd prefer to have it the other way round then: a DETACH_POWEROFF
> which is set only for halt -p.

Ok.

BTW, what should we do during "manual" action, such as 'drvctl -d wd0'?
Seems like we should power off the drive then, too?  Otherwise, it is
set up for an abrupt power-loss, later.

I may have something to say about the unusual asymmetry of BIOS and OS
responsibilities here, later.

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Spinning down sata disk before poweroff

2016-06-17 Thread David Young
On Fri, Jun 17, 2016 at 01:23:36PM +0200, Manuel Bouyer wrote:
> On Fri, Jun 17, 2016 at 01:49:43AM +, Anindya Mukherjee wrote:
> > Hi,
> > 
> > I'm running NetBSD 7.0.1_PATCH (GENERIC) amd64 on a Dell laptop. Almost 
> > everything is working perfectly, except the fact that every time I shutdown 
> > using the -p switch, the hard drive makes a loud click sound as the system 
> > powers off. I checked the SMART status (atactl and smartctl) and after 
> > every poweroff the Power_Off-Retract-Count parameter increases by 1.
> > 
> > I did some searching on the web and came across PR #21531 where this issue 
> > was discussed from 2003-2008 and finally a patch was committed which 
> > resolved the issue by sending the STANDBY_IMMEDIATE command to the disk 
> > before powering off. Since then the code has been refactored, but it is 
> > present in src/sys/dev/ata/wd.c line 1970 (wd_shutdown) which calls line 
> > 1848 (wd_standby). This seemed strange since the disk was definitely not 
> > being spun down.
> > 
> > I attached a remote gdb instance and stepped through the code during 
> > shutdown, breaking on wd_flushcache() which is always called. The code path 
> > taken is wdclose()->wdlastclose() (lines 1029, 1014). I can see that the 
> > cache is flushed but then the device is deleted in line 1023. Subsequently, 
> > power is cut off during shutdown, causing an emergency retract. So, it 
> > seems at least for newer sata disks the spindown code is not being called. 
> > I'm fairly new to NetBSD code so there is a chance I read this wrong, so 
> > feel free to correct me.
> > 
> > Ideally I'd like the disk to spin down during poweroff (-p) and halt (-h), 
> > perhaps settable using a sysctl, but not during a reboot (-r). I am 
> > planning to patch wdlastclose() as an experiment to run the spindown code 
> > to see if it stops the click. Is this a known issue, worthy of a PR? I can 
> > file one. I can also volunteer a patch once I have tested it on my laptop. 
> > Comments welcome!
> 
> 
> So the disk is not powered off because it's detached before the pmf framework
> has a chance to power it off (see amd64/amd64/machdep.c:cpu_reboot()).
> that's bad.
> Doing the poweroff in wdlastclose() is bad because then you'll have a
> poweroff/powerup cycle for a reboot, or even on unmount/mount events if this
> is not your root device. This can be harmfull for some disks (this has already
> been discussed).
> 
> The (untested) attached patch should fix this by calling pmf before detach;
> can you give it a try ?

Careful!  The alternation of detaching devices and unmounting
filesystems is purposeful.  You can have devices such as vnd(4) backed
by filesystems backed by further devices.

It's possible that unmounting a filesystem will counteract the PMF
shutdown.

A less intrusive change that's likely to work pretty well, I think, is
to introduce a new flag, DETACH_REBOOT or DETACH_STAY_POWERED, that's
passed to config_detach_all() by cpu_reboot() when the RB_* flags
indicate a reboot is happening.  Then, in the wd(4) detach routine, put
the device into standby mode if the flag is not set.
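
A sketch of what I mean (the flag and the helper are invented here; the
real wd(4) code already has wd_standby() for the STANDBY IMMEDIATE
part, and a real detach routine has much more to do):

#include <sys/param.h>
#include <sys/device.h>

/* Hypothetical flag from this thread; the value is arbitrary. */
#define DETACH_REBOOT   0x0100

/* Stand-in for wd_standby(): issue ATA STANDBY IMMEDIATE here. */
static void
wd_standby_sketch(device_t dev)
{
}

static int
wd_detach_sketch(device_t self, int flags)
{
        /* ... the usual teardown: flush the write cache, cancel I/O ... */

        if ((flags & DETACH_REBOOT) == 0) {
                /*
                 * Not a reboot, so power really is about to go away:
                 * park the heads/stop the spindle now rather than let
                 * the drive record an emergency retract.
                 */
                wd_standby_sketch(self);
        }
        return 0;
}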

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Introduce curlwp_bind and curlwp_unbind for psref(9)

2016-06-14 Thread David Young
On Wed, Jun 15, 2016 at 10:56:57AM +0900, Ryota Ozaki wrote:
> On Tue, Jun 14, 2016 at 7:58 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> > On Tue, Jun 14, 2016 at 09:53:33AM +0900, Ryota Ozaki wrote:
> >> - curlwp_bind and curlwp_unbind
> >> - curlwp_bound_set and curlwp_bound_restore
> >> - curlwp_bound and curlwp_boundx
> >>
> >> Any other ideas? :)
> >
> > curlwp_bind_push / curlwp_bind_pop
> 
> Hmm, I think the naming fits in if Linux, but not in NetBSD.
> And
>   bound = curlwp_bind_push();
>   ...
>   curlwp_bind_pop(bound);
> looks odd to me.

bound = curlwp_bind_get(); curlwp_bind_put(bound)?

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL    http://trestle.tech


Re: CPU-local reference counts

2016-06-13 Thread David Young
On Mon, Jun 13, 2016 at 04:54:32PM +, Taylor R Campbell wrote:
> Prompted by the discussion last week about making sure autoconf
> devices and bdevsw/cdevsw instances don't go away at inopportune
> moments during module unload, I threw together a generic sketch of the
> CPU-local reference counts I noted, which I'm calling localcount(9).

Hi Taylor,

Here are a couple of ideas that I probably picked up from other
NetBSDers and from contributors to an old thread,
<https://mail-index.netbsd.org/tech-kern/2013/01/17/msg014872.html>.

One way both to save some memory and to reduce the cache footprint of a
reference-counting scheme like this is to use "narrow" per-CPU counters
(uint16_t, say) and a wide, shared counter that's increased, using an
interlocked atomic operation, whenever a per-CPU counter rolls over.

To save some more memory, you can make struct localcount,

struct localcount {
int64_t *lc_totalp;
struct percpu   *lc_percpu; /* int64_t */
};

into a handle,

struct localcount {
unsigned intlc_slot;
};

where the lc_slot indicates an index into a few arrays: a per-CPU array
of local counters, a global array of shared counters, and an array of
whatever flags ("draining") you may require.

Depending how many localcount you expect to be extant in the system, you
can make lc_slot a uint16_t or a uint32_t.
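
A sketch of the narrow-counter idea, with invented names, the per-CPU
plumbing assumed (preemption blocked around the increment), and the
decrement side left out:

#include <sys/types.h>
#include <sys/atomic.h>

struct narrowcount {
        uint16_t                nc_local;       /* one per CPU, no interlock */
};

struct widecount {
        volatile uint64_t       wc_total;       /* shared, interlocked updates */
};

static inline void
narrowcount_inc(struct narrowcount *nc, struct widecount *wc)
{
        if (++nc->nc_local == 0) {
                /* 16-bit counter wrapped: credit 2^16 to the shared total. */
                atomic_add_64(&wc->wc_total, 0x10000);
        }
}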

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL    http://trestle.tech


Re: Locking strategy for device deletion (also see PR kern/48536)

2016-06-07 Thread David Young
On Tue, Jun 07, 2016 at 06:28:11PM +0800, Paul Goyette wrote:
> Can anyone suggest a reliable way to ensure that a device-driver
> module can be _really_ safely detached?
> 
> The module could theoretically maintain an open/ref counter, but
> making this MP-safe is "difficult"!  Even if the module were to
> provide a mutex to control increment/decrement of it's counter,
> there's still a problem:
> 
> Thread 1 initiates a module-unload, which takes the mutex
> 
> Thread 2 attempts to open the device (or one of its units), attempts to
> grab the mutex, and waits
> 
> Back in thread 1, the driver's module unload code determines that it
> is safe to unload (no current activites queued, no current opens),
> so it
> goes forward and unmaps the module - including the mutex!

I think that what's missing is a flag on the module that says it is
unloading, and module entrance/exit counters.  I think it could work
sort of like this---the devil is in the details:

Thread 1 initiates a module unload:
1) Acquires mutex
2) Sets the module's unloading flag
3) Unlinks module entry points---that is, they're still mapped,
   but there are no more globally-visible pointers to them
4) While module entrances > exits, sleeps on module condition
   variable C, thus temporarily releasing mutex
5) Releases mutex
6) Unmaps module

Thread 2 attempts to open the device
1) Increases module-entrance count
2) Acquires mutex
3) Examines unloading flag
a) Finding it set, signals condition variable C,
b) OR, finding it NOT set, performs open
4) increases module-exit count
5) releases mutex

The module entrance/exit counts can be per-CPU variables that you
increment using non-interlocked atomic instructions, which are not very
expensive.

Now, I am trying to remember if/why counting entrances and exits
separately is necessary.  ISTM that to avoid races, you want to add up
exits across all CPUs, first, then add up entrances, and compare.

This is not necessarily the best or only way to handle this, and I feel
sure that I've overlooked a fatal flaw in this first draft.
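
In code, the two paths might look roughly like this (a sketch with
invented names; the entrance/exit counters are shown as plain fields
rather than the per-CPU variables described above):

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

struct modgate {
        kmutex_t        mg_lock;
        kcondvar_t      mg_cv;
        bool            mg_unloading;
        uint64_t        mg_entrances;   /* per-CPU in the real thing */
        uint64_t        mg_exits;
};

/* Thread 2: the open path (steps 1-5 above). */
static int
modgate_open(struct modgate *mg)
{
        int error = 0;

        mg->mg_entrances++;
        mutex_enter(&mg->mg_lock);
        if (mg->mg_unloading) {
                cv_signal(&mg->mg_cv);
                error = ENXIO;
        } else {
                /* ... perform the open ... */
        }
        mg->mg_exits++;
        mutex_exit(&mg->mg_lock);
        return error;
}

/* Thread 1, steps 1-5: drain before unmapping the module. */
static void
modgate_drain(struct modgate *mg)
{
        mutex_enter(&mg->mg_lock);
        mg->mg_unloading = true;
        /* ... unlink the globally visible entry points here ... */
        while (mg->mg_entrances > mg->mg_exits)
                cv_wait(&mg->mg_cv, &mg->mg_lock);
        mutex_exit(&mg->mg_lock);
}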

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: nexthop cache separation

2016-04-04 Thread David Young
On Fri, Mar 25, 2016 at 07:41:18PM +0900, Ryota Ozaki wrote:
> [after]
> Routing tables
> 
> Internet:
> Destination        Gateway            Flags    Refs  Use  Mtu    Interface
> default            10.0.1.1           UGS      -     -    -      shmif0
> 10.0.1/24          link#2             UC       -     -    -      shmif0
> 10.0.1.2           link#2             UHl      -     -    -      lo0
> 127.0.0.1          lo0                UHl      -     -    33648  lo0

Previous to the change you've proposed, 'route show' provided a more
comprehensive view of the routing state.  Now, it is missing several
useful items.  Where has the 10.0.1.1 route gone?  Where are the
MAC addresses?  Previously, you could issue one command and within
an eyespan have all of the information that you needed to diagnose
connectivity problems on routers.  Now, to first appearances, every
routing state looks suspect, and it's necessary to dig in with arp/ndp
to see if things are ok.

Please, if your changes materially change the user interface, provide
mockups.  Mockups are powerful communication tools that help to build
consensus and provide strong implementation guidance.  Design oversights
that are obvious in mockups may be invisible in patches.  It's easy
to mockup command-line displays like route(8)'s using $EDITOR.  I
cannot recommend strongly enough that developers add mockups to their
engineering-communications repertoire.

Dave

[*] It was bad enough that networking in NetBSD contains many potential
switchbacks and blackholes once a firewall is active.  I don't think
we're worse than any other system in that regard, but ISTM we should
    strive to be *better*.

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: nexthop cache separation

2016-03-22 Thread David Young
On Tue, Mar 22, 2016 at 01:14:39PM +0900, Ryota Ozaki wrote:
> Hi,
> 
> Here are new patches:
>   http://www.netbsd.org/~ozaki-r/separate-nexthop-caches-v2.diff
>   http://www.netbsd.org/~ozaki-r/separate-nexthop-caches-v2-diff.diff
> 
> Changes since v1:
> - Comment out osbolete RTF_* and RTM_* definitions
>   - Tweak some userland codes for the change
> - Restore checks of connected (cloning) routes in nd6_is_addr_neighbor
> - Restore the original behavior on removing ARP/NDP entries for
>   IP addresses of interface itself
> - Remove remaining use of RTF_LLINFO in the kernel
>   - I think we can remove it safely

It sounds as if these changes could affect the appearance of netstat -r,
route show, route get, arp -a, arp .  Can you provide some
before/after examples?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: softint-based if_input

2016-01-25 Thread David Young
wake processing thread

processing thread:
        loop forever:
                enable interrupts
                wait for wakeup
                for each Rx packet on ring:
                        process packet

That stopped the user-tickle watchdog from firing.  It was handy having
a full-fledged thread context to process packets in.  But there were
trade-offs.  As Matt Thomas pointed out to me, if it takes longer for
the NIC to read the next packet off of the network than it takes your
thread to process the current packet, then your Rx thread is going to go
back to sleep again after every single packet.  So there's potentially
a lot of context-switch overhead and latency when you're receiving
back-to-back large packets.

ISTR Matt had some ideas how context switches could be made faster, or
h/w interrupt handlers could have an "ordinary" thread context, or the
scheduler could control the rate of softints, or all of the above.  I
don't know if there's been any progress along those lines in the mean
time.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 12:52:16PM +0700, Robert Elz wrote:
> Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> From:Paul Goyette <p...@vps1.whooppee.com>
> Message-ID:  <pine.neb.4.64.1601081115270.22...@vps1.whooppee.com>
> 
>   | Is there a "supported" interface for detaching the file (or descriptor) 
>   | from the process without closing it?
> 
> Actually, thinking through this more, why not just "fix" filemon to make
> a proper reference to the file, instead of the half baked thing it is
> currently doing.

Yes, please! :-)

Furthermore, stick the file into LWP 0's descriptor table so that you
can see it with fstat.  It's a little more code to write---I wrote it
for gre(4)---but it's well worth the visibility.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 10:47:08AM -0600, David Young wrote:
> On Fri, Jan 08, 2016 at 12:52:16PM +0700, Robert Elz wrote:
> > Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> > From:Paul Goyette <p...@vps1.whooppee.com>
> > Message-ID:  <pine.neb.4.64.1601081115270.22...@vps1.whooppee.com>
> > 
> >   | Is there a "supported" interface for detaching the file (or descriptor) 
> >   | from the process without closing it?
> > 
> > Actually, thinking through this more, why not just "fix" filemon to make
> > a proper reference to the file, instead of the half baked thing it is
> > currently doing.
> 
> Yes, please! :-)
> 
> Furthermore, stick the file into LWP 0's descriptor table so that you

Oops, meant PID 0's.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 12:26:14PM +0700, Robert Elz wrote:
> Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> From:Paul Goyette <p...@vps1.whooppee.com>
> Message-ID:  <pine.neb.4.64.1601081115270.22...@vps1.whooppee.com>
> 
>   | Is there a "supported" interface for detaching the file (or descriptor) 
>   | from the process without closing it?
> 
> Inside the kernel you want to follow the exact same procedure as would
> be done by
> 
>   newfd = dup(oldfd);
>   close(oldfd);
> 
> except instead of dup (and assigning to a newfd in the process) we
> take the file reference and stick it in filemon.   There's nothing
> magic about this step.  What magic there is (though barely worthy of
> the title) would be in ensuring that filemon properly releases the file
> when it is closing.

Years ago I added to gre(4) an ioctl that a user thread can use to
delegate to the kernel a UDP socket that carries tunnel traffic.  I
think that that code should cover at least the dup(2) part of Robert's
suggestion.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Lightweight support for instruction RNGs

2015-12-21 Thread David Young
On Mon, Dec 21, 2015 at 07:38:57PM -0500, Thor Lancelot Simon wrote:
> On Mon, Dec 21, 2015 at 09:28:40AM -0800, Alistair Crooks wrote:
> > I think there's some disconnect here, since we're obviously talking
> > past each other.
> > 
> > My concern is the output from the random devices into userland. I
> 
> Yes, then we're clearly talking past each other.  The "output from the
> random devices into userland" is generated using the NIST SP800-90
> CTR_DRBG.  You could key it with all-zeroes and the statistical properties
> of the output would differ in no detectable way* from what you got if
> you keyed it with pure quantum noise.
> 
> If you want to run statistical tests that mean anything, you need to
> feed them input from somewhere else.  Feeding them the output of the
> CTR_DRBG can be nothing but -- at best -- security theater.
> 
>  [*maybe some day we will have a cryptanalysis of AES that allows us to
>detect such a difference, but we sure don't now]

Thor,

I think Alistair is concerned that the implementation of "NIST SP800-90
CTR_DRBG" could be incorrect, or else that it could be embedded in
a system in which the correct behavior is not, for whatever reason,
manifest in the userland output.  Thus the statistical properties of the
output could be different from specifications.  Maybe one of the problem
systems will be, for unforeseen reasons, one in which there is an RNG
instruction.  Stranger things have happened.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: I'm interested in implementing lockless fifo/lifo in the kernel

2015-12-02 Thread David Young
On Tue, Nov 24, 2015 at 12:04:09AM -0800, Randy White wrote:
> I love NetBSD, and I would like to contribute. I see the open job for 
> lockless queues, and stacks. I want to learn, and I want to help. I have 
> literature on UNIX kernel development. I have many systems and I think I 
> could fund myself for the most part. I am familiar with lockless programming. 
> 
> I am looking forward to working on netbsd and help maintaining its 
> awesomeness. 

Randy,

That's great.  Let me know how I can help you to get started.  It sounds
like you're already familiar with NetBSD, how and where we communicate,
etc.

BTW, we have a lockless queue in NetBSD called pcq(9).  People will
disagree whether it is the best/only lockless queue for the purpose of,
say, SMP networking. pcq(9) uses a fixed-size ring buffer, which may be
advantageous in some scenarios and a liability in others.

We are notably lacking a fast *linked* FIFO queue---i.e., one that can
take the place of struct ifqueue/IF_ENQUEUE()/IF_DEQUEUE() for mbuf
queues.
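
A minimal sketch of pcq(9) used as a fixed-size mbuf ring, assuming the
pcq_create()/pcq_put()/pcq_get() interface and leaving out all of the
surrounding driver glue:

#include <sys/pcq.h>
#include <sys/kmem.h>
#include <sys/mbuf.h>

static pcq_t *rxq;

static void
rxq_init(void)
{
        rxq = pcq_create(256, KM_SLEEP);        /* fixed-size ring */
}

/* Producer side, e.g. from a softint: drop on overflow. */
static void
rxq_enqueue(struct mbuf *m)
{
        if (!pcq_put(rxq, m))
                m_freem(m);                     /* ring full */
}

/* Consumer side, e.g. a processing thread. */
static void
rxq_drain(void)
{
        struct mbuf *m;

        while ((m = pcq_get(rxq)) != NULL) {
                /* ... process the packet ... */
                m_freem(m);
        }
}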

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: spiflash.c process_write()

2015-09-08 Thread David Young
On Tue, Sep 08, 2015 at 06:12:11AM +, David Holland wrote:
> As noted in passing elsewhere, it seems that process_write() in
> spiflash.c allocates a scratch buffer on every call... and leaks it on
> every call too. This clearly isn't a good thing.

My recollection is fuzzy, now, but I think that that was written to
support the Meraki Mini.  ISTR NOR-flash writing was never tested on
the Mini (maybe not anywhere?).  A misplaced/corrupted write to the
NOR Flash could have bricked the Mini, and there was a limited supply
available for testing!

I think I have a Mini or two around here, somewhere.  I can make one
available with serial console and outlet control, if somebody has an
urge to test it out.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Restructuring ARP cache

2015-08-25 Thread David Young
On Mon, Aug 17, 2015 at 06:23:14PM +0900, Ryota Ozaki wrote:
> Hi,
> 
> Here is a new patch that restructures ARP caches, which
> aims for MP-safe network stack:
> http://www.netbsd.org/~ozaki-r/lltable-arpcache.diff
> (https://github.com/ozaki-r/netbsd-src/tree/lltable-arpcache)

I don't think that through this piecemeal approach, NetBSD can achieve a
maintainable, MP-safe network stack that is competitive in performance
with the other stacks out there.

To put this into the old formula, you may pick three: piecemeal,
maintainable, MP-safe, performance.

I think it's important to take several steps toward simplicity before
anything else.  Neither simplicity nor MP-safety are compatible with a
network stack shot through with caches like NetBSD's stack is, now.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: argument of pci_msi[x]_count()

2015-08-12 Thread David Young
On Thu, Aug 13, 2015 at 06:56:34AM +1000, matthew green wrote:
> > I don't have a problem with it, I was just questioning the rationale
> > about passing pci_attach_args to functions...
> 
> the original pci(9) interfaces didn't do this, but a 3rd member of
> pci_attach_args{} was needed for a new change, so someone (i forget
> now, but CVS will tell you) changed it to pass the structure itself,
> since this was only called during autoconfig when this structure was
> actually available.
> 
> doing it outside of autoconfig is not a good idea, though, so any
> function that is usefully callable outside of attach probably should
> take specific arguments instead of pci_attach_args{}.

ISTR a hairy wi(4) bug came about because *_attach_args was passed
outside attach!  [I may have introduced that bug, too. :-)]

It sounds to me like the emerging consensus is that it's best to pass
only the chipset+memory tag, if that's all you need, to each MSI/MSI-X
function.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Introducing CloudABI: a pure capability-based runtime for NetBSD (and other systems)

2015-07-23 Thread David Young
On Thu, Jun 25, 2015 at 03:11:51PM +0200, Ed Schouten wrote:
> Hello NetBSD hackers,
> 
> Two weeks ago I gave a talk at BSDCan about something I've been
> working on for the last half a year called CloudABI[1]. In short,
> CloudABI is an alternative UNIX-like runtime environment that purely
> uses capability-based security, strongly influenced by Capsicum[2].

Ed,

It has always seemed to me that it will be easier for a user to form and
to operate a mental model for a capability system, especially if the
system makes the capabilities visible, than to model any rules-based
system.  So capabilities have always looked like a good foundation for
building *usable* security.

Initially, I was very excited about Capsicum, practical capabilities
for UNIX.  But it seems like Capsicum isn't for users, it is for
developers: in the examples I have read, you have to modify a program's
source to make good use of Capsicum.  That seems like an unnecessarily
high barrier to use.

That brings me to my question about CloudABI.  It sounds like CloudABI
is aimed at developers, who would adapt programs to work with the new
run-time?  Or is there an upside to CloudABI for users, too?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Improving use of rt_refcnt

2015-07-06 Thread David Young
On Sun, Jul 05, 2015 at 11:50:12AM +0200, Joerg Sonnenberger wrote:
> I think the main point that David wanted to raise is that the normal
> path for packets should *not* do any ref count changes at all.

I wasn't trying to make a point.  I wanted to make sure that I properly
understood Ryota's plans.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Improving use of rt_refcnt

2015-07-04 Thread David Young
On Sat, Jul 04, 2015 at 09:52:56PM +0900, Ryota Ozaki wrote:
> I'm trying to improve use of rt_refcnt: reducing
> abuse of it, e.g., rt_refcnt++/rt_refcnt-- outside
> route.c and extending it to treat referencing
> during packet processing (IOW, references from
> local variables) -- currently it handles only
> references between routes. The latter is needed for
> MP-safe networking.

Do you propose to increase/decrease rt_refcnt in the packet processing
path, using atomic instructions?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Interrupt flow in the NetBSD kernel

2015-06-22 Thread David Young
On Sun, Jun 21, 2015 at 08:01:47AM -0700, Matt Thomas wrote:
 
> > On Jun 21, 2015, at 7:30 AM, Kamil Rytarowski n...@gmx.com wrote:
> > 
> > I have got few questions regarding the interrupt flow in the kernel.
> > Please tell whether my understanding is correct.
> 
> You are confusing interrupts with exceptions.  Interrupts are
> asynchronous events.  Exceptions are (usually) synchronous and
> are the result of an instruction.

I took Kamil's question to be, When interrupts at the highest priority
level are blocked, can control flow still be interrupted?  How?  The
answer to the question is yes.  Both synchronous events (exceptions,
such as data abort on ARM) and asynchronous events (non-maskable
interrupts, such as NMI on x86) can interrupt control flow.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-13 Thread David Young
On Thu, Nov 13, 2014 at 12:41:38PM +0900, Kengo NAKAHARA wrote:
> (2014/11/13 11:54), David Young wrote:
> > On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> > > Could you comment the specification and implementation?
> > 
> > The user should not be on the hook to set processor affinity for the
> > interrupts.  That is more properly the responsibility of the designer
> > and OS.
> 
> I wrote unclear explanation..., so please let me redescribe.
> 
> This MSI/MSI-X API *design* is independent from processor affinity.
> The device dirvers can use MSI/MSI-X and processor affinity
> independently of each other. In other words, legacy interrupts and
> INTx interrupts can use processor affinity still. Furthermore,
> MSI/MSI-X may or may not use processor affinity.

MSI/MSI-X is not half as useful as it ought to be if a driver's author
cannot spread interrupt workload across the available CPUs.  If you
don't mind, please share your processor affinity proposal and show how
it works with interrupts.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-13 Thread David Young
On Thu, Nov 13, 2014 at 01:59:09PM -0600, David Young wrote:
> On Thu, Nov 13, 2014 at 12:41:38PM +0900, Kengo NAKAHARA wrote:
> > (2014/11/13 11:54), David Young wrote:
> > > On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> > > > Could you comment the specification and implementation?
> > > 
> > > The user should not be on the hook to set processor affinity for the
> > > interrupts.  That is more properly the responsibility of the designer
> > > and OS.
> > 
> > I wrote unclear explanation..., so please let me redescribe.
> > 
> > This MSI/MSI-X API *design* is independent from processor affinity.
> > The device dirvers can use MSI/MSI-X and processor affinity
> > independently of each other. In other words, legacy interrupts and
> > INTx interrupts can use processor affinity still. Furthermore,
> > MSI/MSI-X may or may not use processor affinity.
> 
> MSI/MSI-X is not half as useful as it ought to be if a driver's author
> cannot spread interrupt workload across the available CPUs.  If you
> don't mind, please share your processor affinity proposal and show how
> it works with interrupts.

Here are some cases that interest me:

1) What interrupts does a driver establish if the NIC has separate
   MSI/MSI-X interrupts for each of 4 Tx DMA rings and each of 4 Rx DMA
   rings, and there are 2 logical CPUs?  Can/does the driver provide
   any hints about the processor that is the target of each interrupt?
   What CPUs receive the interrupts?

2) Same as above, but what if there are 4 logical CPUs?

3) Same as previous, but what if there are 16 logical CPUs?

There's more than one way to crack this nut, I'm just wondering how you
propose to crack it. :-)

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-12 Thread David Young
On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> Could you comment the specification and implementation?

The user should not be on the hook to set processor affinity for the
interrupts.  That is more properly the responsibility of the designer
and OS.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Brainy: Set of 33 potential bugs

2014-09-20 Thread David Young
On Sat, Sep 20, 2014 at 08:48:08PM +0200, Maxime Villard wrote:
> Hi,
> here is another set of 33 potential bugs found by my code scanner.
> 
>   http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html#Report-3
> 
> Not all bugs are listed here; I've put only those which looked like proper
> bugs. I guess they will all need to be fixed in NetBSD-7.

Is the source for this scanner available somewhere?  Does it build on
some existing project, such as LLVM?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-08-29 Thread David Young
On Sat, Jun 07, 2014 at 08:36:47AM +1000, matthew green wrote:
 
> let's not forget my favourite mis-feature of MSI/MSI-X:
> 
> if you misconfigure the address, interrupts might cause main memory to
> be corrupted.  i've seen this happen, and it was rather difficult to
> diagnose the real culprit..

Picking up this discussion again, rather late.

If there is an IOMMU available, shouldn't it be used to protect against
this kind of memory corruption?  Even some x86 machines have IOMMUs
these days.

> i'm a little confused about bus_msi(9) -- pci_intr(9) is already an MD
> interface, so if it was extended or if we copied the pci_intr_map_msi()
> functions from elsewhere, it's still MD code we have to write.
> what does bus_msi(9) add?  who would use it?

bus_msi(9) gives MI code access to doorbells: MI code uses it to
establish a doorbell - interrupt handler mapping and find out the
doorbell's physical address.

All the code to map the doorbell's physaddr into a PCI busaddr, to
program the IOMMU if there is one, to establish the MSI address/data in
the PCI device, and to enable MSI is MI code using bus_dma(9), pci(9),
and bus_space(9).  Even if it's 100 lines or fewer, why duplicate it
across platforms?
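
To make the PCI side concrete: given an address/data pair handed back
by the proposed (not yet written) allocator, the MI code could program
a device with only existing pci(9) calls, e.g. (sketch; the offsets and
the enable bit follow the standard MSI capability layout, 32-bit
message address case):

#include <sys/types.h>
#include <sys/errno.h>
#include <dev/pci/pcireg.h>
#include <dev/pci/pcivar.h>

static int
msi_program_sketch(pci_chipset_tag_t pc, pcitag_t tag,
    uint32_t doorbell_addr, uint32_t doorbell_data)
{
        int off;
        pcireg_t ctl;

        if (!pci_get_capability(pc, tag, PCI_CAP_MSI, &off, &ctl))
                return ENODEV;          /* no MSI capability */

        pci_conf_write(pc, tag, off + 4, doorbell_addr);   /* Message Address */
        pci_conf_write(pc, tag, off + 8, doorbell_data);   /* Message Data */

        /* MSI Enable is bit 0 of Message Control, i.e. bit 16 of dword 0. */
        pci_conf_write(pc, tag, off, ctl | (1U << 16));
        return 0;
}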

Also, doorbells look to me like a potentially useful facility to make
generally available, even apart from their use with PCI MSI.  Anyway,
I'm curious what uses people would come up with.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: ixg(4) performances

2014-08-26 Thread David Young
On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
> On Aug 26,  2:23pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
> -- Subject: Re: ixg(4) performances
> 
> | On Tue, Aug 26, 2014 at 12:57:37PM +, Christos Zoulas wrote:
> | 
> ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
> | 
> | Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
> | the BIOS has no setting for it, NetBSD is screwed.
> | 
> | I see dev/pci/pciio.h  has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
> | does that means Linux's setpci can be easily reproduced?
 
 I would probably extend pcictl with cfgread and cfgwrite commands.

Emmanuel,

Most (all?) configuration registers are read/write.  Have you read the
MMRBC and found that it's improperly configured?

Are you sure that you don't have to program the MMRBC at every bus
bridge between the NIC and RAM?  I'm not too familiar with PCI Express,
so I really don't know.

Have you verified the information at
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe with the 82599
manual?  I have tried to corroborate the information both with my PCI
Express book and with the 82599 manual, but I cannot make a match.
PCI-X != PCI Express; maybe ixgb != ixgbe?  (It sure looks like they're
writing about an 82599, but maybe they don't know what they're writing
about!)


Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction.  I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)

How well has blindly poking configuration registers worked for us in
the past?  I can think of a couple of instances where a knowledgeable
developer thought that they were writing a helpful value to a useful
register and getting a desirable result, but in the end it turned out to
be a no-op.  In one case, it was an Atheros WLAN adapter where somebody
added to Linux some code that wrote to a mysterious PCI configuration
register, and then some of the *BSDs copied it.  In the other case, I
think that somebody used pci_conf_write() to write a magic value to a
USB host controller register that wasn't on a 32-bit boundary.  ISTR
that some incorrect value was written, instead.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Dead code: double return

2014-08-18 Thread David Young
On Mon, Aug 18, 2014 at 11:28:13AM +0200, Maxime Villard wrote:
> Hi,
> my code scanner reports in several places lines like these:
> 
>   return ERROR_CODE/func(XXX);
>   return VALUE;

In some of your examples, it looks like code may have been copied and
pasted.  Is some refactoring of the code called for?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: IRQ affinity (aka interrupt routing)

2014-07-25 Thread David Young
On Fri, Jul 25, 2014 at 07:15:02PM +0900, Kengo NAKAHARA wrote:
> Hi martin,
> 
> Thank you very much for your comment.
> 
> (2014/07/25 18:15), Martin Husemann wrote:
> > A few general comments:
> > 
> >  - best UI is no UI - the kernel should distribute interrupts automatically
> >    (if possible) as fair as possible over time doiing some statistics
> 
> I agree the computer should distribute interrupts automatically, but
> I think balancing interrupts is too complex for the kernel. So I think
> the balancing should be done by the userland daemon which use the UI.
> Implementing and tuning the userland daemon are future works.

What's the goal of balancing interrupts?  Controlling latency?  That's
important, but it seems like other considerations might apply.  For
example, funneling all interrupts to one core might allow the other
cores to idle in a power-saving state.  Also, it might help to avoid
cacheline motion for two interrupts involved in the same network flow to
fire on the same CPU.

Can you explain why you think that balancing interrupts is too complex
for the kernel?  I would not necessarily disagree, but it seems like
the kernel has the most immediate access to the relevant interrupt
statistics *and* the responsibility for CPU scheduling, so it's in a
pretty good position to react to imbalances.

Dave

> >  - a UI to wire some device interrupts to a special CPU would be ok,
> >    I prefer a new intrctrl for that
> 
> >  - vmstat -i could gain an additional column with the current target cpu
> >    of the interrupt
> 
> I am afraid of breaking backward compatibility, so I avoid to change
> existing commands.
> 
> >  - the device name is nice, but what about shared interrupts?
> 
> I forgot shared interrupts... In the current implement, the device name
> is overwritten by the device established later. Of course this is ugly,
> so I must fix it.
> 
> 
> Thanks,
> 
> -- 
> //
> Internet Initiative Japan Inc.
> 
> Device Engineering Section,
> Core Product Development Department,
> Product Division,
> Technology Unit
> 
> Kengo NAKAHARA k-nakah...@iij.ad.jp

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Obtaining list of created sockets

2014-07-01 Thread David Young
On Mon, Jun 30, 2014 at 09:39:37AM -0700, Will Dignazio wrote:
> That would be an excellent start; I had considered it before, however I
> thought that netstat only listed listening connections. With lsof, you
> would only get sockets created with fsocket (those having file descriptors).
> 
> I suppose combining the way the two get their information would yield a
> majority of the sockets created, however I would like to get all internal
> sockets that may not be listening yet, or never get a file descriptor.

Some years back, I modified gre(4) to use an actual socket instead of
rolling its own GRE or UDP packets.  IIRC, I made gre(4) *always* create
a file descriptor so that fstat(1) provided a comprehensive view of the
sockets in the system.

Having sockets in the system that appear neither in fstat(1) nor nor
netstat(8) output seems like an unnecessary mystery/complication to me.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, May 30, 2014 at 05:55:25PM +0900, Kengo NAKAHARA wrote:
> Hello,
> 
> I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this.
> Would you comment following task list?

I think that MSI/MSI-X logically separates into a few pieces, what do
you think about these pieces?

1 An MI API for establishing mailboxes (or doorbells or whatever
  we may call them).  A mailbox is a special physical address (PA) or
  PA/data-pair in correspondence with a callback (function, argument).

  An MI API for mapping the mailbox into various address spaces,
  but especially the message-signalling devices.  In this way, the
  mailbox API is a use or an extension of bus_dma(9).

  Somewhere I have a draft proposal for this MI API, I will try to
  dig it up.

2 For each platform, an MD implementation of the MI mailbox API.

3 Extensions to pci(9) for establishing message-signalled interrupts
  using either a (function, argument) pair, a PA, or a (PA, data) pair.
  I am pretty sure that the implementation of these extensions can be
  MI.

 + [amd64 MD]  refactor INTRSTUB
   - currently, it walks the interrupt handler list in assembly code
 - I want to use NetBSD's list library, so I want to convert this assembly
   code to C code.

I support converting much of the interrupt dispatch code to C from
assembly.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, Jun 06, 2014 at 12:40:54PM -0500, David Young wrote:
 On Fri, May 30, 2014 at 05:55:25PM +0900, Kengo NAKAHARA wrote:
  Hello,
  
  I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this.
  Would you comment following task list?
 
 I think that MSI/MSI-X logically separates into a few pieces, what do
 you think about these pieces?
 
 1 An MI API for establishing mailboxes (or doorbells or whatever
   we may call them).  A mailbox is a special physical address (PA) or
   PA/data-pair in correspondence with a callback (function, argument).
 
   An MI API for mapping the mailbox into various address spaces,
   but especially the message-signalling devices.  In this way, the
   mailbox API is a use or an extension of bus_dma(9).
 
   Somewhere I have a draft proposal for this MI API, I will try to
   dig it up.

Here is the proposal that I came up with many months (a few years?) ago
with input from Matt Thomas.  I have tried to account for Matt's
requirements, but I'm not sure that I have done so.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981
BUS_MSI(9) NetBSD Kernel Developer's Manual BUS_MSI(9)

bus_msi(9) is a machine-independent interface for establishing in the
machine's physical address space a doorbell that when written with
a particular word, sends an interrupt vector to a set of CPUs.  Using
bus_msi(9), the interrupt vector can be tied to interrupt handlers.

bus_msi(9) is the basis for a machine-independent implementation
of PCI Message-Signaled Interrupts (MSI) and MSI-X, however, the
bus_msi(9) implementation itself is highly machine-dependent.  Any
NetBSD architecture that wants to support PCI MSI should provide a
bus_msi(9) implementation.

bus_msi(9) uses facilities provided by bus_dma(9).

typedef struct _bus_msi_t {
bus_addr_t  mi_addr;
uint32_tmi_data;
uint32_tmi_count;
} bus_msi_interval_t;

int
bus_msi_alloc(bus_dma_tag_t tag, bus_msi_reservation_t *msirp, size_t n,
uint32_t data_min, uint32_t data_max,
uint32_t data_alignment, uint32_t data_boundary, int flags);

Reserve `number' interrupt vectors on up to `ncpumax' CPUs
in the set `cpusetin' and reserve corresponding message
address/message data pairs.  Record the message address/data-pair
reservations in up to `nintervals' consecutive bus_msi_interval_ts
beginning with `interval[0]'; overwrite `rintervals' with
the number of intervals used.  Overwrite `cpusetout' with
the set of CPUs where interrupt vectors were established.

Each bus_msi_interval_t tells a message address, mi_addr,
and the mi_count different 32-bit message data words,
[mi_data, mi_data + mi_count - 1], to write to trigger
mi_count different interrupt vectors.

Each message data interval, [mi_data, mi_data + mi_count - 1]
will satisfy the constraints passed to bus_msi_alloc():
[data_min, data_max] must enclose each interval, each
interval must start at a multiple of data_alignment, and
no interval may cross a data_boundary boundary.  A legal
value of data_alignment (or data_boundary) is either zero
or a power of 2.  When zero, data_alignment (or data_boundary)
has no effect.

`tag' is the bus_dma_tag_t passed by the parent driver via
the bus _attach_args.

`flags' may be one of BUS_DMA_WAITOK or BUS_DMA_NOWAIT.

bus_msi_handle_t
bus_msi_establish(bus_dma_tag_t tag, bus_msi_reservation_t msir, int idx,
const kcpuset_t *cpusetin, int ncpumax, kcpuset_t *cpusetout,
int ipl, int (*func)(void *), void *arg);

Establish a callback (func, arg) to run at interrupt priority
level `ipl' whenever the `idx'th message in `intervals' is
delivered.  Return an opaque handle for use with
bus_msi_disestablish().

You can establish more than one handler at each `idx'.

The correspondence between `idx's and message-address/data
pairs is like this:

idx 0 -> (intervals[0].mi_addr, intervals[0].mi_data)
idx 1 -> (intervals[0].mi_addr, intervals[0].mi_data + 1)
. . .
idx N - 1 -> (intervals[0].mi_addr, intervals[0].mi_data +
intervals[0].mi_count - 1)
idx N -> (intervals[1].mi_addr, intervals[1].mi_data)
idx N + 1 -> (intervals[1].mi_addr, intervals[1].mi_data + 1)
. . .
idx N + K - 1 -> (intervals[1].mi_addr, intervals[1].mi_data +
intervals[1].mi_count - 1)

void
bus_msi_disestablish(bus_dma_tag_t tag, bus_msi_handle_t);

Disestablish the callback established previously with
bus_msi_handle_t.

void
bus_msi_free(bus_dma_tag_t tag, bus_msi_reservation_t msir, int idx, size_t n);

Release intervals allocated with bus_msi_alloc
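
As a usage illustration only: a driver might call the proposed
interfaces roughly as in the sketch below, following the signatures
declared above.  wm_msi_softc, wm_msi_intr, the IPL_NET choice, the
use of kcpuset_attached, and treating a NULL bus_msi_handle_t as
failure are all assumptions made for the sketch; none of this is
working code.

static int wm_msi_intr(void *);         /* hypothetical handler */

static int
wm_msi_setup(struct wm_msi_softc *sc, bus_dma_tag_t tag)
{
        bus_msi_reservation_t msir;
        bus_msi_handle_t handle;
        int error, idx;

        /* Reserve 4 message address/data pairs, no data constraints. */
        error = bus_msi_alloc(tag, &msir, 4, 0, UINT32_MAX, 0, 0,
            BUS_DMA_WAITOK);
        if (error != 0)
                return error;

        for (idx = 0; idx < 4; idx++) {
                /* Run wm_msi_intr(sc) at IPL_NET on any attached CPU. */
                handle = bus_msi_establish(tag, msir, idx,
                    kcpuset_attached, 1, NULL, IPL_NET, wm_msi_intr, sc);
                if (handle == NULL) {
                        bus_msi_free(tag, msir, 0, 4);
                        return ENOMEM;
                }
        }

        /*
         * The driver (or pci_intr(9) on its behalf) would then program
         * the device's MSI/MSI-X registers from the reservation and
         * enable message-signalled interrupts.
         */
        return 0;
}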

Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, Jun 06, 2014 at 07:06:00PM +, Taylor R Campbell wrote:
Date: Fri, 6 Jun 2014 12:56:53 -0500
From: David Young dyo...@pobox.com
 
Here is the proposal that I came up with many months (a few years?) ago
with input from Matt Thomas.  I have tried to account for Matt's
requirements, but I'm not sure that I have done so.
 
 For those ignoramuses among us who remain perplexed by the apparent
 difficulty of using a new interrupt delivery mechanism, could you add
 some notes to your proposal about what driver authors would need to
 know about it and when  how one would use it in a driver?

Driver authors do not need to know anything about bus_msi(9) unless
they're doing something fancy.  bus_msi(9) will be invisible to the
author of a PCI driver because pci_intr(9) will establish the mailbox
and all of that.

bus_msi(9) hides differences between hardware platforms.

 Would all architectures with PCI support bus_msi(9), or would PCI
 device drivers need to conditionally use it?  Why isn't it just a
 matter of modifying pci_intr_map, or calling pci_intr_map_msi like in
 OpenBSD?

For a PCI driver, it *is* a matter of calling pci_intr_map(9), or
whatever NetBSD comes up with for MSI/MSI-X.

 Would there be other non-PCI buses with message-signalled
 interrupts too?

It's conceivable that there are existing non-PCI buses that use
message-signalled interrupts.  Any future bus probably will.

 (Still not having done my homework to study what this MSI business is
 all about, I'll note parenthetically that it seems FreeBSD and OpenBSD
 have supported MSI for a while, and I understand neither why it was so
 easy for them nor what advantage they lack by not having bus_msi(9).)

NetBSD has supported MSI (but not MSI-X) for a while, too, at least on
i386 or x86.

Here are the nice things about MSI/MSI-X in a nutshell: you can have
many interrupt sources per device (IIRC, there are just 4 interrupt
lines on a legacy PCI bus), each interrupt can signal a different
condition (so your interrupt service routine doesn't have to read
an interrupt condition register), you can route each condition on a
device to a different CPU, and the interrupt is a bus-master write
that flushes all of the previous bus-master writes by the same device
(according to the PCI ordering rules), so a driver doesn't have to poll
a device register to land buffered bus-master writes before examining
descriptor rings and other DMA-able regions.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Lockless IP input queue, the pktqueue interface

2014-05-29 Thread David Young
.

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: packet timestamps (was Re: Changes to make /dev/*random better sooner)

2014-04-15 Thread David Young
On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
 What I would like to do is to make per-packet timestamps (which you
 are doing already) more widely useful by moving where the timestamp is
 taken closer to the input interrupt which signals a packet delivery
 and carrying it in each packet's mbuf metadata.  This has a nice effect
 on the rare occasion that the packet is delivered to a socket with a
 consumer with cares about packet arrival times (like an NTP or IEEE 1588
 implementation), but it can also provide a new measure of the performance
 of the network code when making changes to it (how long did it take packets
 to get from the input interface to the socket, or from the input interface
 to the output interface?) which doesn't exist now.  In fact it would be
 nice to have it right now so the effect of the changes you are making
 could be measured instead of just speculating.  I was also imagining that
 the random number machinery would harvest timestamps from the mbufs,
 but maybe only after it is determined if the timestamp is one a user
 is interested in so it didn't use those.

FWIW, based on a suggestion by Dennis, in August I added a timestamp
capability to MBUFTRACE for my use at $DAYJOB.  MBUFTRACE thus enhanced
shows me how long it takes (cycles maximum, cycles average) for packets
to make ownership transitions: e.g., from IP queue to wm0 transmission
queue.  This has been useful for finding out where to concentrate my
optimization efforts, and it has helped to rule-in or -out hypotheses
about networking bugs.  Thanks, Dennis.

Here and there I have also fixed an MBUFTRACE bug, and I have made some
changes designed to reduce counters' cache footprint.  I call my variant
of MBUFTRACE, MBUFTRACE3.  I hope to feed MBUFTRACE3 back one of these
days.

Here is a sample of 'netstat -ssm' output when MBUFTRACE3 is operating
on a box with background levels of traffic---there are two tables,
the first that you will recognize, and the second which is new:

                                      small    ext    cluster
unix   inuse 3  1  1
 arp hold  inuse 8  0  0
 wm8 rx ring   inuse  1024   1024   1024
 wm7 rx ring   inuse  1024   1024   1024
 wm6 rx ring   inuse  1024   1024   1024
 wm5 rx ring   inuse  1024   1024   1024
 wm4 rx ring   inuse  1024   1024   1024
 wm3 rx ring   inuse  1024   1024   1024
 wm2 rx ring   inuse  1024   1024   1024
 wm1 rx ring   inuse  1024   1024   1024
 wm0 rx ring   inuse  1024   1024   1024
 unknown data  inuse 34802  34802  0
 revoked   inuse   1273302  63033  22922
 microsecs/tr   max # transitions   previous owner -> new owner
617 2,068 wm8 cpu5 rxq -> wm8 rx
827   199   udp rx -> revoked
954,719   10,772 defer arp_deferral -> revoked
   132682   route  -> revoked
   70 1,627,735   835,683   unix -> revoked
  241   487,685 3,977  ixg1 rx -> arp
  260   260 1 arp hold -> ixg0 tx
1,410   456,389   772  ixg0 rx -> arp
   22,296 6,712,491 2,082  ixg1 tx -> revoked
  315,846 6,293,761   136  ixg0 tx -> revoked
   516,585   709,193   13  wm4 tx ring -> revoked

There are microseconds in that table, but netstat reads the CPU
frequency, average and maximum cycles/transition from the kernel and
does the arithmetic.  I'm using the CPU cycle counter to make all of the
timestamps.  I'm not compensating for clock drift or anything.

A more suitable display for this information than a table is *probably*
a directed graph with a vertex corresponding to each mbuf owner, and
an edge corresponding to each owner-owner transition.  Set the area
of each vertex in proportion to the mbufs in-use by the corresponding
owner, and set the width of each edge in proportion to the rate of
transitions.  Label each vertex with an mbuf-owner name.  Graphs for
normal/high-performance and abnormal/low-performance machines will have
distinct patterns, and the graph will help to illuminate bottlenecks.
If anyone is interested in programming this, let me know, and I will
describe in more detail what I have in mind.
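
For anyone tempted to try it, here is a sketch of the kind of program
I have in mind; the record layout and the scaling of node size and
edge width are invented, and the output is Graphviz "dot" input.

#include <stdio.h>

struct mowner_node {
        const char *name;               /* mbuf-owner name */
        unsigned long long inuse;       /* mbufs currently in use */
};

struct mowner_edge {
        const char *from, *to;          /* previous owner -> new owner */
        unsigned long long transitions; /* transition count */
};

static void
emit_dot(const struct mowner_node *nodes, size_t nnodes,
    const struct mowner_edge *edges, size_t nedges)
{
        size_t i;

        printf("digraph mbuf_owners {\n");
        for (i = 0; i < nnodes; i++)
                printf("\t\"%s\" [width=%.2f];\n",
                    nodes[i].name, 0.5 + (double)nodes[i].inuse / 1000.0);
        for (i = 0; i < nedges; i++)
                printf("\t\"%s\" -> \"%s\" [penwidth=%.2f];\n",
                    edges[i].from, edges[i].to,
                    1.0 + (double)edges[i].transitions / 100000.0);
        printf("}\n");
}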

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


rethinking resource limits (was Re: Shared resource limit implementation question)

2014-02-20 Thread David Young
On Wed, Feb 19, 2014 at 09:08:59PM +0700, Robert Elz wrote:
 The kernel code for handling resource limits attempts to share the
 limits structure between processes (wholly reasonable, as limits are
 inherited, and rarely changed).  A shared limits struct (which is all
 of them when a new one is created) is marked as !pl_writeable.
 (Then when a process modifies one of its limits, it is given a copy
 of the limits struct, marked pl_writeable that it can modify as needed).

I do not see anything wrong with your analysis, but I only skimmed it.

I skimmed your email expecting for you to mention a problem with process
resource limits that came up several years ago: after a process fork()s,
the child's resource use does not count against the parent's limits, but
it counts against the child's own copy of the parent's resource limits.

Also, we may configure a system-wide limit on the number of processes,
and we may individually limit the number of processes simultaneous
belonging to each user, but there is not a limit to the number of
processes created by a process and its descendants.

All of this means that a user has very little protection against a
program that constantly forks and allocates memory: where N is the
user's process limit, and M the bytes memory limit, the program and its
descendants can use N * M bytes of memory and all N of the processes
available to the user.  In this way a fork bomb can run away with all
of the user's resources, and it might cripple the system, too.

It seems to me that the whole area of resource limits is ripe for
reconsideration, if somebody had the time and level of interest.  These
days it makes more sense to arbitrate access to system resources using
power budgets, noise budgets/limits, and latency goals, than to enforce
some of the traditional limits.  Limits should be enforceable by users
on the processes that run on their behalf.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: ptrdiff_t in the kernel

2013-12-04 Thread David Young
On Wed, Dec 04, 2013 at 12:07:42PM -0200, Lourival Vieira Neto wrote:
  No, stddef.h is not allowed in the kernel.  Symbols from it are
  provided via other means.
 
  I know. In fact, I'm asking if it would be alright to allow that.
  AFAIK, it would be inoffensive if available in the kernel.
 
  Actually, it would be offensive.
 
 Why?

I would also like to know why that would be offensive!

I'm always disappointed when I have to write something like this in order
to share code between the userland and kernel,

#if defined(_KERNEL) || defined(_STANDALONE)
#include <sys/types.h>  /* for bool, size_t---XXX not right? */
#else
#include <stdbool.h>  /* for bool */
#include <sys/types.h>  /* for size_t */
#endif

Apparently, stddef.h is the correct header for size_t, so that is more
properly written like this,

#if defined(_KERNEL) || defined(_STANDALONE)
#include <sys/types.h>  /* for bool, size_t---XXX not right? */
#else
#include <stdbool.h>  /* for bool */
#include <stddef.h>  /* for size_t */
#endif

I would prefer for this to suffice both for the kernel and userland:

#include <stdbool.h>  /* for bool */
#include <stddef.h>  /* for size_t */

ISTM that the reasons things are not that simple are merely historical
reasons, but I am open to other explanations.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: What's an MPSAFE driver need to do?

2013-02-28 Thread David Young
On Thu, Feb 28, 2013 at 09:43:49AM +0100, Manuel Bouyer wrote:
 On Thu, Feb 28, 2013 at 02:29:11AM -0500, Mouse wrote:
  [...]
  Well, assuming rwlock(9) is considered a subset of mutex(9) for the
  purposes of that sentence, I then have to ask, what else is there?
  spl(9), the traditional way, specifically calls out that those routines
  work on only the CPU they're executed on (which is what I'd expect,
  given what they have traditionally done - but, I gather from the
  manpage, no longer do).
  
  This then leads me to wonder how a driver can _not_ be MPSAFE, since
  the old way doesn't/can't actually work and the new way is MPSAFE.
 
 A driver not marked MPSAFE will be entered (especially
 its interrupt routine, but also from upper layers) with
 kernel_lock held. This is what makes spl(9) still work.
 In order to convert a driver using spl(9)-style calls, you have to replace
 spl(9) calls with a mutex of the equivalent IPL level (a rwlock won't work
 for this as it can't be used in interrupt routines, only thread context).

I want to complicate this idea of spl-mutex conversion a bit.  I used
to think that replacing spl calls by mutex calls would block the same
interrupts that traditionally spl blocks.  Then I realized that I'd
been misled both by the lore surrounding spl-mutex conversion, and
by reading (and re-reading) the manual: a mutex initialized with level
`ipl' does NOT necessarily block interrupts.  It will block them if
it is a spin mutex (initialized with one of the hardware interrupt
levels: IPL_VM, IPL_SCHED, IPL_HIGH), but it will not if it is an
adaptive mutex (initialized with one of the software interrupt levels,
IPL_SOFT*).  So things are not 100% symmetrical in mutex land.

Generally you're safe if both your interrupt handlers and your code
running in a normal thread context acquire & release the same
mutex in critical sections.
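
To make that concrete, here is a minimal sketch (not taken from any
real driver; foo_softc, foo_intr, and the choice of IPL_VM are
invented) of an interrupt handler and a thread-context path sharing
one spin mutex:

#include <sys/intr.h>
#include <sys/mutex.h>

struct foo_softc {
        kmutex_t sc_lock;       /* initialized at the handler's IPL */
        /* ... shared device state ... */
};

static void
foo_attach(struct foo_softc *sc)
{
        /* MUTEX_DEFAULT + a hardware IPL yields a spin mutex. */
        mutex_init(&sc->sc_lock, MUTEX_DEFAULT, IPL_VM);
}

static int
foo_intr(void *arg)             /* interrupt context */
{
        struct foo_softc *sc = arg;

        mutex_enter(&sc->sc_lock);
        /* ... examine/update shared state ... */
        mutex_exit(&sc->sc_lock);
        return 1;
}

static void
foo_ioctl_path(struct foo_softc *sc)    /* thread context */
{
        mutex_enter(&sc->sc_lock);
        /* ... touch the same shared state ... */
        mutex_exit(&sc->sc_lock);
}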

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


low-priority xcall(9)s and preemption

2013-02-08 Thread David Young
The xcall(9) manual page says,

 xcall provides a mechanism for making ``low priority'' cross calls.  The
 function to be executed runs on the remote CPU within a thread context,
 and not from a software interrupt, so it can ensure that it is not inter-
 rupting other code running on the CPU, and so has exclusive access to the
 CPU.  Keep in mind that unless disabled, it may cause a kernel preemp-
 tion.

I take that last sentence to mean that a low-priority cross call *may*
preempt a thread on the remote CPU.  Is that correct?

In other words, can we rephrase that, A low-priority cross call may
preempt a thread running on the remote CPU unless preemption is disabled
on that CPU.
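
For what it's worth, here is the sort of low-priority cross call I
have in mind while asking; this is only a usage sketch, and
flush_percpu_state is an invented name:

#include <sys/types.h>
#include <sys/xcall.h>

static void
flush_percpu_state(void *arg1, void *arg2)
{
        /* Runs once on every CPU, in thread context (low priority). */
}

static void
example_broadcast(void)
{
        uint64_t where;

        /* Flags 0 requests the "low priority" (thread-context) variant. */
        where = xc_broadcast(0, flush_percpu_state, NULL, NULL);
        xc_wait(where);         /* block until every CPU has run it */
}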

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: kcpuset(9) questions

2013-02-03 Thread David Young
On Sun, Feb 03, 2013 at 04:22:37PM -0800, Matt Thomas wrote:
 
 On Feb 3, 2013, at 3:33 PM, Mindaugas Rasiukevicius wrote:
 
  Any reason why do you need bitfield based iteration, as opposed to list
  or array based?
 
 Be nice to have a MI method instead a hodgepodge of MD methods.
 
 The CPU_FOREACH method is ugly.

What Matt said. :-)

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: kcpuset(9) questions

2013-02-03 Thread David Young
On Sun, Feb 03, 2013 at 11:33:10PM +, Mindaugas Rasiukevicius wrote:
 David Young dyo...@pobox.com wrote:
   There are kcpuset_attached and kcpuset_running, which are MI.  All ports
   ought to switch to them replacing MD cpu_attached/cpu_running.  They can
   be wrapped into a routine, but globals seem harmless in this case too.
  
  It seems that if they are not wrapped in routines, they should be
  declared differently, e.g.,
  
  extern const kcpuset_t * const kcpuset_attached;
 
 Although we are far from this, but in the long term we would like to
 support run time attaching/detaching of CPUs, so it would not be const.

It would be nice to have the compiler's help to avoid adding/deleting
CPUs to/from kcpuset_attached or kcpuset_running by accident.  Only the
kcpuset_{attached,running} implementation code should be writing those
sets.  Users of kcpuset_{attached,running} should only be reading them.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


event counting vs. the cache

2013-01-17 Thread David Young
It's customary to use a 64-bit integer to count events in NetBSD because
we don't expect for the count to roll over in the lifetime of a box
running NetBSD.

I've been thinking about what these wide integers do to the cache
footprint of a system and wondering if we shouldn't make a couple of
changes:

1) Cram just as many counters into each cacheline as possible.
   Extend/replace evcnt(9) to allow the caller to provide the storage
   for the integer.

   On a multiprocessor box, you don't want CPUs sharing counter
   cachelines if you can help it, but do cram together each individual
   CPU's counters.

2) Split all counters into two parts: high-order 32 bits, low-order 32
   bits.  It's only necessary to touch the high-order part when the
   low-order part rolls over, so in effect you split the counters into
   write-often (hot) and write-rarely (cold) parts.  Cram together the
   cold parts in cachelines.  Cram together the hot parts in cachelines.
   Only the hot parts change that often, so the ordinary footprint of
   counters in the cache is cut almost in half.
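
Here is a rough sketch of what I mean by (2); the type and field
names are invented, and this is only meant to show the hot/cold
split, not a finished API:

#include <stdint.h>     /* <sys/types.h> in the kernel */

/*
 * The low (hot) and high (cold) words live in separate cachelines,
 * each packed together with other counters' hot or cold words.
 */
struct evcnt32 {
        uint32_t *ec_lo;        /* points into a "hot" cacheline */
        uint32_t *ec_hi;        /* points into a "cold" cacheline */
};

static inline void
evcnt32_incr(struct evcnt32 *ec)
{
        if (++*ec->ec_lo == 0)  /* low word rolled over: */
                ++*ec->ec_hi;   /* touch the cold word, rarely */
}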

I suppose you could split counters into four or more parts of 16 or
fewer bits each, and in that way shrink the footprint even further, but
it seems that you would reach a point of diminishing returns very quickly.

Perhaps this has been tried before and found to (not) work reasonably
well?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: event counting vs. the cache

2013-01-17 Thread David Young
On Thu, Jan 17, 2013 at 11:10:24PM +, David Laight wrote:
 On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote:
  
  2) Split all counters into two parts: high-order 32 bits, low-order 32
 bits.  It's only necessary to touch the high-order part when the
 low-order part rolls over, so in effect you split the counters into
 write-often (hot) and write-rarely (cold) parts.  Cram together the
 cold parts in cachelines.  Cram together the hot parts in cachelines.
 Only the hot parts change that often, so the ordinary footprint of
 counters in the cache is cut almost in half.
 
 That means have to have special code to read them in order to avoid
 having 'silly' values.

We can end up with silly values with the status quo, too, can't we?  On
32-bit architectures like i386, x++ for uint64_t x compiles to

addl $0x1, x
adcl $0x0, x

If the addl carries, then reading x between the addl and adcl will show
a silly value.

I think that you can avoid the silly values.  Say you're using per-CPU
counters.  If counter x belongs to CPU p, then avoid silly values by
reading x in a low-priority thread, t, that's bound to p and reads hi(x)
then lo(x) then hi(x) again.  If hi(x) changed, then t was preempted by
a thread or an interrupt handler that wrapped lo(x), so t has to restart
the sequence.
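
In code, the read sequence I am describing would look roughly like
this (a sketch only; it assumes the reading thread is bound to the
counter's CPU, as above):

#include <stdint.h>

static uint64_t
read_split_counter(const volatile uint32_t *hi, const volatile uint32_t *lo)
{
        uint32_t hi0, lo0, hi1;

        do {
                hi0 = *hi;
                lo0 = *lo;
                hi1 = *hi;
        } while (hi0 != hi1);   /* preempted across a rollover: retry */

        return ((uint64_t)hi1 << 32) | lo0;
}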

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: re(4) MAC address

2012-12-28 Thread David Young
On Fri, Dec 28, 2012 at 04:45:14PM +0100, Frank Wille wrote:
 On Fri, 28 Dec 2012 23:33:01 +0900
 Izumi Tsutsui tsut...@ceres.dti.ne.jp wrote:
 
  The attached patch make re(4) always use IDR register values
  for its MAC address.
  
  We no longer have to link rtl81x9.c for eeprom read functions
  and I'm not sure if we should make the old behavoir optional
  or remove completely.
 
 I cannot imagine any case where it is needed. When an EEPROM is present,
 the IDR registers should be initialized with its MAC.
 
 Maybe somebody who owns an re(4) NIC with an EEPROM should confirm that.
 
 
  But for now I think it's almost harmless so please commit
  if it works on re(4) on your NAS boxes.
 
 Unfortunately, there is still a dependency with rtl81x9.c:
 
 rtl8169.o: In function `re_ioctl':
 rtl8169.c:(.text+0x680): undefined reference to `rtk_setmulti'
 rtl8169.o: In function `re_init':
 rtl8169.c:(.text+0x1bc4): undefined reference to `rtk_setmulti'
 
 As this is the only function needed from rtl81x9.c it probably makes
 sense to add rtk_setmulti() and the rtk_calchash macro to rtl8169.c.

Please, don't copy them.  Put them into a module the drivers can share.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: lua(4), non-invasive and invasive parts

2012-12-27 Thread David Young
On Mon, Dec 24, 2012 at 10:43:03AM +0100, Marc Balmer wrote:
 For such more invasive changes, I foresee to use a kernel option, 'options 
 LUA' which will compile such code only when the option is enabled.  It will 
 be commented out by default, besides maybe the ALL kernels.

Why not use a kernel module?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
 On Mon, Dec 10, 2012 at 09:36:35AM -0600, David Young wrote:
  What do people think about setting stricter guidelines for using the
  C preprocessor than the guidelines from the past?  Example guidelines:
 ...
  4 Computed constants.  The result of a function call may not be used
in a case-statement, even if the function evaluates to a constant at
compile time.  You have to use a macro, instead.
 
 The alternative to constants would be C enums.
 However C enums are such 2nd class citizens that they have problems
 of their own.

I'm not sure you mean quite the same thing.  An example of what I mean
by computed constant would be something like f(Y) where Y is some
other constant and f(X) can always be evaluated to a constant at compile
time: f() may not be a function, not even a static/inline function, if
f(Y) appears in a case statement.

(Actually, if f() is an inline function and the compiler optimization
level is turned up, GCC will let you put f(Y) in a case statement.  Turn
the optimization level down, though, and you get a compile error.)
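
A made-up example of the distinction:

/*
 * FLAG_BIT(3) is acceptable as a case label everywhere; the inline
 * function is not, even though it folds to a constant.  (Names are
 * invented for illustration.)
 */
#define FLAG_BIT(n)     (1u << (n))

static inline unsigned
flag_bit(unsigned n)
{
        return 1u << n;
}

void
classify(unsigned v)
{
        switch (v) {
        case FLAG_BIT(3):       /* always compiles */
                break;
        /*
         * case flag_bit(4):    -- "case label does not reduce to an
         *                         integer constant", except when GCC
         *                         happens to fold it at higher -O.
         */
        default:
                break;
        }
        (void)flag_bit(0);      /* keep the sketch warning-free */
}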

  The C preprocessor MUST NOT be used for
  
  1 In-line code: 'static inline' subroutines are virtually always better
than macros.
 
 That rather depends on your definition of better.

It comes down to the ease of reading/understanding/writing a macro like

#define M(x, y) \
do {\
... \
... \
... \
} while (0)

when something like

static inline void
M(int x, int y)
{
...
...
...
}

will do.  The guideline can be re-phrased, reach for a function
before a hairy macro; use a hairy macro only when nothing else will
do.  When I say hairy macro I mean one like WM_INIT_RXDESC() in
sys/dev/pci/if_wm.c: the extra underscores, parens, and backslashes
badly clutter the code.  Was the same code written as a static or static
inline function, first, found wanting, and converted to a macro?  Or was
the author in the habit of using a macro, first?  I'm pretty sure that
the code is a macro for the latter reason.

 a) #define macros tend to get optimised better.

Better even than an __attribute__((always_inline)) function?

 b) __LINE__ (etc) have the value of the use, not the definition.

I certainly don't want to rule out the careful use of __LINE__ or
__func__.

  2 Configuration management: use the compiler  linker to a greater
extent than the C preprocessor to configure your program for your
execution environment, your chosen compilation options, et cetera.
 
 Avoiding #ifdef inside code tends to be benefitial.
 But, IMHO, there isn't much wrong with using #defines in header files
 to remove function calls.

Example?

 Using the compiler gets to be a PITA because of the warning/errors
 about unreachable code.

I wrote the guidelines in 2010 and they sat in a draft form ever since.
I no longer remember what I had in mind when I wrote compiler above.

  3 Virtually anything else. :-)
 
 There are some very useful techniques that allow a single piece of
 source to be expanded in multiple ways.

I don't disagree.  I don't want to discourage the use of the C
preprocessor altogether, just to make sure its use is measured against
the potential headaches.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 10:27:39PM +0200, Alan Barrett wrote:
 On Mon, 10 Dec 2012, David Young wrote:
 What do people think about setting stricter guidelines for using
 the C preprocessor than the guidelines from the past?
 
 Maybe.
 
 The C preprocessor MUST NOT be used for
 
 1 In-line code: 'static inline' subroutines are virtually always better
  than macros.
 
 I disagree with this one.  If you tone it down to SHOULD NOT or

Sure, let's make it SHOULD NOT.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 03:50:00PM -0500, Thor Lancelot Simon wrote:
 On Mon, Dec 10, 2012 at 02:28:28PM -0600, David Young wrote:
  On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
  
   a) #define macros tend to get optimised better.
  
  Better even than an __attribute__((always_inline)) function?
 
 I'd like to submit that neither are a good thing, because human
 beings are demonstrably quite bad at deciding when things should
 be inlined, particularly in terms of the cache effects of excessive
 inline use.

I agree with that.  However, occasionally I have found when I'm
optimizing the code based on actual evidence rather than hunches, and
the compiler is letting me down, always_inline was necessary.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-11-28 Thread David Young
On Wed, Nov 28, 2012 at 07:27:56PM -0500, Greg Troxel wrote:
 
 dhcpd, last I checked, used bpf and not sockets.
 
 If dhcpd is bpf, I would suggest reading the bpf_tap calls in the
 driver.  It could be that if_wm.c has a spurious on.
 
 If it's not, I don't know what's going on.

I'll bet this has something to do with the hardware VLAN tagging.  I
don't think BPF groks the VLAN mbuf tags.

FWIW, I think that hardware VLAN tagging is a lot of pain for no gain
the way that NetBSD is doing it.

Dave 

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Making forced unmounts work

2012-11-27 Thread David Young
On Mon, Nov 26, 2012 at 03:06:34PM +0100, J. Hannken-Illjes wrote:
 Comments or objections?

I'm wondering if this will fix the bugs in 'mount -u -r /xyz' where a
FFS is mounted read-write at /xyz?  Sorry, I don't remember any longer
what the bugs were.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: fexecve, round 3

2012-11-26 Thread David Young
On Mon, Nov 26, 2012 at 10:18:42AM +0100, Martin Husemann wrote:
 Does anyone know of a setup that uses a process outside of a chroot doing
 descriptor passing to a chrooted process?

Yes.  I can point to the same example as Thor has described, but I think
that it is easy to cook up numerous useful examples.

 I wonder if we should disallow that completely (i.e. fail the ancillary
 data send if sender and recipient have different p_cwdi->cwdi_rdir)?

This idea of failing the ancillary data transmission seems unnecessarily
inflexible to me.  I think that if process A has a send descriptors
privilege, and process B has a receive descriptors privilege, and
there is some communications channel from A to B, then A should be
able to send a descriptor to B regardless of the origin or properties
of that descriptor.  B's privileges may not be sufficient to use
certain methods of the descriptor---for example, to fexecve() the
descriptor---but I think that is ok, because B's entire purpose may be
to send the descriptor to a third process that can use the descriptor.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Questions about pci configuration and the mpt(4) driver

2012-11-17 Thread David Young
On Sat, Nov 17, 2012 at 02:18:18AM -0800, Brian Buhrow wrote:
   Hello.  I've been working on an issue with the mpt(4) driver, the
 driver for the LSI Fusion SCSI controller cards and raid cards.  In the
 process of working through the issue, I've discovered that the mpt(4)
 driver is very fragile if the need to reset the hardware arises.  In
 particular, if a hardware reset is done, all of the pci configuration
 registers get zorched, causing interrupt handling to fail and requests to
 get stuck in the driver and hardware's queue.

When does the need for a reset arise?  Is the cause a driver bug or a
hardware/firmware bug?

It sounds to me like you should detach the device (perhaps resetting
in the final stages of the detachment---i.e., before unmapping the
registers) and re-attach it.

Since detachment ordinarily loses all of the software state, you may
need to stash the outstanding requests somewhere that the re-attached
device can find them.

Dave

 I've been looking for
 examples of how to reset the PCI registers after such a reset, but neither
 the OpenBSD or FreeBSD drivers offer a clue.  All BSD drivers I've looked
 at lament the problem, but none provide a solution.  I've considered
 extracting the PCI initialization process from the mpt_pci_attach() routine
 into a separate function that can be called at any time while things are
 running, but there must be a reason this hasn't been done already and why I
 don't see any examples that look obvious to me of any drivers that do this.
 Is it safe to call pci_intr_disestablish() and pci_intr_establish() during
 the course of normal multi-user operation for a particular driver as a
 means of re-attaching interrupts to a device that's forgotten how to
 generate them?  Are there any examples of drivers that do a complete reset
 of the hardware, including pci and pci interrupt settings while continuing
 to operate in multi-user mode?
 -thanks
 -Brian
 

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: [PATCH] fexecve

2012-11-17 Thread David Young
On Sat, Nov 17, 2012 at 12:16:49AM +0100, Rhialto wrote:
 On Thu 15 Nov 2012 at 20:18:56 -0600, David Young wrote:
  Also, enforcing access along effective roots lines may be inflexible
  or unwieldy, maybe a more abstract notion of process coalition is
  better.  Let each new root have a corresponding new coalition, but
  perhaps we should be able to create a new coalition without changing
  root, and change root without changing coalition.
 
 That would make yet another process grouping, confusingly (dis)similar
 to process groups, controlling-terminal groups, sessions, (and am I
 forgetting more perhaps?)

Process groups, controlling-terminal groups, and sessions are not
already confusingly dissimilar from each other?  Perhaps coalitions
could subsume them all: process group, controlling-terminal groups, and
sessions could become coalitions of different privileges  properties.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: [PATCH] fexecve

2012-11-15 Thread David Young
On Thu, Nov 15, 2012 at 04:57:24PM -0500, Thor Lancelot Simon wrote:
 On Thu, Nov 15, 2012 at 09:46:13PM +, David Holland wrote:
  On Thu, Nov 15, 2012 at 11:03:15AM -0500, Thor Lancelot Simon wrote:
 Here is a patch that implements fexecve(2) for review:
 http://ftp.espci.fr/shadow/manu/fexecve.patch

This strikes me as profoundly dangerous.  Among other things, it
means you can't allow any program running in a chroot to receive
unix-domain messages any more since they might get passed a file
descriptor to code they should not be able to execute.
  
  I have two immediate reactions to this: (1) being able to pass
  executables to something untrusted in a controlled manner sounds
  useful, not dangerous
 
 Sorry to cherry-pick one more point for the moment:  Considered in a vacuum,
 I agree with your reaction #1 above.  The problem is that there is a great
 deal of existing code in the world which receives file descriptors and which
 is not designed with the possibility that they might then be used to exec.
 
 With that history, I don't see a clear way to make this safe (for example
 by restricting which descriptors can be passed to chrooted processes) 
 without breaking code that assumes it can pass file descriptors without such
 restrictions.

Why restrict what descriptors can be passed?  It seems that you could
restrict what you can with the descriptors after they are passed.

It seems like something like the following can be made to work:

Label a file descriptor with the root that was in effect when it was created
by, say, open(2).  The effective root will never change over the
lifetime of that descriptor.

Call the root of a descriptor z, root(z).

Let fexecve(zfd, ...) compare the root of the kernel file descriptor
corresponding to zfd with the effective root and return EPERM if they're
unequal.

Say that process 1 with effective root pqr opens an executable,
fd = open(./setuidprog, ...). Call fd's corresponding kernel
descriptor z.  Now process 1 passes z to process 2, whose effective
root is stu != pqr.  Process 2 tries to fexecve() the descriptor, but
root(z) != stu so fexecve() returns EPERM.
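
In kernel pseudo-C, the check I am proposing amounts to something
like the sketch below; root_of_descriptor() and
effective_root_of_process() are hypothetical helpers standing in for
whatever bookkeeping records root(z) at open time, and none of this
is existing code:

#include <sys/errno.h>

struct proc;
struct file;
struct vnode;

/* Hypothetical helpers, not existing kernel interfaces. */
struct vnode *root_of_descriptor(struct file *);        /* root(z) */
struct vnode *effective_root_of_process(struct proc *);

static int
fexecve_root_check(struct proc *p, struct file *fp)
{
        if (root_of_descriptor(fp) != effective_root_of_process(p))
                return EPERM;   /* descriptor crossed a chroot boundary */
        return 0;
}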

Maybe we can weaken fexecve()'s requirement on the effective root of z
to root(z) must be reachable from the effective root, but I think that
that might be much more complicated.

fexecve() isn't the only call on which you may want to enforce a
root(descriptor) == effective root restriction.  You may want to
enforce it on read(2) and write(2), too.

Also, enforcing access along effective roots lines may be inflexible
or unwieldy, maybe a more abstract notion of process coalition is
better.  Let each new root have a corresponding new coalition, but
perhaps we should be able to create a new coalition without changing
root, and change root without changing coalition.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: ETHERCAP_* ioctl()

2012-11-01 Thread David Young
On Wed, Oct 31, 2012 at 07:24:51PM +0900, Masanobu SAITOH wrote:
  Hi, all.
 
  I sent the followin mail more than two years ago.
 
  http://mail-index.netbsd.org/tech-kern/2010/07/28/msg008613.html
 
  As the starting point to solve this problem, I committed the change to
 add SIOCGETHERCAP stuff.
 
  Example:
  msk0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1500
  ec_capabilities=5<VLAN_MTU,JUMBO_MTU>
  ec_enabled=0
  address: 00:50:43:00:4b:c5
  media: Ethernet autoselect
  status: no carrier
  wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
  
  capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
  
  enabled=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
  ec_capabilities=7<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU>
  ec_enabled=0
  address: 00:1b:21:58:68:34
  media: Ethernet autoselect (1000baseT 
  full-duplex,flowcontrol,rxpause,txpause)
  status: active
  inet 192.168.1.5 netmask 0xff00 broadcast 192.168.1.255
  inet6 fe80::21b:21ff:fe58:6834%wm0 prefixlen 64 scopeid 0x2
  inet6 2001:240:694:1:21b:21ff:fe58:6834 prefixlen 64
 
 
  What do you think about this output?

I think that these flags belong within a service hatch rather than on
the dashboard.  That is, shown via sysctl or ifconfig -v instead of in
the normal output of ifconfig.

What are the use-cases for reading/changing these flags?  I don't see
what an operator is supposed to do with this new information and with
these new controls.

I am curious whether these flags are good for anything except diagnosing and
working around driver bugs?  I ask because I don't think the operator
can ordinarily make a better selection of hardware-capability flags than
the OS can, except insofar as the OS has bugs and forces the user to
work around them.  BTW, I think that it is the same for the checksum
offload / TSO flags as for the ethernet capability flags, but I guess
that we're kind of stuck with the checksum/TSO flags by now.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


[patch] MI root filesystem detection

2012-10-04 Thread David Young
This change to sys/kern/init_main.c:rootconf_handle_wedges()
lets the kernel select a partition in a BSD disklabel using the
BTINFO_BOOTWEDGE-type information from the bootloader if the _BOOTWEDGE
information matches no dk(4) instance.

At least on i386, the bootloader passes both BTINFO_BOOTWEDGE-
and BTINFO_BOOTDISK-type information to the kernel, but the
_BOOTWEDGE information supercedes the _BOOTDISK information: thus the
booted_partition is left at 0, and if the kernel fails to match the
_BOOTWEDGE info to a wedge it blithely selects the 0th partition on the
booted_device for its root filesystem.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981
--- sys/kern/init_main.c2012/09/05 20:48:18
+++ sys/kern/init_main.c2012/10/02 19:59:05
@@ -835,7 +853,7 @@
daddr_t startblk;
uint64_t nblks;
device_t dev; 
-   int error;
+   int error, partition;
 
if (booted_nblks) {
/*
@@ -882,6 +900,36 @@
if (dev != NULL) {
booted_device = dev;
booted_partition = 0;
+   return;
+   }
+   if (booted_nblks == 0)
+   return;
+
+   /*
+* Use the geometry to try to locate a partition.
+*/
+   vp = opendisk(booted_device);
+
+   if (vp == NULL)
+   return;
+
+   error = VOP_IOCTL(vp, DIOCGPART, &dpart, FREAD, NOCRED);
+   VOP_CLOSE(vp, FREAD, NOCRED);
+   vput(vp);
+   if (error)
+   return;
+
+   for (partition = 0; partition < MAXPARTITIONS; partition++) {
+
+   p = &dpart.disklab->d_partitions[partition];
+
+   startblk = p->p_offset;
+   nblks    = p->p_size;
+
+   if (startblk == booted_startblk && nblks == booted_nblks) {
+   booted_partition = partition;
+   break;
+   }
}
 }
 


Re: WAPBL/cache flush and mfi(4)

2012-08-24 Thread David Young
On Fri, Aug 24, 2012 at 10:38:50PM +0200, Manuel Bouyer wrote:
 On Fri, Aug 24, 2012 at 04:26:07PM -0400, Thor Lancelot Simon wrote:
   I think in this case you have to flush both: if you flush only the
   disks, the data you want to be on stable storage may still be in the
   controller's cache.
  
  That doesn't make sense to me.  If you consider the controller cache
  to be stable storage, then you clearly need to flush only the disks'
  caches for all the data expected to be in stable storage to actually
  be in stable storage.
 
 Immagine the following scenario:
 - wapbl writes to its journal.
 - mfi(4) sends the write to controller, which keeps it in its
   (battery-backed) cache and return completion of the command
 - wapbl requests a cache flush
 - mfi(4) translate this to a disk cache flush (but not controller cache
   flush).
 - the controller sends a cache flush to disk. at this time, the data wapbl
   cares about is still in the controller's cache
 - some time later, the controller flushes its data to disks. Now the
   data from wapbl is in the unsafes disks caches, and not in the controller
   cache any more.
 
 So you still need to flush the controller's cache before disks caches,
 otherwise data can migrate from safe storage to unsafe one.

Will a controller really empty its cache into the attached disks'
caches, or will it issue the disk writes, wait for the disks to
acknowledge that the data is on the platter, and then empty the cache?

I have the following vague idea in mind for how an operating system
should treat disk writes: it seems to me that our disks subsystem(s)
should treat streams of disk writes kind of like TCP sessions in
that the receiver, which is either an instance of some disk driver
(e.g., sd(4)) or a non-volatile cache, tells the sender (some user
process that write(2)s, a filesystem, or the pager) that it is open to
receive up to X megabytes.  The sender sends the receiver X-megabytes'
worth of bufs, but holds onto a copy of the bufs itself until each is
acknowledged.  Ordinarily an acknowledgement will come back saying you
may go ahead and send me Y more kilobytes, sender.  A sender may also
get a NACK (sorry, the backup disk was unplugged before it acknowledged
that buffers P, Q, and R hit the media); then it has to indicate the
exception or else retransmit the buffers.

Here and there in the system you will have software (a filesystem) or
hardware (a battery-backed cache) that proxies disk-write streams.  A
filesystem will proxy because it's probably going to either serialize
writes (say to write them to a journal) or to augment them (say to
update corresponding metadata).  Typically a filesystem will proxy, too,
because we don't expect for a user process to block in write(2) until
all the bytes written have landed on the platter.  A battery-backed
cache will proxy because it's going to guarantee disk-write completion
to the sender.

I have the following doubt about a battery-backed cache: what if I
yank the disk?  I have never met a controller with battery-backed
cache where I could not pull some of the disks right out of the front
of the chassis.  I guess that usually those disks were redundant,
too.  So, what if I yank two disks? :-) It seems like receivers and
proxy receivers ought to advertise the guarantees that they do and do
not make (e.g., I guarantee that barring disk-yankage, I will put
your bytes on the platter OR barring power failure or disk-yankage
and non-replacement, I will put your bytes on the platter), and
senders' requirements ought to be matched to receivers' guarantees when a
disk-write session is established.
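
To make the shape of the interface I am imagining a little more
concrete, here is a sketch; every name below is invented, and nothing
like this exists in the tree:

#include <sys/types.h>

struct buf;

/* What a receiver (driver or non-volatile cache) exports to senders. */
struct dwstream_ops {
        /* bytes the receiver will accept right now ("send me X MB") */
        size_t  (*dws_credit)(void *cookie);
        /* submit a buf; the sender retains it until acked or nacked */
        int     (*dws_submit)(void *cookie, struct buf *bp);
};

/* What a sender registers to hear back about its bufs. */
struct dwstream_callbacks {
        /* "on stable storage; you may send `credit' more bytes" */
        void    (*dwc_ack)(void *cookie, struct buf *bp, size_t credit);
        /* "did not reach the media"; retransmit or raise the exception */
        void    (*dwc_nack)(void *cookie, struct buf *bp, int error);
};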

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


PCI MSI musings

2012-07-15 Thread David Young
I'm writing to share some thoughts I've had about PCI Message-Signaled
Interrupts (MSI), MSI-X, and their application:

Establishment of an MSI/MSI-X handler routine customarily happens
in stages like this:

a) Query MSI/MSI-X device capabilities: find out any
   limitations to MSI/MSI-X message data  message address.
   E.g., 16-bit message width with 0..n^2-1 lower bits
   reserved (MSI), either 32-bit or 64-bit address width.

b) Establish a mapping,
   (message address, message data) - (CPU(s),
   handler routine), in the interrupt controller (e.g.,
   IOAPIC) and in the CPU (interrupt vector table).

c) Program the MSI/MSI-X registers with the message data 
   address.

d) Enable MSI.

MSI/MSI-X are really useful when we use them in a customary mode, but
I think that there are useful ways that we can modify stage (b), above: 

1) Device chaining: one PCI bus-master processes a memory buffer
   and, when it has finished processing, triggers processing by a
   second device.  For example, a cryptographic coprocessor and a
   network interface (NIC) share a network buffer.  The cryptographic
   coprocessor encrypts the buffer and signals completion by sending
   a message.  The target of the message is a memory location
   corresponding to the NIC register that either triggers DMA
   descriptor-ring polling or advances the descriptor-ring tail pointer.

2) Device polling 1: low-cost polling for coprocessor completions:
   say that you have a userland driver for a PCI 3D graphics
   coprocessor whose pattern of operation is to write a list of
   triangles to render into memory that is shared with the device,
   to issue a Draw Polygons command, to do other work until the
   command completes, and to repeat.  The driver tests for completion
   of commands by polling a memory-mapped device register.  Usually
   polling a register is a costly operation at best.  At worst,
   polling may introduce variable latency: the host CPU may have
   to retry its transaction once or more while a PCI bus bridge
   forwards pending PCI transactions upstream.

   In a much more efficient arrangement, the userland driver polls
   a memory word that is the target for the coprocessor's
   message-signaled completion interrupts.  At least on DMA-coherent
   systems like x86, the memory word can be cached, so polling it
   is quite cheap.

3) Device polling 2: like above, but let us say that you have drivers
   polling a bunch of NICs.  Instead of polling with register reads, let
   them check a shared word for changes.

4) Timer invalidation: sometimes reading hardware time sources involves
   register reads that are costly.  If I have an application that uses
   the current time often but that doesn't need the time with equal
   accuracy as the time source provides, then the app may spend an
   inordinate amount of time reading and re-reading the registers of the
   time source.

   If the time source can be programmed to interrupt at intervals
   corresponding to the accuracy of time that your application
   wants, and if the source supports MSI, then we can direct its
   interrupt messages to a memory word that the app can treat as
   a cache invalidated flag:  when the app needs the current
   time, it refers to the flag.  If the flag is 0, then it reads
   the current time from the time-source registers and caches it.
   If the flag is 1, then it reads the current time from its cache.
   Let the interrupt's message data be 0, so that signalling the
   interrupt invalidates the app's cache.
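   (A sketch of the application side of this idea appears after the list.)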

5) I have been turning over and over in my head the idea that if there
   are no processes eligible to run on a CPU except for a userland
   device driver, if we want that device driver to wake and process an
   interrupt with very low latency, if we are allergic for some reason
   to spinning while waiting for the interrupt, and if MSI is available,
   then maybe on x86 we can MONITOR/MWAIT the cacheline containing an
   MSI target in the last few instructions of a return to userland.  The
   CPU will just hang there until either there is some other interrupt
   (the hardclock ticks, say) or the message signalling the interrupt
   lands.

   Granted, I may have described such a rare alignment of conditions
   that this is never worth it.  The latency of waking a CPU from
   its MWAIT may be very long, too: I think that typically MWAIT
   is used to put the CPU into a power-saving state.  I think that
   the amount of power-saving is adjustable, though.

   I think on most x86 CPUs, MONITOR/MWAIT are only available in
   the privileged context, so another problem is that you may have
   to MWAIT right on the brink of a kernel-user return.
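
Returning to idea (4): the application side might look like the
following sketch.  msi_flag is assumed to be the word targeted by the
timer's message-signalled interrupt (message data 0), and
read_time_registers() stands in for the costly register read; both
are invented names.

#include <stdint.h>
#include <sys/time.h>

static volatile uint32_t msi_flag;      /* 0 = cache stale, 1 = cache valid */
static struct timeval cached_time;

struct timeval read_time_registers(void);       /* hypothetical, costly */

struct timeval
current_time(void)
{
        if (msi_flag == 0) {            /* the interrupt wrote 0 here */
                cached_time = read_time_registers();
                msi_flag = 1;           /* cache is valid again */
        }
        return cached_time;
}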

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


avoiding bus_dmamap_sync() costs

2012-07-12 Thread David Young
 by the `ops' argument won't
perform the DMA synchronization (`dmaops') as a side effect,
then bus_dmamap_barrier() just has to do the equivalent of a
bus_dmamap_sync(..., dmaops).

So that's my current thinking about bus_dma(9).  Please let me know
your thoughts.

Dave

[1] Or bounce buffers are involved.

[2] Store buffer is Intel terminology.  Write buffer is AMD terminology
for the same thing.

[3] RUMP is unpopular at $DAYJOB for various reasons.  One reason
is that there is not a MKRUMP option for disabling it, so it is
necessary to wait for it to build and install even if it isn't
wanted.  Another reason is that sometimes changes made to the kernel
have to be replicated in RUMP, and having to double any effort
is both expensive and demoralizing.  Please don't read this as a
criticism of RUMP overall, just a wish for some improvements in
modularity and code sharing.

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: avoiding bus_dmamap_sync() costs

2012-07-12 Thread David Young
On Thu, Jul 12, 2012 at 09:05:10PM -0500, David Young wrote:
 At $DAYJOB I am working on a team that is optimizing wm(4).
 
 In an initial pass over the driver, we found that on x86,
 bus_dmamap_sync(9) calls issued some unnecessary LOCK-prefix
 instructions, and those instructions were expensive.  Some of the
 locked instructions were redundant---that is, there were effectively
 two in a row---and others were just unnecessary.  What we found by
 reading the AMD  Intel processor manuals is that bus_dmamap_sync() can
 be a no-op unless you're doing a _PREREAD or _PREWRITE operation[1].
 _PREREAD and _PREWRITE operations need to flush the store buffer[2].
 The cache-coherency mechanism will take care of the rest.  We will
 feed back a patch with these changes and others just as soon as local
 NetBSD-current tree is compilable[3].
 
 In a second pass over the driver, a teammember noted that even with
 the bus_dmamap_sync(9) optimizations already in place, some of the
 LOCK-prefix instructions were still unnecessary.  Just for example, take
 this sequence in wm_txintr():
 
 status =
 sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
 if ((status & WTX_ST_DD) == 0) {
 WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
 BUS_DMASYNC_PREREAD);
 break;
 }
 
 Here we are examining the status field of a Tx descriptor and, if we
 find that the descriptor still belongs to the NIC, we synchronize
 the descriptor.  It's correct and persuasive code, however, the x86
 implementation will issue a locked instruction that is unnecessary under
 these particular circumstances.
 
 In general, it is necessary on x86 to flush the store buffer on a
 _PREREAD operation so that if we write a word to a DMA-able address and
 subsequently read the same address again, the CPU will not satisfy the
 read with store-buffer content (i.e., the word that we just wrote), but
 with the last word written at that address by any agent.
 
 In these particular circumstances, however, we do not modify the
 DMA-able region, so flushing the store buffer is not necessary.
 
 Let us consider another processor architecture.  On some ARM variants,
 the _PREREAD operation is necessary to invalidate the cacheline
 containing the descriptor whose status we just read so that if we come
 back and read it again after a DMA updates the descriptor, content from
 a stale cacheline does not satisfy our read, but actual descriptor
 content does.
 
 One idea that I have for avoiding the unnecessary instruction on x86
is to add an MI hint to the bus_dmamap_sync(9) API, BUS_DMASYNC_CLEAN.
 The hint tells the MD bus_dma(9) implementation that it may treat the
 DMA region like it has not been written (dirtied) by the CPU.  The code
 above would change to this code:
 
 status =
 sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
 if ((status & WTX_ST_DD) == 0) {
 WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
 BUS_DMASYNC_PREREAD);

Oops, line should be:

 BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN);

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


software interrupts scheduling oddities

2012-07-05 Thread David Young
Maybe this has happened to you: you tune your NetBSD router for fastest
packet-forwarding speed.  Presented with a peak packet load, your
router does really well for 30 seconds.  Then it reboots because the
user-tickle watchdog timer expires.  OR, your router doesn't reboot but
you cannot change any parameters because the user interface doesn't get
any CPU cycles.  This is the problem that I am dealing with today: while
the system is doing a lot of network processing, the userland doesn't
make any progress at all.  Userland seems to be starved of CPU cycles
because there is a non-stop software-interrupt load caused by the high
level of network traffic.  At least on i386, if there is any software
interrupt pending, then it will run before any user process gets a
timeslice.  So if the softint rate is really high, then userland will
scarcely ever run.  Or that is my current understanding.  Is it incorrect?

Ordinarily, under high packet load, processes that I need to stay
interactive, such as my shell, freeze up after the network load reaches
a certain level.  If I change the scheduling (class, priority) for my
shell to (SCHED_RR, 31), then the shell stays responsive, even though it
still runs at a lower priority than softints.  Ok, so maybe that makes
sense: of all the userland processes, my shell is the only one running
with a real-time priority, so if there are any cycles leftover after the
softints run, my shell is likely to get them.

I thought that maybe, if I run every process at (SCHED_RR, 31) so that
my shell has to share the leftover cycles with every other user process,
then my shell will freeze up again under high packet load.  I set
every user process to (SCHED_RR, 31), though, and the shell remained
responsive.

I'm using the SCHED_M2 scheduler, btw, on a uniprocessor.  SCHED_M2 is
kind of an arbitrary choice.  I haven't tried SCHED_4BSD, yet, but I
will.

I don't really expect for changing any process class/priority to
SCHED_RR/31 to make any difference in the situation I describe, so there
must be something that I am missing about the workings of the scheduler.

One more thing: userland processes get a priority bump when they enter
the kernel.  No problem.  But it seems like a bug that the kernel will
also raise the priority of a low-priority *kernel* thread if it, say,
waits on a condition variable.  I think that happens because cv_wait()
calls cv_enter(..., l) that sets l-l_kpriority, which is only reset by
mi_userret().  Kernel threads never go through mi_userret() so at some
point the kernel will call lwp_eprio() to compute an effective priority:

static inline pri_t
lwp_eprio(lwp_t *l)
{
        pri_t pri;

        pri = l->l_priority;
        if (l->l_kpriority && pri < PRI_KERNEL)
                pri = (pri >> 1) + l->l_kpribase;
        return MAX(l->l_inheritedprio, pri);
}

Since my low-priority kernel thread has lower priority than PRI_KERNEL,
and l_kpriority is set, it gets bumped up.  Perhaps lwp_eprio() should
test for kernel threads (LW_SYSTEM) before elevating the priority?

static inline pri_t
lwp_eprio(lwp_t *l)
{
	pri_t pri;

	pri = l->l_priority;
	if (l->l_kpriority && pri < PRI_KERNEL && (l->l_flag & LW_SYSTEM) == 0)
		pri = (pri >> 1) + l->l_kpribase;
	return MAX(l->l_inheritedprio, pri);
}

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: software interrupts scheduling oddities

2012-07-05 Thread David Young
On Thu, Jul 05, 2012 at 07:40:11PM -0400, Mouse wrote:
  Maybe this has happened to you: you tune your NetBSD router for
  fastest packet-forwarding speed.  Presented with a peak packet load,
  [...] the user interface doesn't get any CPU cycles.  [...]  [I]f
  there is any software interrupt pending, then it will run before any
  user process gets a timeslice.  So if the softint rate is really
  high, then userland will scarcely ever run.  Or that is my current
  understanding.  Is it incorrect?
 
 No, I think.  At least, that's how I'd expect it to work, and I've
 occasionally seen behaviour close enough to that to make me think it's
 reasonably accurate.
 
 I find your discovery about changing a user process's priority making a
 difference surprising.

Me, too.  Before I made that discovery, I had intended to defer packet
processing to a kernel thread that ran at a middling user priority.  The
kernel would shift packet processing from softints to the kernel thread
if it became apparent that the system wasn't switching to userland.
User programs that needed to stay interactive would run at a higher
priority than packet processing; programs that could afford to be
delayed (cron, syslogd, ...) would run at a lower priority than packet
processing.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: {send,recv}mmsg patch.

2012-06-08 Thread David Young
On Thu, Jun 07, 2012 at 10:34:52PM -0400, Christos Zoulas wrote:
 
 Hi,
 
 Linux has grown those two, and claim 20% performance improvement on some
 workloads. Some programs already use them, so we are going to need them
 for emulation anyway...
 
 http://www.netbsd.org/~christos/mmsg.diff

Can you provide some documentation for these calls?

ISTM that {send,recv}mmsg(), {read,write}v(), and AIO could be subsumed
by a general-purpose scatter-gather system call, and that might be a
good direction to go.
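
For anyone following along, the Linux usage looks roughly like this (a
sketch that assumes the Linux prototypes for recvmmsg() and struct
mmsghdr; it is not documentation of the proposed NetBSD interface):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>

#define	NPKT	8
#define	PKTLEN	2048

static void
drain(int s)
{
	struct mmsghdr msgs[NPKT];
	struct iovec iov[NPKT];
	static char bufs[NPKT][PKTLEN];
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < NPKT; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = PKTLEN;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* One system call can return up to NPKT datagrams at once. */
	n = recvmmsg(s, msgs, NPKT, 0, NULL);
	if (n == -1)
		return;
	for (i = 0; i < n; i++)
		printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
}

Presumably the claimed 20% comes from amortizing one system call over a
whole batch of datagrams.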

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Fixing pool_cache_invalidate(9), final step

2012-06-04 Thread David Young
On Mon, Jun 04, 2012 at 10:34:06AM +0100, Jean-Yves Migeon wrote:
 [General]
 - Dumped pool_cache_invalidate_cpu() in favor of pool_cache_xcall()
 which transfers CPU-bound objects back to the global pool.
 - Adapt comments accordingly.

pool_cache_xcall() seems to describe how the function works rather than
what it does.  Is it a new part of the pool_cache(9) API?  If so, I
think that the name should say what the function does.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: lwp resource limit

2012-06-03 Thread David Young
On Wed, May 23, 2012 at 07:37:19PM -0400, Christos Zoulas wrote:
 Hello,
 
 This is a new resource limit to prevent users from exhausting kernel
 resources that lwps use.
 
 - The limit is per uid
 - The default is 1024 per user unless the architecture overrides it
 - The kernel is never prohibited from creating threads
 - Exceeding the thread limit does not prevent process creation, but
   it will prevent processes from creating additional threads. So the
   effective thread limit is nlwp + nproc
 - The name NTHR was chosen to follow prior art
 - There could be atomicity issues for setuid and lwp exits
 - This diff also adds a sysctl kern.uidinfo.* to show the user the uid
   limits
 
 comments?
 
 christos
 Index: kern/init_main.c
 ===
 RCS file: /cvsroot/src/sys/kern/init_main.c,v
 retrieving revision 1.442
 diff -u -p -u -r1.442 init_main.c
 --- kern/init_main.c  19 Feb 2012 21:06:47 -  1.442
 +++ kern/init_main.c  23 May 2012 23:19:31 -
 @@ -256,6 +256,7 @@ int	cold = 1;		/* still working on star
  struct timespec boottime;	/* time at system startup - will only follow settime deltas */
  
  int  start_init_exec;/* semaphore for start_init() */
 +int  maxlwp;
  
  cprng_strong_t   *kern_cprng;
  
 @@ -291,6 +292,12 @@ main(void)
  #endif
  	l->l_pflag |= LP_RUNNING;
  
 +#ifdef __HAVE_CPU_MAXLWP
 + maxlwp = cpu_maxlwp();
 +#else
 + maxlwp = 1024;
 +#endif
 +

Configuring the kernel with the preprocessor is just so ... wordy. :-)
Maybe use the linker instead?  E.g., provide a weak alias to a default
implementation of cpu_maxlwp() and a strong alias to the MD override on
architectures that have one.
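
A sketch of what I mean (the __weak_alias() spelling is the usual
<sys/cdefs.h> idiom; the names are illustrative and untested):

#include <sys/cdefs.h>

int	maxlwp;

/* MI default; any port that defines a strong cpu_maxlwp() overrides it. */
int
default_cpu_maxlwp(void)
{

	return 1024;
}
__weak_alias(cpu_maxlwp, default_cpu_maxlwp);

Then main() just does 'maxlwp = cpu_maxlwp();' with no #ifdef at all.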

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Module name - recommendations/opinions sought

2012-04-25 Thread David Young
On Wed, Apr 25, 2012 at 05:52:31PM -0700, Paul Goyette wrote:
 I'm in the process of modularizing the ieee80211 (Wireless LAN)
 code, and would like some feedback on what the module's name should
 be.  I can think of at least three or four likely candidates:
 
   net80211
   ieee80211

I'd vote for one of these, myself, since there's a correspondence with a
directory name in the first case and with a prefix on a lot of the API
names in the second case.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: introduce device_is_attached()

2012-04-16 Thread David Young
On Mon, Apr 16, 2012 at 06:52:28PM +0200, Christoph Egger wrote:
 
 Hi,
 
 I want to introduce a new function to sys/devices.h:
 
 bool device_is_attached(device_t parent, cfdata_t cf);
 
 The purpose is for bus drivers who wants to attach children
 and ensure that only one instance of it will attach.
 
 'parent' is the bus driver and 'cf' is the child device
 as passed to the submatch callback via config_search_loc().
 
 The return value is true if the child is already attached.
 
 I implemented a reference usage of it in amdnb_misc.c to ensure
 that amdtemp only attaches once on rescan.

Don't add that function.  Just use a small amdnb_misc softc to track
whether or not the amdtemp is attached:

struct amdnb_misc_softc {
device_t sc_amdtemp;
};

Don't pass a pci_attach_args to amdtemp.  Just pass it the chipset tag
and PCI tag if that is all that it needs.

I'm not sure I fully understand the purpose of amdnb_miscbus.
Are all of the functions that do/will attach at amdnb_miscbus
configuration-space only functions, or are they something else?  Please
explain what amdnb_miscbus is for.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


loadable verbose message modules (was Re: Kernel panic codes)

2012-04-16 Thread David Young
On Sun, Apr 15, 2012 at 10:57:54PM +1000, Nat Sloss wrote:
 Hi.
 
 I have been working on a program that uses bluetooth sco sockets and I am 
 having frequent kernel panics relating to usb.
 
 I am receiving trap type 6 code 0 and trap type 6 code 2 errors.

I've been thinking that it would be nice if there were more kernel
modules that replaced or supplemented anonymous numbers with their name
or description.  Thus

trap type 6 code 0

and
trap type 6 code 2

would become something like

trap type 6(T_PAGEFLT) code 0

and

trap type 6(T_PAGEFLT) code 2<PGEX_W>

if and only if the module was loaded.  The existing printf() in
trap_print()

printf("trap type %d code %x ...\n", type, frame->tf_err, ...);

would change (just for example) to

printf("trap type %d%s code %x%s ...\n", type, trap_type_string(type),
    frame->tf_err, trap_code_string(type, frame->tf_err), ...);

By default, the number -> string conversion functions,

const char *trap_type_string(int type);
const char *trap_code_string(int type, int code);

would be weak aliases for a single function that returns the empty
string.  The kernel module would override the defaults by providing
strong aliases to actual implementations.
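
For concreteness, the module side might be little more than a table
lookup; this is a sketch only, with illustrative names and an
abbreviated table (the real x86 trap names live in <machine/trap.h>):

#include <sys/cdefs.h>
#include <machine/trap.h>

/*
 * Strong definition supplied by the (hypothetical) verbose module;
 * it overrides the weak empty-string default in the base kernel.
 */
static const char * const trap_type_names[] = {
	[T_PRIVINFLT]	= "(T_PRIVINFLT)",
	[T_PAGEFLT]	= "(T_PAGEFLT)",
	/* ... */
};

const char *
trap_type_string(int type)
{

	if (type < 0 || type >= (int)__arraycount(trap_type_names) ||
	    trap_type_names[type] == NULL)
		return "";
	return trap_type_names[type];
}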

For that weak/strong alias thing to work on a loadable module, I
think that Someone(TM) will need to make the kernel linker DTRT when
a modules with overriding strong aliases is added.  If the module is
not unloadable, Someone(TM)'s work is through.  There are some gotchas
in making the module *un*loadable.  BTW, I also desire this function in the
kernel linker for Other Reasons.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: rbus support one-cardbus only?

2012-04-09 Thread David Young
On Tue, Apr 10, 2012 at 12:01:20AM +0900, KIYOHARA Takashi wrote:
 Hi! all,
 
 
 I have a question.
 
 OPENBLOCKS266 supports two cardbus slots optionaly.
 But I think, MI-rbus not supports multiple cardbus slots.
 [Y/N]?

It supports 1 slot on i386.

I'm trying to get rid of rbus, btw.  I'm already able to run i386
without it.  If you make bus_space_tag_create(9), bus_space_reserve(9),
bus_space_release(9), bus_space_reservation_map(9), and
bus_space_reservation_unmap(9) work on OPENBLOCKS266, then I think that
you can use my patches to avoid rbus.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: pmf(9) clarifications

2012-04-05 Thread David Young
On Wed, Apr 04, 2012 at 07:25:55AM +0100, Iain Hibbert wrote:
 Hi
 
 Regarding pmf(9) API, is it safe to call pmf_device_deregister() if the
 device was not successfully registered as a power handler? The
 documentation does not mention this (though the code looks as if that
 would work fine), nor the device_pmf_is_registered() function which may
 not be actually required? Some driver detach functions use it and some do
 not..

I think that drivers should never call device_pmf_is_registered(), but
sometimes they do anyway.

I don't know if it's safe to pmf_device_deregister() if you haven't
successfully pmf_device_register()'d, but it seems that config_detach()
could help drivers out with the de-registration.

 Also, is it allowed to sleep during suspend/resume?

Yes. fdc(4), for example, sleeps until the disk has stopped I/O or
seeking, and turned off the motor.

 documentation does not mention this, but it seems that shutting down
 cleanly might involve a flush of some kind. (I see that
 pmf_system_suspend() does flush disk caches specifically before the
 suspend, which sidesteps the issue a little)

I think the idea is that by the time pmf_system_suspend() calls
do_sys_sync(), no more buffers can be queued on the filesystems, and
after do_sys_sync(); bus_syncwait(), no more buffers can be queued on
the drivers except by pseudo-disks such as cgd, dk, and raid.  Disk
drivers will ordinarily stop servicing their buffer queue and will order
the hardware to flush its cache.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: CVS commit: src/tests/modules

2012-03-26 Thread David Young
On Mon, Mar 26, 2012 at 11:58:40AM +, YAMAMOTO Takashi wrote:
 hi,
 
  On Mar 25, 2012, at 10:35 PM, YAMAMOTO Takashi wrote:
  
  hi,
  
  doesn't modctl/modload return some error which indicate the reason
  of failure?
  
  EPERM which isn't really useful.
 
 then how about changing it so that it's more useful?

ENXIO seems appropriate.

To add a sysctl to check for MODULAR/non-MODULAR seems unnecessarily
complicated.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Snapshots in tmpfs

2012-03-05 Thread David Young
On Mon, Mar 05, 2012 at 06:14:04AM +, David Holland wrote:
 The problem with that scheme is that you rewrite everything to the
 flash over and over again anytime something changes, which is going to
 generate vastly more write cycles than just using a normal fs.

This scheme doesn't write every time something changes; it writes
periodically.  The number of write cycles over/under a normal fs depends
on the period, on the rate of application writes, and on the proportion
of files changed v. unchanged in a typical period.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Snapshots in tmpfs

2012-02-29 Thread David Young
On Thu, Feb 23, 2012 at 08:04:01PM -0500, Thor Lancelot Simon wrote:
 On Fri, Feb 24, 2012 at 12:45:32AM +, David Holland wrote:
  On Thu, Feb 23, 2012 at 11:20:18PM +, David Holland wrote:
  Is CHFS really suitable for CompactFlash?  Is LFS even usable?
 
 No 

I thought the whole point of chfs was to be able to operate on raw
flash devices that don't have their own flash translation layer.
  
  Oh, my mistake, since there was concern about filesystem type I
  thought you were talking about raw flash, but apparently CompactFlash
  is not raw flash, same as USB sticks aren't.
  
  In that case, just use wapbl.
 
 That doubles the write rate for the common "create new version of
 file and rename into place" pattern...
 
 Translation layer or not, doubling the write rate to any type of
 flash is not a great idea.

One way to hold writes to flash down to a very low rate is to keep files
that change in a tmpfs, and everything else in a read-only FFS.

Sometimes the files that change need to persist across reboots and power
failures.  One way to make them persist is to periodically write a
checkpoint of the tmpfs containing those files to flash.  After a reset
or power failure, use the last checkpoint to restore the tmpfs.

One way to store the checkpoints is to reserve a partition on flash for
receiving them.  You don't put a filesystem on the checkpoint partition,
but you treat it like a (circular) tape with big blocks.  Ideally, the
block size is a multiple of the biggest block size that the flash uses.

To create a checkpoint of your tmpfs, first you create a (possibly
read-only) snapshot of it: in this way you can write a self-consistent
checkpoint, containing the tmpfs contents at a moment in time, without
suspending tmpfs activity.  Write the checkpoint to the first half of
the checkpoint partition with something like this:

{
checkpoint_header   # writes checkpoint magic, a checkpoint
                        # generation number, checkpoint date & time
cd $tmpfs_mountpoint
pax -w . | gzip
checkpoint_trailer  # SHA1 sum of previous
} | dd obs=$big_block_size seek=$checkpoint_offset of=$checkpoint_partition

Finally, destroy the snapshot.

Write checkpoints to alternate halves of the checkpoint partition: the
2nd checkpoint to the 2nd half of the checkpoint partition, the 3rd to
the 1st half, 4th to the 2nd half, and so on.

The latest complete checkpoint is the one with the greatest generation
number of all checkpoints with a correct sum.
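
The restore path can be just as plain.  Roughly (a sketch;
verify_checkpoint is a made-up helper that checks the magic, generation
number, and SHA1 sum, and strips the header and trailer from the stream):

dd if=$checkpoint_partition ibs=$big_block_size skip=$checkpoint_offset |
    verify_checkpoint |
    gunzip |
    (cd $tmpfs_mountpoint && pax -r)

Run that against whichever half holds the latest complete checkpoint,
right after mounting the (empty) tmpfs.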

(It's possible to be fancy, reserving space both for complete
checkpoints and for partials---think partial backups.)

This checkpoint scheme has the interesting property that once the kernel
part, the tmpfs snapshots, is done, you can write the rest using a
Bourne shell script, and there are countless alternate scripts that you
could write.  Also, you can write the checkpoints at the full bandwidth
of whichever device receives them, which can be very fast indeed!

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Snapshots in tmpfs

2012-02-23 Thread David Young
On Thu, Feb 23, 2012 at 07:58:11AM +, David Holland wrote:
 On Wed, Feb 22, 2012 at 08:17:15AM -0600, David Young wrote:
   On Wed, Feb 22, 2012 at 01:42:45PM +0100, Manuel Wiesinger wrote:
*)
What is it good for? The only practical use I can imagine are
backups on thin clients, which operate without a hard disk. But this
is clearly far-fetched, in my eyes.
   
   It's good for writing checkpoints of a tmpfs to non-volatile (NV)
   storage in an embedded system where writing to the NV storage is costly
   (it wears out, or it is slow, or both).  When you have a snapshot, you
   can stream it to NV storage using pax(1).  This is the best practical
   way that I can think of in NetBSD at this time.
 
 other than, say, chfs or lfs?

Is CHFS really suitable for CompactFlash?  Is LFS even usable?

 That sounds like a horrible hack, anyhow, and prone to dying horribly
 if you crash or lose power in the middle of a writeback. (plus you'd
 want to use rsync to transfer, or so I'd think, or rewriting
 unmodified blocks will burn write cycles faster than not bothering to
 do anything special.)

I agree that whatever you have in mind sounds like a horrible hack. :-)

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Snapshots in tmpfs

2012-02-22 Thread David Young
On Wed, Feb 22, 2012 at 01:42:45PM +0100, Manuel Wiesinger wrote:
 *)
 What is it good for? The only practical use I can imagine are
 backups on thin clients, which operate without a hard disk. But this
 is clearly far-fetched, in my eyes.

It's good for writing checkpoints of a tmpfs to non-volatile (NV)
storage in an embedded system where writing to the NV storage is costly
(it wears out, or it is slow, or both).  When you have a snapshot, you
can stream it to NV storage using pax(1).  This is the best practical
way that I can think of in NetBSD at this time.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


project idea: auto-size SYMBTAB_SPACE

2012-02-04 Thread David Young
I have some scripts that I use to build ~every kernel for ~every
architecture that uses PCI, and a recent commit by Christos reminded me
just how many times I had to fiddle with SYMTAB_SPACE in several kernels
just to keep the script from failing halfway through.

Maybe somebody with strong (linker?) fu can set the SYMTAB_SPACE
automatically and save developers lots of SYMTAB_SPACE-fiddling in the
future?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


MI non-interlocked atomic ops?

2012-01-24 Thread David Young
Does NetBSD run on any processor architectures where it is difficult
or impossible in the kernel[1] to provide non-interlocked atomic
operations?  That is, operations that are atomic with respect to other
operations *on the same processor*, but possibly divisible by operations
on other processors?

On x86, creating non-interlocked variants of atomic operations is
usually a small matter of leaving the LOCK prefix off of an instruction,
because traps, interrupts, et cetera, only occur between instructions.
(Do correct me if I am wrong about this!)  I figure that it is more
complicated on a RISC machine where there may be no/few instructions
that read-modify-write RAM.

I ask because the other day I stumbled on the statistics-counting
function mbstat_type_add(),

void
mbstat_type_add(int type, int diff)
{
	struct mbstat_cpu *mb;
	int s;

	s = splvm();
	mb = percpu_getref(mbstat_percpu);
	mb->m_mtypes[type] += diff;
	percpu_putref(mbstat_percpu);
	splx(s);
}

which spends (on x86) a small but not immeasurable amount of time in
splx(s), so I was tempted to rewrite it like this:

void
mbstat_type_add(int type, int diff)
{
	struct mbstat_cpu *mb;

	mb = percpu_getref(mbstat_percpu);
	atomic_add_uint_ni(&mb->m_mtypes[type], diff);
	percpu_putref(mbstat_percpu);
}

There is no such routine as atomic_add_uint_ni(), though, and I don't
know if every NetBSD architecture can supply that routine.
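
On x86, at least, the non-interlocked flavor could be the locked one
minus the prefix.  A sketch (illustrative only; the name
atomic_add_uint_ni is my invention):

static inline void
atomic_add_uint_ni(volatile unsigned int *p, unsigned int v)
{

	/*
	 * A single read-modify-write instruction, so it is indivisible
	 * with respect to traps and interrupts on this CPU, but there
	 * is no LOCK prefix, so other CPUs may interleave.
	 */
	__asm volatile("addl %1, %0" : "+m" (*p) : "ir" (v));
}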

Dave

[1] Or in userland, too, I guess, but I'm mainly interested
in the kernel.

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: New bus_space routine: bus_space_sync

2012-01-20 Thread David Young
On Fri, Jan 20, 2012 at 11:18:38AM +0100, Manuel Bouyer wrote:
 On Thu, Jan 19, 2012 at 08:45:41PM +0100, Martin Husemann wrote:
  Even if originally intended for something else, like Matt says, wouldn't it
 
 Why do you think BUS_SPACE_BARRIER_SYNC was intended for something else ?
 I can't see how a write barrier that doesn't ensure the write has
 reached the target (main or device memory) can be usefull.

My understanding of BUS_SPACE_BARRIER_SYNC is that no read issued
before the barrier may satisfy or follow any read after the barrier,
and no write before the barrier may follow or be combined with
any write after the barrier.  Likewise, no read or write before
the barrier may follow a write or read, respectively, after the
barrier.  The reads and writes do NOT have to be completed when
bus_space_barrier(...BUS_SPACE_BARRIER_SYNC...) returns.

My interpretation of the manual is not very literal, but I believe
that it's a fair description of what to expect on any non-fanciful
implementation of bus_space(9) for memory-mapped PCI space, where writes
can be posted.

bus_space_barrier() is used so little that it may be better to document
the semantics that are useful and feasible, and make sure that the
implementations guarantee those semantics, than to spend a lot of time
on the interpretation.
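
In driver terms, the interpretation above promises ordering, not
completion.  For example (tag, handle, and register names made up):

/*
 * The barrier keeps the two writes from being reordered or combined,
 * but neither write is guaranteed to have reached the device when
 * bus_space_barrier() returns.
 */
bus_space_write_4(bst, bsh, FOO_ADDR_REG, addr);
bus_space_barrier(bst, bsh, FOO_ADDR_REG, 8, BUS_SPACE_BARRIER_WRITE);
bus_space_write_4(bst, bsh, FOO_DATA_REG, data);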

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: New bus_space routine: bus_space_sync

2012-01-20 Thread David Young
On Fri, Jan 20, 2012 at 06:57:59PM +, Eduardo Horvath wrote:
 The semantics seem pretty clear to me.  Now we may have a bunch of buggy 
 implementations, but the man page seems pretty clear to me.

Eduardo,

Oh, good grief.

I realize that is what it SAYS, what I am saying is that perhaps that
is not what the author INTENDED, or else the author believed that what
they wrote was EASIER TO IMPLEMENT than it in fact is.  In other words,
I'm making certain ALLOWANCES for mistakes, mistakes which, BTW, don't
really SURPRISE ME, because we are all after all HUMAN.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Equivalent of Linux Workqueues

2012-01-11 Thread David Young
On Wed, Jan 11, 2012 at 10:24:33AM +, Emmanuel Dreyfus wrote:
 Hello
 
 Another caveat with DADHI porting: that require something like 
 Linux Workqueues feature:
 http://www.kernel.org/doc/htmldocs/device-drivers/ch01s06.html
 (It only uses schedule_work, cancel_work_sync and flush_work).
 
 This is about queuing function execution, cancel it, or wait for it 
 to complete,  Do we have something similar in our kernel, or should
 I implement a dedicated thread with a queue of functions to run?

Others have already mentioned workqueue(9).  I find it difficult to use,
myself.  Sometimes I have used softint(9), instead, as in ixgbe(4) where
I ported a FreeBSD driver that used taskqueues to NetBSD.
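
The pattern is small; roughly this, with ixgbe-ish names (a sketch,
error handling omitted):

/* At attach time: one softint per queue, MP-safe, network priority. */
que->que_si = softint_establish(SOFTINT_NET | SOFTINT_MPSAFE,
    ixgbe_handle_que, que);

/* From the hardware interrupt handler: defer the bulk of the work. */
softint_schedule(que->que_si);

/* At detach time. */
softint_disestablish(que->que_si);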

On a related note, it would be very nice if somebody would write an
implementation of kcont(9) that Matt Thomas proposes at
http://www.netbsd.org/~matt/smpnet.html#kcont.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Network drivers and memory allocation in interrupt context

2011-12-08 Thread David Young
On Thu, Dec 08, 2011 at 07:06:29PM -0700, Sverre Froyen wrote:
 Hi,
 
 I now have a semi-working pool_cache based memory allocator for network 
 drivers (tested using the iwn driver). I am uncertain, however, about how to 
 get it fully functioning. The issue is memory allocation in interrupt context.

Is there some reason that the code in ixgbe(4) cannot be adapted to your
needs?  For ixgbe(4), I had to solve the precise problem that stops you,
now.

 3) Rewrite the network drivers so that they do not request memory in
 interrupt context.

This might be a good idea, however, there are a lot of network drivers
to rewrite! :-)

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Importing chewiefs

2011-12-02 Thread David Young
On Thu, Nov 24, 2011 at 11:56:13AM +0100, Magnus Eriksson wrote:
 Why not simply ChipFS?
 
 Seems more reasonable pronunciation-wise than the alphabet soup that
 is see-age-eff-ess or the cough and spit of ch-fs.

I may be chiming in too late, but I suggest calling the filesystem
SneezeFS, because chfs makes me think of the sound of a sneeze. :-)

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: language bindings (fs-independent quotas)

2011-11-17 Thread David Young
On Thu, Nov 17, 2011 at 09:07:05AM -0600, Eric Haszlakiewicz wrote:
 On Thu, Nov 17, 2011 at 03:51:52PM +0100, Manuel Bouyer wrote:
  On Thu, Nov 17, 2011 at 01:02:33PM +, David Holland wrote:
   Writing language bindings for a simple and straightforward library is
   a simple and straightforward undertaking.
  
  OK, so prove it by writing a perl binding format :)
  I've never written a language binding, so it's not going to be
  straightforward for me anyway.
 
 It's probably not as hard as you think:
 http://www.swig.org/tutorial.html
 
 See especially the SWIG for the truly lazy section, where you basically
 just need to point it at a header file.
 Of course, this will result in a rather raw binding, and often times it can
 be useful to customize things to make them more natural for a particular
 language.

I don't think it matters whether it is simple and straightforward to
create a language binding or not.

The advantage to using some standard format for quotas, be it
tab-delimited tables or plists, is that if you know the standard tools
for that format, you can whip up scripts that process it in useful ways.
No language bindings necessary.
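
For example, with a stock XML toolkit and no bindings at all, something
like this pulls one limit out of a plist (the plist layout here is made
up for illustration):

xmllint --xpath \
    'string(//dict/key[.="block-limit"]/following-sibling::integer[1])' \
    quotas.plist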

Nhat Minh Lê made a great start on stream-oriented XML tools during his
GSoC 2009 project.  IMO, time spent fighting over plists v. simple and
straightforward libraries is time better spent creating decent tools
for current formats like XML & JSON.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: fs-independent quotas (binary plists)

2011-11-17 Thread David Young
On Thu, Nov 17, 2011 at 05:22:24AM -0500, Matthew Mondor wrote:
 On Thu, 17 Nov 2011 10:50:17 +0100
 Manuel Bouyer bou...@antioche.eu.org wrote:
 
  In this context, text format means a key/value pair format, in which
  some keys are optionnal and values can be of arbitrary types. Maybe you can
  do this with a binary format too, but it doesn't exists yet.
 
 This reminds me that years ago someone implemented support to save
 plists in a binary format[1] (this doesn't necessarily mean that it
 would help solve this problem, though).  But I'm surprised that since
 all these years the support wasn't added; anyone know if there is
 general resistance to an optional compact and portable binary format,
 and if so, the reasons?

You know, I *thought* that there was general resistance to a binary
format because it had been repeated so often, but recently I re-read the
actual discussion that occurred and it seemed that there was *not* much
resistance at all.

I think that sometimes people bring up prior absolute prohibitions or
intractable resistance because they are too easily discouraged or else
they don't want to argue an issue on its merits.

 If such a format was supported, it wouldn't be harder to machine or
 human-process (proplib could be used as it is now for code, and bplists
 could be easily exported to an xml format as requested to edit in an
 editor, i.e. via a viplist, plistctl or such command (which also could
 use advisory locking, of course, and save back to binary format if the
 system is configured to use a binary format).  In theory, it could also
 increase performance, and a binary format would be simpler to parse by
 the kernel than xml, minimizing bugs...

I do think that a binary plist format would be a handy option; however,
a binary plist is not as useful by itself as an XML plist, because there
is less that you can do to it without specialized tools.

I have said it before, but I would hate to see further reduplication of
editors in UNIX, and that includes a plist editor.  We need a way to
compose filters with editors!

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: pool_cache_invalidate(9) (wrong?) semantic -- enable xcall invalidation

2011-11-17 Thread David Young
On Fri, Nov 18, 2011 at 01:09:49AM +0100, Jean-Yves Migeon wrote:
 - force all xcall(9) API consumers to pass dynamically allocated
 arguments, a bit like workqueue(9) enqueues works. Scheduling
 xcall(9) is now managed by a SIMPLEQ() of requests.
 
 - extends softint(9) API so we can pass arguments to it as well as
 the targetted CPU(s) (optional argument).
 
 The last two points make me think that the softint(9), workqueue(9)
 and xcall(9) APIs have a potential for unification; all of these are
 somewhat redundant, they all schedule/signal/dispatch stuff to other
 threads, albeit under different conditions though.

See Kernel Continuations in http://www.netbsd.org/~matt/smpnet.html.
IIRC, the document does not contemplate putting xcall(9) under
the proposed kernel-continuations framework, but it does mention
both softints and workqueues as candidates for unification under
continuations.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981

