Re: kcpuset(9) questions

2013-02-01 Thread Jukka Ruohonen
On Fri, Feb 01, 2013 at 06:25:24PM -0600, David Young wrote:
  There was no use case, when I added it.  Can you describe your use case?
  Usually we iterate all CPUs with CPU_INFO_FOREACH() anyway (which should
  also be replaced with a MI interface, but that requires non-trivial
  invasion into all ports).

Another use case is iterating the processor sets for per-CPU-group power
states.

- Jukka.


Re: lua(4), non-invasive and invasive parts

2012-12-28 Thread Jukka Ruohonen
On Fri, Dec 28, 2012 at 10:05:36AM +0100, Marc Balmer wrote:
 If, however, existing software is to use Lua as part of its
 implementation, it needs to be made lua(4) aware, because it is going to
 use the lua(4) API.  If the existing software is a kernel module, it needs
 to record a dependency on the lua(4) kernel module. 

Why is this a problem? Most kernel modules have dependencies. 

 The source code of software using lua(4) needs to be modified, which is
 why I called this scenario invasive.  The lua(4) aware parts can be
 put into #ifdef LUA/#endif sections if a LUA configuration option is being
 used.

This sounds like a moot point since there is no lua(4) yet nor any software
using it...

- Jukka.


Re: lua(4), non-invasive and invasive parts

2012-12-28 Thread Jukka Ruohonen
On Fri, Dec 28, 2012 at 10:05:36AM +0100, Marc Balmer wrote:
 Using a kernel module is not possible in all cases.

On a closer look, you have:

+#ifdef LUA
+MODULE(MODULE_CLASS_DRIVER, gpiosim, "gpio,lua");
+#else
 MODULE(MODULE_CLASS_DRIVER, gpiosim, "gpio");
-
+#endif
 #ifdef _MODULE

What does this mean? Will the kernel modules using lua(4) also be
conditionally compiled? I think this is fairly strongly against the design
principles of module(7).

- Jukka.


Re: Path to kernel modules (second attempt)

2012-07-07 Thread Jukka Ruohonen
On Sat, Jul 07, 2012 at 08:57:10PM +0100, Mindaugas Rasiukevicius wrote:
 Regarding the PR/38724, I propose to change the path to /kernel/.
 Can we reach some consensus quickly for netbsd-6?

I'd vote for /lib/modules noted in the PR (or maybe under /libdata?)
simply because in my opinion the root hierarchy has already been abused too
much in NetBSD. On the other hand, I don't see anything wrong with /stand
either.

Two cents,

- Jukka.


Re: link-sets in modules

2012-05-29 Thread Jukka Ruohonen
On Tue, May 29, 2012 at 03:00:58PM -0700, Paul Goyette wrote:
 Well, at least for sysctl's SYSCTL_SETUP() stuff, you probably don't 
 want to use the same initialization call for modules as is used for 
 built-ins.  The built-ins are initialized with an explicit NULL argument 
 passed for the sysctl_clog argument, which makes it difficult for a 
 module to do its clean-up.  Modular code needs to (or at least, should?) 
 pass a non-null module-specific clog so it can be used during an undo 
 at MODULE_CMD_UNLOAD time.

Indeed I consider it a bug if a module uses SYSCTL_SETUP() (and does not
tear down the nodes after unload). This applies more generally to drivers
too.

- Jukka.


Re: introduce device_is_attached()

2012-04-17 Thread Jukka Ruohonen
On Mon, Apr 16, 2012 at 07:49:28PM +0100, Iain Hibbert wrote:
 I'm kind of with David Young, surely this is what the softc is for.. so
 that the parent can keep track of which resources it has allocated (and by
 inference, not reallocate them to another device)

I agree with this too; numerous drivers/frameworks use the above scheme.
If you add this function, please refactor these to follow this new
(superfluous) idiom.

- Jukka.


Re: CVS commit: src/tests/modules

2012-04-16 Thread Jukka Ruohonen
On Mon, Mar 26, 2012 at 12:10:30AM -0700, Matt Thomas wrote:
  doesn't modctl/modload return some error which indicate the reason
  of failure?
 
 EPERM which isn't really useful.

Oddly enough, it actually fails with EPERM on Sparc, but with ENOSYS on Xen.
Manuel pointed out that it might be kobj_load_vfs(), kobj_load_mem(), or
kobj_stat() that returns ENOSYS.

- Jukka.


Re: CVS commit: src

2012-03-14 Thread Jukka Ruohonen
On Wed, Mar 14, 2012 at 09:55:21AM +, Martin Husemann wrote:
 This seems to cause deadlocks in the *fs_rename_dir tests.

Also the page residency check written by thorpej years ago now fails for the
first time.

- Jukka.


Re: A simple cpufreq(9)

2011-09-30 Thread Jukka Ruohonen
On Fri, Sep 30, 2011 at 11:27:46AM -0500, David Young wrote:
 I don't think that the division of responsibility for power management
 between kernel & userland is obvious.

It may not be, but the arguments against kernel-level implementation are
largely practical. There is no one size fits all. The kernel can not decide
when the screen is too dark to read or what the battery level should be.
In short: I think there are too many variables to do it in the kernel.

I think almost a consensus was reached about making a rc.conf(5)-like
configuration file for powerd(8):

http://mail-index.netbsd.org/tech-userlevel/2011/05/06/msg005009.html

Some imaginary examples:

suspend_lid=YES                 # Suspend when the lid is closed
suspend_button=YES              # Activate suspend-button

battery_backlight=90            # Backlight (%) when AC is off
battery_stop_daemons=bluetooth  # Daemons to stop when AC is off

None of these really belong to the kernel.

 How do you hope for cpufreq(9) to be used?

I have a patch ready that transforms the existing MD implementations to use
it, so at first iteration, it will provide a consistent user interface via
cpuctl(8). Later, different governors can be implemented.

 While reading the API and discussion, it occurred to me that if
 cpufreq(9) is chiefly used for making power/performance trade-offs,
 maybe the API should be concerned with the goal (power savings) instead
 of an independent variable (frequency).

I think it is more about utilizing increase() and decrease() functions
in well chosen sections of kernel code. Or rather, pass them via a governor
that does the selection whether it is plausible to frequency--.

 Then maybe you can use one API---cpupm(9)?---to set the objective, and let
 the implementation choose the variables (C-state, P-state, frequency) to
 tweak.

Obviously each one may exist regardless of the other. For instance, currently
in NetBSD only P-states are (or can be) used. There are no C-states in
arch/macppc. Generally, for practical reasons, I'd vote for a bottom-up
approach here. If someone later realizes that all this fits to a perfect
single API, I am all for it.

But to read between the lines, I think you are approaching what could be
called power management quality of service. Shameless Linux-plug again,
but the slides are worth a look:

http://elinux.org/images/f/f9/Elc2008_pm_qos_slides.pdf

- Jukka.

PS. P-states == frequency.


Re: A simple cpufreq(9)

2011-09-29 Thread Jukka Ruohonen
On Thu, Sep 29, 2011 at 03:36:03PM -0500, David Young wrote:
 What's the difference in power savings between changing C-state and
 changing frequency?  Do the power savings from every change in C-state
 dominate the savings from any change in frequency?

Depends on the machine. But generally on x86, C-states now appear to be the
dominant form. These go side by side, though. To cut a long story short: the
general (hardware) idea is that while a few CPUs are in a deep C-state (i.e.
idle), a group of other CPUs can enter a high-performance P-state. The net
result should be an increase in performance, despite the power management.

But obviously ARM, for instance, may do all this differently, using only
frequency scaling.

 It seems that ultimately we need an API for telling a power-savings goal
 and constraints (latency, throughput, battery life, the screen isn't too
 dark to read) for the system to meet.  Do you hope for someone to build
 that into the kernel on top of cpufreq(9)?

Not really; as I've written before, my opinion is that most of this should
be in the user space. The CPU PM is an exception for obvious reasons. There
could be a more involved API for cpu_idle(9) though (cf [1]).

- Jukka.

[1] The Linux cpuidle-subsystem; http://lwn.net/Articles/384146/


Re: A simple cpufreq(9)

2011-09-26 Thread Jukka Ruohonen
On Mon, Sep 26, 2011 at 10:03:06AM -0500, David Young wrote:
 Instead, provide an API routine for finding out the number of states
 (nstates) and a routine for selecting a state [0, nstates - 1].

The code is ready and it is available in [1]. However, I can not complete it
because when trying to upgrade, I encounter PR kern/45361.

All existing drivers were converted, except ichlpcib(4) and piixpcib(4)
(for these, I think SpeedStep should first be split off as a child
device of the bridge).

This breaks COMPAT_50 of cpuctl(8). How to handle that? #ifdefs?

- Jukka.

[1] ftp://ftp.NetBSD.org/pub/NetBSD/misc/jruoho/codeanddiff

* * *

CPUFREQ(9) NetBSD Kernel Developer's Manual  CPUFREQ(9)

NAME
 cpufreq, cpufreq_register, cpufreq_deregister, cpufreq_suspend,
 cpufreq_resume, cpufreq_get, cpufreq_set, cpufreq_set_all -- interface
 for CPU frequency scaling

SYNOPSIS
 #include <sys/cpufreq.h>

 int
 cpufreq_register(struct cpufreq_if *cif);

 void
 cpufreq_deregister(void);

 void
 cpufreq_suspend(struct cpu_info *ci);

 void
 cpufreq_resume(struct cpu_info *ci);

 void
 cpufreq_get(struct cpu_info *ci, uint16_t *freq);

 int
 cpufreq_get_if(struct cpufreq_if *cif);

 void
 cpufreq_set(struct cpu_info *ci, uint16_t freq);

 void
 cpufreq_set_all(uint16_t freq);

DESCRIPTION
 The machine-independent cpufreq interface provides a framework for CPU
 frequency scaling done by a machine-dependent backend implementation.
 User space control is available via cpuctl(8).

 The cpufreq interface is a per-CPU framework.  It is implicitly assumed
 that the frequency can be set independently for all processors in the
 system.  However, cpufreq does not imply any restrictions upon whether
 this information is utilized by the actual machine-dependent
 implementation.  It is possible to use cpufreq with frequency scaling
 implemented via pci(4).  In addition, it is assumed that the available
 frequency levels are shared uniformly by all processors in the system,
 even when it is possible to control the frequency of individual
 processors.

 It should be noted that the cpufreq interface is generally stateless.
 This implies for instance that possible caching should be done in the
 machine-dependent backend.  The cpufreq_suspend() and cpufreq_resume()
 functions are exceptions.  These can be integrated with pmf(9).

FUNCTIONS
 cpufreq_register(cif)
  The cpufreq_register() function initializes the interface by
  associating a machine-dependent backend with the framework.
  Only one backend can be registered.  Upon successful completion,
  cpufreq_register() returns 0 and sets the frequency to the maxi-
  mum available level.

  The following elements in the cpufreq_if structure should be
  filled prior to the call:

        char                    name[CPUFREQ_NAME_MAX];
        struct cpufreq_state    state[CPUFREQ_STATE_MAX];
        uint16_t                state_count;
        bool                    mp;
        void                   *cookie;
        xcfunc_t                get_freq;
        xcfunc_t                set_freq;

  ·   The name of the backend is required.

  ·   The cpufreq_state structure conveys descriptive information
  about the frequency states.  The following fields can be
  used for the registration:

        uint16_t    freq;
        uint16_t    power;

  From these freq (the clock frequency in MHz) is mandatory,
  whereas the optional power can be filled to describe the
  power consumption (in mW) of each state.

  ·   The state_count defines the number of states that the back-
  end has filled in the state array.

  ·   The mp boolean should be set to false if it is known that
  the backend can not handle per-CPU frequency states; changes
  should always be propagated to all processors in the system.

  ·   The cookie field is an opaque pointer passed to the backend
  when cpufreq_get(), cpufreq_set(), or cpufreq_set_all() is
  called.

  ·   The get_freq and set_freq are function pointers that should
  be associated with the machine-dependent functions to get
  and set a frequency, respectively.  The xcfunc_t type con-
  forms to xcall(9).  When the function pointers are invoked
  by cpufreq, the first parameter is always the cookie and
  the second parameter is the frequency, defined as uint16_t *.


Re: A simple cpufreq(9)

2011-09-26 Thread Jukka Ruohonen
On Mon, Sep 26, 2011 at 05:51:13PM +, Christos Zoulas wrote:
 Why advertise uint16_t, are we trying to save memory? I would just do
 them uint32_t...

While few things are certain in computing, I don't think we are going to
see a 65535 MHz processor any time soon. But sure, uint32_t is fine too.

- Jukka.


Re: A simple cpufreq(9)

2011-09-25 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 07:53:47PM +0200, Joerg Sonnenberger wrote:
 I was listing possible decision making factors. Depending on your
 environment, you have all or none of them. The main point is that good
 decision making needs more than just "You can toggle this."

So here is a quick draft for the first iteration with the cpuctl(8). If there
are issues, speak now, otherwise I'll proceed with something based on this.

- Jukka.
/*  $NetBSD$ */

/*-
 * Copyright (c) 2011 Jukka Ruohonen jruoho...@iki.fi
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
#include <sys/cdefs.h>
__KERNEL_RCSID(0, "$NetBSD: subr_cpufreq.c,v 1.15 2011/09/02 22:25:08 Exp $");

#include <sys/param.h>
#include <sys/cpu.h>
#include <sys/cpufreq.h>
#include <sys/kmem.h>
#include <sys/module.h>
#include <sys/mutex.h>
#include <sys/time.h>
#include <sys/xcall.h>

static struct cpufreq_if *cpufreq_if = NULL;
static int	cpufreq_latency(void);

int
cpufreq_register(struct cpufreq_if *cif)
{
	size_t i, j;
	int rv;

	KASSERT(cif != NULL);
	KASSERT(cif->get_freq != NULL);
	KASSERT(cif->set_freq != NULL);
	KASSERT(cif->state_count > 0);
	KASSERT(cif->state_count < CPUFREQ_STATE_MAX);

	mutex_enter(&cpu_lock);

	if (cpufreq_if != NULL) {
		mutex_exit(&cpu_lock);
		return EALREADY;
	}

	mutex_exit(&cpu_lock);
	cpufreq_if = kmem_zalloc(sizeof(*cif), KM_SLEEP);

	if (cpufreq_if == NULL)
		return ENOMEM;

	mutex_enter(&cpu_lock);

	cpufreq_if->cookie = cif->cookie;
	cpufreq_if->get_freq = cif->get_freq;
	cpufreq_if->set_freq = cif->set_freq;

	for (i = j = 0; i < cif->state_count; i++) {

		if (cif->state[i].freq == 0)
			continue;
		else {
			j++;
		}

		cpufreq_if->state[i].freq = cif->state[i].freq;
		cpufreq_if->state[i].power = cif->state[i].power;
	}

	cpufreq_if->state_count = j;
	rv = cpufreq_latency();
	mutex_exit(&cpu_lock);

	return rv;
}

void
cpufreq_unregister(struct cpufreq_if *cif)
{

	mutex_enter(&cpu_lock);

	if (cpufreq_if == NULL) {
		mutex_exit(&cpu_lock);
		return;
	}

	mutex_exit(&cpu_lock);
	kmem_free(cpufreq_if, sizeof(*cif));
}

static int
cpufreq_latency(void)
{
	struct cpufreq_state *state;
	struct timespec nta, ntb;
	const size_t n = 10;
	uint64_t sample;
	size_t i, j;

	/*
	 * Sample the transition latency for each state.
	 */
	for (i = 0; i < cpufreq_if->state_count; i++) {

		state = &cpufreq_if->state[i];
		KASSERT(state->freq > 0 && state->freq < );

		for (j = 0, sample = 0; j < n; j++) {

			nta.tv_sec = nta.tv_nsec = 0;
			ntb.tv_sec = ntb.tv_nsec = 0;

			nanotime(&nta);

			mutex_exit(&cpu_lock);
			cpufreq_set(curcpu(), state[i].freq);
			mutex_enter(&cpu_lock);

			nanotime(&ntb);
			timespecsub(&ntb, &nta, &ntb);

			/*
			 * If the transition latency is measured
			 * in seconds, the backend is not suitable.
			 */
			if (ntb.tv_sec != 0)
				continue;

			sample += ntb.tv_nsec;
		}

		if (sample == 0)
			return EMSGSIZE;

		state->latency = sample

Re: A simple cpufreq(9)

2011-09-25 Thread Jukka Ruohonen
On Sun, Sep 25, 2011 at 10:50:44AM +0200, Alan Barrett wrote:
 On Sun, 25 Sep 2011, Jukka Ruohonen wrote:
 So here is a quick draft for the first iteration with the cpuctl(8). If 
 there
 are issues, speak now, otherwise I'll proceed with something based on this.
 
 You forgot to include the documentation.

It is not production code, but it should be pretty straightforward
to see the design blocks without documentation. Of the issues discussed,
MHz, power reduction (if available), and estimated transition latency are
exported to userland.
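
For readers without the draft at hand, the latency estimate can be
approximated in userland as below. This is a sketch only: it uses
clock_gettime(2) where the kernel draft uses nanotime(9), set_fn and
noop_set are hypothetical stand-ins for the MD routine, and averaging over
the accepted samples is an assumption about how the samples are meant to
be combined.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the MD routine that switches the frequency. */
typedef void (*set_fn)(uint16_t freq_mhz);

/*
 * Time n transitions to freq_mhz and return the mean cost in nanoseconds.
 * Samples that take a second or longer are discarded, since a backend
 * that slow is not suitable; 0 is returned if no sample was accepted.
 */
static uint64_t
latency_ns(set_fn set, uint16_t freq_mhz, size_t n)
{
	struct timespec a, b;
	uint64_t sum = 0;
	size_t used = 0;

	assert(set != NULL);

	for (size_t i = 0; i < n; i++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		set(freq_mhz);
		clock_gettime(CLOCK_MONOTONIC, &b);

		/* Subtract the timestamps, borrowing from the seconds. */
		time_t sec = b.tv_sec - a.tv_sec;
		long nsec = b.tv_nsec - a.tv_nsec;
		if (nsec < 0) {
			sec--;
			nsec += 1000000000L;
		}

		if (sec != 0)
			continue;

		sum += (uint64_t)nsec;
		used++;
	}

	return used != 0 ? sum / used : 0;
}

/* Hypothetical backend that does nothing; useful for a dry run. */
static void
noop_set(uint16_t freq_mhz)
{
	(void)freq_mhz;
}
```

Calling latency_ns(noop_set, 800, 10) measures little more than the clock
overhead, which gives a rough floor for what a real backend would report.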

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 06:10:46AM +, Michael van Elst wrote:
 Pick one low and one high value and select these when the core/cpu/machine
 is idle vs. when the machine is busy. Selecting the low and the high
 value is much easier for a human when you label it in terms of the
 specific platform instead of a misleading percentage.

I agree: this is naturally the simplest and most consistent approach.
(And most of all, it is that especially for the kernel.)

In my opinion this would satisfy well the requirements of CPU frequency
scaling in NetBSD. But do we have strong cases in favor of the intermediate
values? As I see it, always stepping to the highest value when under load
would only trade a minor increase in power consumption for simplicity.

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 04:33:07PM +1000, matthew green wrote:
 i think that joerg's point is that the kernel-user API is the
 wrong place to be making these sorts of humanisations.

You all miss the point that these humanisations (a.k.a. abstractions) are
primarily targeted for the use in sys/kern. The kernel-user API is just a
convenient side product of these abstractions.

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 10:14:13AM +0200, Joerg Sonnenberger wrote:
 For the in-kernel use it should be completely irrelevant what the
 property is.

Sigh. Answer my question then.

 Any automatic mechanism should only care about the associated properties
 (latency, reduction in power consumption etc).  I don't see any of that
 addressed at this point either.

If you read the original email again, it was not even the goal to address this.

But all specific things like latency and the specified reduction in power
consumption are going to be MD-specific implementation details. An MI
implementation should be designed so that it will work also when these are
missing. A basic design principle.

In order to be able to proceed incrementally, you need to start somewhere.

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 08:35:11AM +, Michael van Elst wrote:
 What is wrong with the abstraction of having a number of ordered
 performance states? Hiding the states behind something pretending to
 be a continuum (whether this is a MHz value or a percentage doesn't matter)
 causes confusion. You never know what performance state you selected
 and you start to assume that 100% is twice as fast as 50% even when
 there is no correlation.

Nothing. As I wrote, (an integer) percentage is an ordered scale. I agree:
perhaps not the best one. But you have to satisfy:

1. Boolean scale (already: ichlpcib(4), piixpcib(4), PowerPC).

2. Interval scale (usually x86).

3. Interval scale with nonuniform intervals (can be on x86/ARM).

If I'd have to pick, I would take the first one, as you originally wrote.

- Jukka.

PS.

And this assertion holds: a machine-independent implementation that can not
satisfy the six machine-dependent implementations currently in the tree is
inherently and completely broken. Thus, mixing latencies and whatnot to the
soup can be done later, preferably in the MD implementations.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 10:14:13AM +0200, Joerg Sonnenberger wrote:
 For the in-kernel use it should be completely irrelevant what the
 property is. Any automatic mechanism should only care about the
 associated properties (latency, reduction in power consumption etc). I
 don't see any of that addressed at this point either.

So let's look at this from a different POV.

Here is a datasheet for the reliable information:

1. Maximum and minimum (or on/off).

2. Idle times, etc. (MI stuff).

3. Transition latency (easily simulated).

Here is a datasheet for unreliable and unknown data:

1. Actual clock rate.

   Unreliable.

   We all know what est(4) and powernow(4) are like; no policy
   should rely on any intermediate values that comes out of these
   two. As usual, with ACPI there are a lot of known bugs about
   incorrect values.

   Unknown.

   This is the case with ichlpcib(4) and piixpcib(4). Might be the
   case on some embedded/exotic things as well.

   Maintenance.

   If the information is available from a datasheet, requires
   tabulation for each CPU model.

2. Power per state.

   In the majority of cases this is entirely unknown, and if coming from
   the BIOS, too unreliable.

I am quite familiar with the Linux cpufreq subsystem and I am not convinced
at all that we want something like that. I vote for a simple on/off boolean.

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 07:53:47PM +0200, Joerg Sonnenberger wrote:
 It's not relevant what the exact clock rate is. It's an approximation.
 Just like the TSC frequency won't be measured the same on every boot.

It would be relevant if the interval would be guaranteed to be uniform.

 Ignoring intermediate values can be literally a lot of unwanted noise.
 On my old laptop, I couldn't play all medium quality H.264 streams at
 smallest CPU frequency. It worked with some of the intermediate levels
 and those create sufficiently less heat that it makes a difference in terms
 of fan activity. My point is that not every load switches between
 idle and 100%.

Naturally. But given the lack of proper information, you end up playing a
crazy guessing game with the steppings in order to see whether the frequency
is at a sustainable level w.r.t. the load. This is what the Linux governors are all
about. And Linux people are rewriting the ondemand governor since it really
doesn't work that well, even after all these years.

What a boolean gives you is: simplicity and a bias towards performance
(which I think should be the priority on NetBSD generally). This is traded
for minor power consumption increase and possibly heat. Should be fine for
servers and most laptop users.

And as far as x86 is concerned, the power savings from CPU are really coming
from C-states today. So one can debate whether the 4000 LOC complexity of
Linux's cpufreq subsystem is really worth the trouble.

- Jukka.


Re: A simple cpufreq(9)

2011-09-24 Thread Jukka Ruohonen
On Sun, Sep 25, 2011 at 07:06:35AM +1000, matthew green wrote:
 (FWIW, ultrasparcIIIi has cpufreq features, iirc, it allows the freq
 to run at 1/2 and 1/16th normal.  i'm sure that the modern Fujitsu
 SPARC64 also has it, but i don't know much about it.)

When one thinks about the modern world and especially AMD CPUs that can
already do per-CPU(group) states, setting the minimum and maximum sounds
attractive and reasonable. For instance, if you set CPUs offline, also the
frequency should scale down.

 i'd like to re-iterate what i said earlier though -- i'd really much
 rather this became real-code in the tree sooner than when it becomes
 a perfect API.

Fine. Let's start by importing a proplist of MHzs to cpuctl(8). At least the
current mess is solved.

- Jukka.


A simple cpufreq(9)

2011-09-23 Thread Jukka Ruohonen
Hello.

The kernel needs a MI interface for CPU frequency scaling. Below is a draft
that is deliberately as simple as possible.

This is NOT about frequency scaling done by the kernel as a governor
(although the long-term goal should point to that direction). The present
goal is just to add a simple MI interface (or rather, wrapper) and abstract
away machine- and platform-dependent code and user interfaces.

As far as the implementation goes, this would add two simple MD callback
functions to cpu_info. All ugly MD sysctls would be deprecated and setting
the frequency would be done by cpuctl(8) via these callbacks.[1]

The only interesting detail is that the term CPU frequency is defined as
a percentage value: 100 % implies full performance and 0 % denotes the
lowest performance. This follows OpenBSD, being also one of the few
possible ways to define CPU frequency levels independently from the
machine. All details are left to the MD implementations, including how to
translate the percentage to actual frequency (and/or voltage, etc.). For
instance, for systems with only two states, all values higher than zero would
set the high performance mode.[2]

Comments?

- Jukka.

[1] The currently known users would include x86 (with five (!) different
implementations) and PowerPC, but frequency scaling is nowadays widely
used also in the ARM realm.

[2] Note that even in the x86 land it is no longer necessarily known which
is the exact MHz at which the CPU currently runs (cf. TurboBoost, etc.).

* * *

CPUFREQ(9) NetBSD Kernel Developer's Manual  CPUFREQ(9)

NAME
 cpufreq -- interface for CPU frequency scaling

SYNOPSIS
 #include <sys/cpufreq.h>

 void
 cpufreq_register(cpufreq_get_cb *get, cpufreq_set_cb *set, void *aux);

 bool
 cpufreq_get(struct cpu_info *ci, uint8_t *valp);

 bool
 cpufreq_set(struct cpu_info *ci, const uint8_t val);

DESCRIPTION
 The machine-independent cpufreq interface provides a simple framework for
 CPU frequency scaling.

   1.   The cpufreq interface uses percentage values in place of
        actual frequencies.  Thus, values 100 % and 0 % denote the
        highest and lowest frequency supported by the CPU.  It is the
        responsibility of machine-dependent implementations to
        translate the percentage values to actual frequencies or other
        related performance levels.

   2.   The cpufreq interface is stateless and does no locking while
        calling the machine-dependent callbacks.

   3.   The cpufreq interface is a per-CPU framework.  It is
        implicitly assumed that the frequency can be set independently
        for all processors in the system.  However, cpufreq does not
        imply any restrictions upon whether this information is
        utilized by the actual machine-dependent implementation.


FUNCTIONS
 cpufreq_register(get, set)
  The cpufreq_register() function initializes the subsystem by
  associating the machine-dependent callback functions get and set
  with the machine-independent cpufreq_get() and cpufreq_set(),
  respectively.  The cpufreq_set_cb and cpufreq_get_cb types are
  function pointers defined as:

   bool (*get)(struct cpu_info *ci, void *aux, uint8_t *valp)
   bool (*set)(struct cpu_info *ci, void *aux, const uint8_t val)

  Note that cpufreq does not keep track of the registered call-
  backs.  Each call to cpufreq_register() will override any exist-
  ing callbacks.

 cpufreq_get(ci, valp)
  The cpufreq_get() function obtains the current frequency level
  of the CPU pointed by ci in the parameter valp.

 cpufreq_set(ci, val)
  The cpufreq_set() function sets the performance level of ci to
  val.  The value val is guaranteed to be in the range [0, 100].

CODE REFERENCES
 The cpufreq subsystem is implemented within sys/kern/subr_cpufreq.c.

SEE ALSO
 cpuctl(8)

HISTORY
 The cpufreq subsystem first appeared in NetBSD 6.0.
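
As an illustration of how the three functions fit together, here is a
userland mock of the draft. It is a sketch under assumptions: struct
cpu_info is reduced to a placeholder, the callback types are written as
function types so that the SYNOPSIS prototypes compile as given, and the
twostate_* backend is hypothetical. It follows the rule from the proposal
that, on a two-state system, any value above zero selects the high mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the kernel structure; only a pointer is passed around. */
struct cpu_info {
	int	ci_index;
};

/* Assumed as function types, so "cpufreq_get_cb *" is a function pointer. */
typedef bool cpufreq_get_cb(struct cpu_info *, void *, uint8_t *);
typedef bool cpufreq_set_cb(struct cpu_info *, void *, const uint8_t);

static cpufreq_get_cb	*cb_get;
static cpufreq_set_cb	*cb_set;
static void		*cb_aux;

/* Each call overrides whatever was registered before. */
void
cpufreq_register(cpufreq_get_cb *get, cpufreq_set_cb *set, void *aux)
{

	cb_get = get;
	cb_set = set;
	cb_aux = aux;
}

bool
cpufreq_get(struct cpu_info *ci, uint8_t *valp)
{

	assert(valp != NULL);

	return cb_get != NULL && cb_get(ci, cb_aux, valp);
}

bool
cpufreq_set(struct cpu_info *ci, const uint8_t val)
{

	/* The backend is only ever shown values in the range [0, 100]. */
	if (cb_set == NULL || val > 100)
		return false;

	return cb_set(ci, cb_aux, val);
}

/* Hypothetical MD backend with only two performance states. */
struct twostate {
	bool	fast;
};

static bool
twostate_get(struct cpu_info *ci, void *aux, uint8_t *valp)
{
	struct twostate *ts = aux;

	(void)ci;
	*valp = ts->fast ? 100 : 0;
	return true;
}

static bool
twostate_set(struct cpu_info *ci, void *aux, const uint8_t val)
{
	struct twostate *ts = aux;

	(void)ci;
	ts->fast = (val > 0);	/* anything above zero means "high" */
	return true;
}
```

After cpufreq_register(twostate_get, twostate_set, &ts), setting 35 % and
reading the value back yields 100 %, because the backend has no
intermediate state to report.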


Re: core's decision on modular kernels

2011-09-23 Thread Jukka Ruohonen
On Thu, Sep 22, 2011 at 08:35:02PM +0100, David Laight wrote:
  I think that by MODULAR with built-in modules, you mean a barebones
  kernel linked with some .kmod's?  I would love to see that.  What has to
  happen to make it so?
 
 Probably just some 'round tuits'.
 Mostly in the area of config() and the kernel makefile.
 
 First stage would be linking an existing kmod into the kernel and sorting
 out the required data area linkage to get it initialised.

Wouldn't this provide an answer also to the difficult question of autoloading
driver modules? Assuming a robust mechanism, link everything and then
selectively unload what did not attach during autoconfiguration?

- Jukka.


Re: A simple cpufreq(9)

2011-09-23 Thread Jukka Ruohonen
On Fri, Sep 23, 2011 at 08:42:16PM +0200, Joerg Sonnenberger wrote:
 On Fri, Sep 23, 2011 at 01:02:52PM +0300, Jukka Ruohonen wrote:
  [2] Note that even in the x86 land it is no longer necessarily known which
  is the exact MHz at which the CPU currently runs (cf. TurboBoost, etc.).
 
 TurboBoost is a good reason for why percent is a bad measurement as
 well. In fact, I find it more confusing. If a tool reports to the user
 that NetBSD runs the CPU at 85% during a build.sh -j16, that's going to
 result in surprising questions...

Well, this is not really the case with TurboBoost; while there are a few
somewhat vague means to detect the turbo in the kernel (not as a MHz value
though), this would show in the userland as 100 %, just as it would for
the users of the MI functions.

 Consider making the unit of scaling an additional attribute of the
 list and provide userland with:
 
 id, data, unit
 
 as list, get/set is using the id.

I particularly wanted to avoid importing any lists to the MI kernel parts or
to the userland. Using a percentage like done in OpenBSD would greatly
simplify further uses in the kernel. But of course a unit of measurement
(MHz, voltage, etc.) can be imported for heuristic purposes.

Nor do I see any real difference whether a user sets the CPU frequency to
{ 16 %, 35 %, 100 % } MHz or to { 821, 922, 1657 } MHz, except that the
former is clearer and more user friendly.

Note also that for instance some ARM systems may use very fine grained lists.

- Jukka.


Re: A simple cpufreq(9)

2011-09-23 Thread Jukka Ruohonen
On Sat, Sep 24, 2011 at 07:20:16AM +0200, Joerg Sonnenberger wrote:
 You can not avoid providing a list of available states. Such an
 interface is inherently and completely broken.

Heh, right. Why?

  Nor do I see any real difference whether a user sets the CPU frequency to
  { 16 %, 35 %, 100 % } MHz or to { 821, 922, 1657 } MHz, except that the
  former is clearer and more user friendly.
 
 Sorry, it is not.

So you propose sailing to the dark waters of sysmon_envsys(9)? You need to
export integers (e.g. MHz), booleans (on/off), triplets (low/medium/high),
and so on, all depending on the machine and/or platform.

 Removing data because it is more user friendly kind of misses the point
 that the kernel is not the UI.

The exported data is not reliable. On x86 it is typically rounded and
approximated by the BIOS writers.

  Note also that for instance some ARM systems may use very fine grained 
  lists.
 
 ...and?

Try to think beyond cpuctl(8). This should be done so that there is a
*simple* MI interface that can be extended in the future.

If you want a MI interface, the first thing is to agree upon a common scale.
Frequency scaling is a step function, so you could start from

1. { LOW, HIGH } or { DISABLED, ENABLED }.

Maybe you could then proceed with

2. { LOW, MEDIUM, HIGH }
or
   { 800 MHz, 1200 MHz, 1600 MHz }.

But how about

3. { 800 MHz, 805 MHz, 810 MHz, 815 MHz, 820 MHz, 1300 MHz, 1800 MHz }?

4. And how about some ARM system that may export over 30 states,
   possibly with non-uniform intervals?

Can you outline a consistent algorithm for a user-space or in-kernel governor
to choose a state from these four examples?

As such, a percentage here is nothing more than a scale from zero to one
hundred. It would be the responsibility of the MD implementation to
interpolate this to whatever scale it may be using.
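
As a rough sketch of what such an interpolation could look like, assuming
that the MD code keeps a table of frequencies sorted in ascending order and
that the percentage is read as a fraction of the highest frequency (both
are assumptions, and md_pct_to_state() is a hypothetical name, not an
existing interface):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: map an MI percentage (0..100) to an index in an
 * MD table of frequencies (MHz, sorted in ascending order). The target
 * is pct percent of the highest frequency and the closest table entry
 * wins. On a tie the lower state is chosen.
 */
static size_t
md_pct_to_state(const uint32_t *mhz, size_t n, unsigned int pct)
{
	uint64_t target, diff, best_diff;
	size_t i, best = 0;

	assert(mhz != NULL && n > 0);

	/* Values above the scale are clamped to the top of it. */
	if (pct > 100)
		pct = 100;

	/* Both sides are scaled by 100 to stay in integer arithmetic. */
	target = (uint64_t)mhz[n - 1] * pct;
	best_diff = UINT64_MAX;

	for (i = 0; i < n; i++) {
		uint64_t f = (uint64_t)mhz[i] * 100;

		diff = (f > target) ? f - target : target - f;
		if (diff < best_diff) {
			best_diff = diff;
			best = i;
		}
	}

	return best;
}
```

The same call works unchanged for the two-state case, for the uneven list
in the third example, and for a table with 30 or more entries; only the
table differs between machines.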

But of course it is inherently and completely broken.

- Jukka.


Re: strnlen(3) in kernel

2011-09-06 Thread Jukka Ruohonen
On Tue, Sep 06, 2011 at 01:39:14PM +0200, Jean-Yves Migeon wrote:
 Is there a way to know what functions are available from libkern, and
 those only found in userland libs? Except by looking at libkern.h?

No. While there are memcpy(9) etc., I think we should have a single
libkern(3) with references to the section 3 (with some notes, if necessary).

- Jukka.


Re: autoclean mode for tmpfs

2011-08-07 Thread Jukka Ruohonen
On Sun, Aug 07, 2011 at 03:10:29AM +, David Holland wrote:
 So I just had an idea: since cleaning /tmp on a live system is very
 dangerous unless done so (and even then somewhat dangerous), plus
 there are other possible uses for automatically disappearing files:
 
 How hard would it be to add a mount option for tmpfs to automatically
 drop files after a given timeout? It seems to me that it shouldn't be
 very difficult, but I haven't looked at the tmpfs innards in a while.
 
 Anyone think this is worthwhile?

Sounds like a job for the userland and cron(8).

- Jukka.


Re: autoclean mode for tmpfs

2011-08-07 Thread Jukka Ruohonen
On Sun, Aug 07, 2011 at 07:09:14AM +, David Holland wrote:
   Sounds like a job for the userland and cron(8).
 
 uh no.
 
 See: since cleaning /tmp on a live system is very dangerous

So care to elaborate what is dangerous about it?

I do clean /tmp daily, but it needs to be done selectively.

- Jukka.


Re: pchb@acpi

2011-08-01 Thread Jukka Ruohonen
On Mon, Aug 01, 2011 at 08:59:57PM +0200, Matthias Drochner wrote:
 
 I think it is OK to attach the PCI buses which are defined by ACPI
 at acpi. The attachment frontend can install hooks to get interrupt
 routing right. This would also help wakeup support for eg USB
 and ethernet devices.

Indeed. We need this for all PCI buses and devices. That is why hacks like
device_is_a() etc. won't do. And as you noted, there is an awful lot of ugly
duplication because ACPI is already heavily required for x86 interrupt
routing. That said, I don't think this kind of attachment is required for
the IRQ setup per se (at least not in my branch).

- Jukka.


Re: RFC: New security model secmodel_securechroot(9)

2011-07-25 Thread Jukka Ruohonen
On Sat, Jul 23, 2011 at 09:35:43PM +0300, Aleksey Cheusov wrote:
  * Exec logging within chroot
 What's this?

It has been quite a while since I used Grsecurity, but it logs a message
every time a program is executed within a chroot. This may be useful to
audit chroot'ed daemons, but if I remember correctly, this was a compile-
time option in Linux.

- Jukka.


Re: Dutch keymap not imported into NetBSD :p

2011-07-21 Thread Jukka Ruohonen
 On the X30 I got NetBSD-5.1 and there is no nl
 keymap. Google pointed me to NetBSD problem report
 number 35473.
 
 Could Spanny patch be included into NetBSD-current ?

Yes, I will commit it shortly.

- Jukka. 


Re: RFC: New security model secmodel_securechroot(9)

2011-07-13 Thread Jukka Ruohonen
On Thu, Jul 14, 2011 at 12:07:56AM +0300, Aleksey Cheusov wrote:
  So what is the security policy you mean to enforce by blocking paths
  into the kernel with kauth?  For every `destructive modification' that
  can be done to the system, what is every path into the kernel that
  leads to that modification?
   Have you blocked all such paths in your kauth secmodel?

 I'm open for concrete ideas and references.

I haven't followed the discussion that closely, but the following list appears
in the chroot(2) restrictions of the PaX/Grsecurity (Linux) project:

* No attaching shared memory outside of chroot
* No kill outside of chroot
* No ptrace outside of chroot (architecture independent)
* No capget outside of chroot
* No setpgid outside of chroot
* No getpgid outside of chroot
* No getsid outside of chroot
* No sending of signals by fcntl outside of chroot
* No viewing of any process outside of chroot, even if /proc is mounted
* No mounting or remounting
* No pivot_root
* No double chroot
* No fchdir out of chroot
* Enforced chdir("/") upon chroot
* No (f)chmod +s
* No mknod
* No sysctl writes
* No raising of scheduler priority
* No connecting to abstract unix domain sockets outside of chroot
* Removal of harmful privileges via capabilities
* Exec logging within chroot

- Jukka.


Re: IOC_CPU_SETSTATE

2011-07-04 Thread Jukka Ruohonen
On Sun, Jul 03, 2011 at 10:49:59PM +0100, Alexander Nasonov wrote:
 BTW, intr/nointr is not documented in cpuctl(8).

One possible reason for this is that per-CPU intr/nointr is not yet
supported on e.g. x86, AFAIK.

- Jukka.


Re: add DIAGNOSTIC back to GENERIC/INSTALL

2011-07-03 Thread Jukka Ruohonen
On Sun, Jul 03, 2011 at 07:27:00PM +0200, Manuel Bouyer wrote:
 it's not only about Xen, it's about all kernels for any port which
 already have DIAGNOSTIC and want to keep it even for release
 (e.g. i386 ALL).

As far as I understand, i386/ALL is just for testing the compilation of
various options and drivers. I doubt whether it even boots.

- Jukka.


Re: uvm locking inconsistency

2011-06-15 Thread Jukka Ruohonen
On Wed, Jun 15, 2011 at 09:30:17PM +0200, Manuel Bouyer wrote:
 I fear so, sadly. I think DIAGNOSTIC should be back in x86 GENERIC
 kernels on HEAD (this can be switched off in release branches)

On the contrary, I think every viable debug option (DIAGNOSTIC + LOCKDEBUG at
least) should be enabled in HEAD, but disabled in release kernels. This is an
easy way to catch obvious regressions that should never enter a release kernel.
The so-called HEAD is the main development branch, after all...

- Jukka.


Re: Merge of rmind-uvmplock branch

2011-06-01 Thread Jukka Ruohonen
On Tue, May 31, 2011 at 10:15:36PM +0100, Mindaugas Rasiukevicius wrote:
 Unless anyone objects, I will merge rmind-uvmplock branch.  The technical
 objectives of the branch are described here:

Indeed, and as usual, extraordinary work!

- Jukka.


Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)

2011-05-07 Thread Jukka Ruohonen
On Sat, May 07, 2011 at 09:03:42PM +0200, Jean-Yves Migeon wrote:
 - sysmon_pswitch(9) can still be used to register power switch events,
 these events being modeled following a switch functionality e.g. when
 a threshold is passed.

Yes. Although I don't know what you mean by thresholds.

 - pmf(9) is focused on device states, so it's lower level than
 sysmon_pswitch(9) events. pmf(9) event injection is not supposed to be
 called directly, but rather through sysmon (for switch-like
 functionality), or within pmf(9) itself for inter-device signaling.

No. Device drivers are calling pmf(9) event injections directly. I think
Jared or Jörg should clarify this, but I think the pmf(9) calls you cited
earlier were added to the sysmon routines for compatibility-like reasons.
To be effective, there also needs to be a listener for the injected events.

 So, in the current form, power switches/buttons are not supposed to
 register as devices and implement their own hooks for registration with
 pmf(9)?

I am not sure what you mean by this. For instance, a platform/laptop-specific
driver registers naturally with pmf(9), but it may also use the sysmon
routines for various tasks (e.g. some hotkeys are also handled by the sysmon
routines). There is no grand scheme of things. It is just duplication.

- Jukka.


Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)

2011-05-07 Thread Jukka Ruohonen
On Fri, May 06, 2011 at 04:45:55PM +0100, Jean-Yves Migeon wrote:
 1 - I shall patch sysmon_pswitch_event and add a callback for sleep 
 that MD code can register,
 2 - or register a pmf(9) event handler during hypervisor attachment, 
 and just use pmf_event_inject() in the /* XXX */ sleep path that will 
 trigger this handler.

Either one is fine by me. Perhaps the latter approach sounds slightly better,
as it uses the already existing KPI and avoids patching the already convoluted
sysmon routines.
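
To picture the second option, here is a small userland mock of the
register-then-inject pattern. All names (ev_register(), ev_inject(),
md_suspend()) are hypothetical stand-ins and this is not the pmf(9) KPI;
it only shows that an injected event does nothing unless a handler has
been registered for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types; not the pmf(9) event list. */
enum ev_type { EV_SLEEP, EV_POWER, EV_NTYPES };

typedef void (*ev_handler_t)(void *);

struct ev_slot {
	ev_handler_t fn;
	void *arg;
};

/* One listener per event type keeps the mock small. */
static struct ev_slot ev_table[EV_NTYPES];

/* Register a handler for an event type; returns 0 on success, -1 if taken. */
static int
ev_register(enum ev_type t, ev_handler_t fn, void *arg)
{
	assert(fn != NULL);
	if ((unsigned int)t >= EV_NTYPES || ev_table[t].fn != NULL)
		return -1;
	ev_table[t].fn = fn;
	ev_table[t].arg = arg;
	return 0;
}

/* Inject an event; returns 1 if a listener consumed it, 0 otherwise. */
static int
ev_inject(enum ev_type t)
{
	if ((unsigned int)t >= EV_NTYPES || ev_table[t].fn == NULL)
		return 0;
	(*ev_table[t].fn)(ev_table[t].arg);
	return 1;
}

/* Stand-in for the MD suspend code; it only counts how often it ran. */
static void
md_suspend(void *arg)
{
	*(int *)arg += 1;
}
```

In this picture the attachment code would call ev_register(EV_SLEEP,
md_suspend, ...) once, and the sleep path would call ev_inject(EV_SLEEP)
without needing to know who, if anyone, is listening.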

- Jukka.


Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)

2011-05-06 Thread Jukka Ruohonen
On Fri, May 06, 2011 at 10:35:30AM +0100, Jean-Yves Migeon wrote:
 Yes. However, in the Xen domU case, it is quite unacceptable. Anyone 
 willing to suspend a domain would launch xm save from dom0. If 
 powerd(8) is not running, the xm save will wait ~forever for the domU 
 to signal it's ready for suspension. I'd like to have a shortcut that 
 handles the powerd is not running step, even if that means that 
 specific services have not been turned off cleanly via 
 scripts/sleep_button.

Speaking about normal x86 and other architectures, we should pick good
defaults but not tie things to the kernel. Formulating a one-and-true policy
or power-event state machine is not a goal that can even be reached. I want
my laptop to suspend when the lid is closed, but someone else may not.
It is more than natural that things like this are handled in user space.
As is done currently with powerd(8), it is also a good idea to shut down
other daemons before entering a suspended state.

 This situation also applies to power button too, but this case is 
 already handled [1]. Albeit, not sleep, hence the XXX I believe.

As I've written already, powerd(8) should be enabled by default in the stock
rc.conf(5). This is again something that should not require manual tuning.

 I respectfully disagree. The PSWITCH_TYPE_LID event is first handled by 
 sysmon(9), then injected in pmf(9). See [2].

 [...]

 The sysmon_pswitch_register(9) is indeed a NOP (it is supposedly there to
 account for some possible future use).  But sysmon_pswitch_event() is not a
 NOP.  It does not inject anything to pmf(9).
 
 It does. See [2].

Ah, right. But of course you should follow what those injections actually do
and where the listeners are. The main function in sysmon_power.c is:

	if (sysmon_power_daemon != NULL) {
		/*
		 * Create a new dictionary for the event.
		 */
		ped = kmem_zalloc(sizeof(*ped), KM_NOSLEEP);
		if (!ped)
			return;
		ped->dict = prop_dictionary_create();

		if (sysmon_power_daemon_task(ped, smpsw, event) == 0)
			return;
	}


- Jukka.

[1] For lack of a better reference see e.g.

http://lists.xensource.com/archives/html/xen-devel/2010-05/msg00115.html


Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)

2011-05-05 Thread Jukka Ruohonen
On Thu, May 05, 2011 at 05:56:43PM +0100, Jean-Yves Migeon wrote:
 i am experiencing some difficulties regarding the somewhat duplicity of
 functionality provided by sysmon_*(9) and pmf(9) APIs, for everything that
 has to deal with power management event.

The duplication is a known and unfortunate issue. Many drivers also suffer
from this. My personal opinion is that we should either rework and clean up
sysmon's power-related KPI or slowly deprecate it. But, still, pmf(9) cannot
do the job alone (at least currently).

 Disclaimer: this is for suspend/save events, whatever you name them; each
 implementation has its own way of specifying them: Xen domU assume that
 sleep/suspension is a serialization of VM memory state to a disk file,
 while ACPI have different expectations depending on level (suspend to RAM,
 suspend to disk, states, etc.)

So you take the stance that there will never be normal (APM/ACPI/XXX)
suspend states in Xen? I think Linux supports this already. Thus, generally,
any KPI should handle multiple backends with maybe slightly diverging
conceptual definitions.

 Currently, we have two frameworks: pmf(9) and the different sysmon_(9)
 routines.  As I see them, pmf(9) is fairly lower level, and covers only
 device attach/detach/suspension (and inter driver signaling).  sysmon_*(9)
 are userlevel oriented, and certain events can even be managed by
 userland through powerd(8) (please confirm about these goals/non goals).

This is quite an adequate description. Note that it is still desirable to have
some (but not necessarily all) events delivered to user space. This is the
main task that is currently handled by the sysmon-routines + powerd(8).

 Except for specific situation, high level events (LID open/close, power
 button press) are first handled via sysmon, then injected to drivers via
 pmf.

In most cases it is either, not both.

 Would the sysmon_power backends be a long term replacement for the
 various shutdown/reboot/sleep/power control (power-on scheduling, sleep
 states) hooks, or should it be just regarded as the registration of a
 sleep handler, and nothing more?

As said, the first approach requires a major cleanup and rationalization of
the sysmon_power backend. The second approach may sound reasonable as an
intermediate or a temporary solution for the immediate requirements of 6.0.
That is, I think no one expects you to write a full-blown KPI for this --
a task that is quite non-trivial, as is manifested by the current duplication.

 I am also having a hard time figuring out the difference between the goals
 of sysmon_pswitch_register(9) and pmf_device_register(9).  Both are
 supposed to handle power events, but sysmon_pswitch_register(9) is now a
 NO-OP, with everything directly injected into pmf(9).

The sysmon_pswitch_register(9) is indeed a NOP (it is supposedly there to
account for some possible future use). But sysmon_pswitch_event() is not a NOP.
It does not inject anything to pmf(9).

 BTW, would the handler be supposed to be called only when powerd(8) is
 running (with the sleep_button script execing zzz(8)), or could it be used
 when it is not, including situation where there's no real thread context
 (on interrupts)?

Do not confuse the sleep_button script with the issue at hand. As the
name indicates, it delivers events from buttons that are physically present
on a computer. I think there should be no requirements for this to work in
interrupt context (if there are, the drivers should do something about it).

- Jukka.


Re: kernel bitreverse function

2011-04-03 Thread Jukka Ruohonen
On Sun, Apr 03, 2011 at 05:09:55PM +0200, Frank Wille wrote:
 Did somebody already try to implement it? If not, I would suggest the
 following code for src/sys/lib/libkern:

 [...]
 
 Any comments? Then please speak now. :)

Just a footnote: wouldn't sys/bitops.h be a better place logically?

- Jukka.


Re: kernel bitreverse function

2011-04-03 Thread Jukka Ruohonen
On Sun, Apr 03, 2011 at 06:12:03PM +0200, Frank Wille wrote:
 Don't know about others, but my goal was to eliminate double code from
 the kernel. The use of the new functions should also be restricted
 to the kernel.

While I have no real opinion for or against, I can certainly imagine finding
use for a well-defined bit function like this also in user space.
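
For illustration only, a minimal sketch of what a 32-bit variant of such a
function might look like. The name bitrev32_sketch() is hypothetical and
this is not the code proposed earlier in the thread; it is the common
swap-halves approach that exchanges neighbouring bits, then pairs, nibbles,
bytes and finally the two 16-bit halves.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: reverse the bit order of a 32-bit word, so that
 * bit 0 becomes bit 31, bit 1 becomes bit 30, and so on.
 */
static inline uint32_t
bitrev32_sketch(uint32_t x)
{
	/* Swap adjacent bits, then 2-bit pairs, nibbles and bytes. */
	x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
	x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
	x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
	x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);

	/* Finally swap the two 16-bit halves. */
	return (x >> 16) | (x << 16);
}
```

Nothing in it depends on the kernel, which is the reason a shared header
could serve both sides.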

- Jukka.


Re: sysmon_pswitch_event(): provide a sleep routine when powerd(8) is not running

2011-03-28 Thread Jukka Ruohonen
On Mon, Mar 28, 2011 at 01:33:45PM +0100, Jean-Yves Migeon wrote:
 1 - modify sysmon_pswitch_event prototype so it can return an error 
 (therefore leaving the possibility for the caller to fix the event by 
 itself), OR
 2 - add a MD system_suspend() routine, define it to NULL by default, 
 and which can be overridden by MD should there be a need to call the 
 suspend code without going through powerd(8) via sysmon_pswitch_event(), 
 OR
 3 - alternatively, add a RB_SLEEP flags to cpu_reboot(), which will 
 basically do the same as the above, except that we could reuse part of 
 the cpu_reboot function.

I would go for (3), perhaps with a -s flag to halt(8). This would also solve
the user interface issue that remains unresolved in options (1) and (2).
Extending halt(8) has also been discussed previously (cf. e.g. [1]).

- Jukka.

[1] http://www.netbsd.org/contrib/projects.html#shutdowntime


Re: high sys time, very very slow builds on new 24-core system

2011-03-23 Thread Jukka Ruohonen
On Wed, Mar 23, 2011 at 05:24:12PM -0400, Thor Lancelot Simon wrote:
 All cores spend well over 50% time in 'sys', even when all or almost
 all are running cc1 processes.  The kernel is amd64 -current GENERIC
 from about 1 week ago -- no DIAGNOSTIC, DEBUG, KMEMSTATS, LOCKDEBUG,
 etc.
 
 Does anyone have any idea what might be wrong here?

Another shot in the dark: AMD's so-called C1E is known to cause issues
like this (in which case you might want to enable acpicpu(4)).

- Jukka.


Re: BIOS/ACPI interrupt conflict

2011-02-09 Thread Jukka Ruohonen
On Wed, Feb 09, 2011 at 04:47:12PM -0800, Cliff Wright wrote:
 Bios is correct, and ACPI wrong, I have seen this on other
 machines. And as I said in the 2007 email, even if ACPI had
 been the correct one, it still was not going to setup the
 interrupt.

In this area, and with the current code base, it is very difficult to
say who is wrong... Note that in theory the PCI interrupt link devices
may contain different IRQ sets depending on whether PIC or I/O APIC is
used, but I don't know how well the current code handles this.

 It occurred to me that maybe a test for an apic needs to be
 done. In my case where I have no apic, then the BIOS data
 has to be accepted because nothing else sets up the interrupt.

Yes, if only 8259A PICs are used, probably no calls should be even
made to mess with the (ACPI) PCI interrupt link devices.

I am slowly working on an entirely new implementation, so while the
patch looks reasonable enough, I think it might be best to generally
leave the current regression-prone code intact.

- Jukka.


Re: BIOS/ACPI interrupt conflict

2011-02-09 Thread Jukka Ruohonen
On Wed, Feb 09, 2011 at 10:54:08PM -0800, Brian Buhrow wrote:
   I note that at the time, I received strong objections to my patch on
 the grounds that it didn't account for bioses which didn't setup the
 interrupts and reported that they had.  That's true, but in my patch, you
 had to build a custom kernel and add the option ACPI_BELIEVE_BIOS to turn it
 on.

In general, and in my opinion, we definitely do not want such tunable
options, especially for something as essential as this. I have already
cleaned most of these options from the acpi(4) stack, and in the long run
the remaining ones should be removed as well.

- Jukka.


Re: Capsicum: practical capabilities for UNIX

2010-10-25 Thread Jukka Ruohonen
On Mon, Oct 25, 2010 at 07:28:56PM -0500, David Young wrote:
 The chief difference I see between a process limited by Capsicum and
 a process limited by Systrace is that the Capsicum-limited process
 has only the privileges that the parent process grants it, while the
 Systrace-limited process has a system-call firewall applied.  It's
 easier with the Capsicum-limited process than with the Systrace-limited
 process to reason about what the process can do, and to adjust the
 process privileges, because it's easier to name and count capabilities
 than to read, interpret, and re-write systrace rules.

Does this mean that every program that wants to use Capsicum needs to be
patched to use Capsicum? This is the main problem I have with MACs and
related frameworks; to gain full advantage from these, you need the
resources of Red Hat. Are we going to patch third-party software to use
Capsicum? Who knows what should be allowed or disallowed in a monster like
Firefox? Apache? X.org? Bind? Who would be maintaining these patches?

- Jukka.


Re: acpivga(4) v. MI display controls

2010-10-20 Thread Jukka Ruohonen
On Sat, Oct 16, 2010 at 05:45:51PM -0500, David Young wrote:
  Another thing is the actual device tree. For instance, currently, even with
  the fine work done with pmf(9), in some corner cases we may power off a
  device before its children are turned off because the device tree is
  partially arbitrary.
 
 What devices do you have in mind?

The canonical example is perhaps the LPC bridge. This is also the case
brought up by Quentin in an earlier revision of this discussion. The
following takes a very specific point of view to demonstrate the issue.

Now raise the abstraction so that we do not talk about any specific chip.
The so-called power resource, if it exists, is shared by all devices under
the bridge. The concept of a power resource itself may be just a bad
abstraction used in the ACPI code, but there are no guarantees that
manipulating it won't turn off the chip (or stop processing in the chip or
whatever this may
mean). (Actually, I have seen several systems where turning power resources
on/off actually turns hardware on/off.)

The power resource code implements several sanity checks, namely (a) a
parent can not be turned on/off if its children are not on/off and (b)
reference counting prevents turning anything off if something else is using
the power resource.  Neither (a) nor (b) really works in NetBSD due to the
reasons mentioned. Because the ACPI tree is not synchronized with the real tree,
none of the devices under the bridge claim the power resource when they
attach. But the real trick is that the firmware may turn a power resource
off for instance when we enter a sleep state. Upon resume, we need to turn
it back on, but we can not do it blindly. Another question is whether we
have sufficient abstractions for device power state in the real tree.
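
To make (b) and the resume problem concrete, here is a hedged sketch with
hypothetical names (struct power_res and the power_res_*() functions are
not the acpi(4) code). Devices claim the resource when they attach, only
the last release turns it off, and on resume the observed state is
reconciled with the claims instead of being switched on blindly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical power resource shared by the devices under a bridge. */
struct power_res {
	unsigned int refcnt;	/* number of devices claiming the resource */
	bool on;		/* state as last known to the OS */
};

/* A device claims the resource at attach; the first claim turns it on. */
static void
power_res_ref(struct power_res *pr)
{
	assert(pr != NULL);
	if (pr->refcnt++ == 0)
		pr->on = true;
}

/* A device drops its claim; only the last release turns it off. */
static void
power_res_unref(struct power_res *pr)
{
	assert(pr != NULL && pr->refcnt > 0);
	if (--pr->refcnt == 0)
		pr->on = false;
}

/*
 * On resume the firmware may have switched the resource off behind
 * our back. Record the observed state, and turn the resource back on
 * only if someone still holds a claim. Returns true if it was turned on.
 */
static bool
power_res_resume(struct power_res *pr, bool observed_on)
{
	assert(pr != NULL);
	pr->on = observed_on;
	if (pr->refcnt > 0 && !pr->on) {
		pr->on = true;
		return true;
	}
	return false;
}
```

If the devices under the bridge never call the claim function when they
attach, the count stays at zero and nothing in this scheme protects them,
which is the situation described above.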

For example, most of the devices are incorrectly attached (to acpi0) here:

LPC   [06] [  ] (PCI) @ 0x00:0x00:0x1F:0x00 ichlpcib0
SIO   [06] [  ]
PIC   [06] [  ]
TIMR  [06] [  ] attimer1
HPET  [06] [  ] hpet0
DMAC  [06] [  ] 
SPKR  [06] [  ] pcppi1
FPU   [06] [  ] npx1
RTC   [06] [  ]
KBD   [06] [  ] pckbc1
MOU   [06] [  ] pckbc2
DURT  [06] [ W]
DLPT  [06] [  ]
DECP  [06] [  ]
FIR   [06] [  ]
TPM   [06] [  ]
EC    [06] [  ] acpiec0
PUBS  [11] [  ]
BAT0  [06] [  ] acpibat0
BAT1  [06] [  ]
BAT2  [06] [  ]
AC    [06] [  ] acpiacad0
HKEY  [06] [  ] thinkpad0

The above example also reveals the devices (in this machine) that reference
the ACPI embedded controller's operation regions. Thus, the three children
should be attached under acpiec(4), or more conservatively, these should at
least never be attached before acpiec(4).

Hope the above made some sense,

Jukka.


Re: acpivga(4) v. MI display controls

2010-10-16 Thread Jukka Ruohonen
On Fri, Oct 15, 2010 at 07:53:53PM -0500, David Young wrote:
   OK, what this code is doing is essentially attach a device to the acpi
  tree that really refers to a PCI device. Can we please get this to
  attach as child of vga0 by checking for a device matching the PCI
  address of vga0, that also provides _DOD and _DOS. This would prevent
  accessing vga0 on resume before it has been reset.

 Joerg calls attention in that last sentence to the possibility of
 defects in suspend/resume that arise when a device is represented twice
 in the device tree.  Sounds familiar. :-)

The above scheme is easily achieved if we start dropping #ifdefs to the
device tree. (Hopefully everyone can agree that this is out of the
question.) As I wrote, if we start to implement hacks specific to one
acpi(4) driver, we end up with a big mess. It is much better to have the
whole acpi(4) uniformly at 'acpinodebus' even with the risks involved, so
that once we have a solution, everything can be transformed in a single
sweep.

You do realize that our suspend/resume paths are full of defects due to the
reasons I outlined? For instance, because drivers do not inform the firmware
upon suspend(), we have several cases where devices resume in a powered-off
state (cf. PR #37891). Complaining about a single driver prevents one from
seeing the forest.

 ISTM that more than one developer can, and has, described in a broad
 outline how it should be done.  For example, I can outline how
 device_register() can be used to put ACPI information into MI device
 properties for device-attachment hooks to read back out.  I'm happy to
 give more detailed suggestions, too. 

I think everyone groks this. Opening up an editor and doing the work is
another thing. I emphasize that this is not entirely about autoconfiguration.

 I'm not sure I understand what you mean by the 'natural' device tree.
 I think you may have drawn a line between virtual and real device
 hierarchies and assigned ACPI to a different category than I would.
 Again, I'm not sure I've taken your meaning right.

By natural I refer to the discussion on this list about (semi-random)
thoughts on device tree structure (and the several inconsistencies in it).
See appendix.

 It's just occurred to me that it may help to form a group to discuss
 how BIOS information should be encoded and conveyed from MD code to MI
 drivers in NetBSD.  By setting standards, we may help developers on
 every port leverage others' knowledge and work.  What do you think?

Sounds good, albeit talk tends to be cheap.

I take the above quote to clear up some misunderstandings:

(a) This is not just about passing something from MD to MI -- it goes in
the other direction also.

(b) This is not only about passing information, but applies to
controls (callbacks, etc.) also.

(c) This is not only about autoconfiguration, but (a) and (b) are
present dynamically at runtime. When a driver writes to a
register, it may need to inform the firmware. When the firmware
writes to a register, it may need to inform the driver.

- Jukka.

Appendix: the natural device tree on a ThinkPad.

\ [06] [  ] 
CPU0  [12] [  ] 
CPU1  [12] [  ] 
_SB   [06] [  ] 
LNKA  [06] [  ] 
LNKB  [06] [  ] 
LNKC  [06] [  ] 
LNKD  [06] [  ] 
LNKE  [06] [  ] 
LNKF  [06] [  ] 
LNKG  [06] [  ] 
LNKH  [06] [  ] 
MEM   [06] [  ] 
LID   [06] [ W] acpilid0 
SLPB  [06] [ W] acpibut0 
PCI0  [06] [  ] (PCI) @ 0x00:0x00:0x00:0x00 [R] [B] - 0x00 pchb0
LPC   [06] [  ] (PCI) @ 0x00:0x00:0x1F:0x00 ichlpcib0
SIO   [06] [  ] 
PIC   [06] [  ] 
TIMR  [06] [  ] attimer1 
HPET  [06] [  ] hpet0 
DMAC  [06] [  ] 
SPKR  [06] [  ] pcppi1 
FPU   [06] [  ] npx1 
RTC   [06] [  ] 
KBD   [06] [  ] pckbc1 
MOU   [06] [  ] pckbc2 
DURT  [06] [ W] 
DLPT  [06] [  ] 
DECP  [06] [  ] 
FIR   [06] [  ] 
TPM   [06] [  ] 
EC    [06] [  ] acpiec0
PUBS  [11] [  ] 
BAT0  [06] [  ] acpibat0 
BAT1  [06] [  ] 
BAT2  [06] [  ] 
AC    [06] [  ] acpiacad0
HKEY  [06] [  ] thinkpad0 
VID   [06] [  ] (PCI) @ 0x00:0x00:0x02:0x00 vga1
LCD0  [06] [  ] 
CRT0  [06] [  ] 
AGP   [06] [  ] (PCI) @ 0x00:0x00:0x01:0x00 
VID   [06] [  ] 
LCD0  [06] [  ] 
CRT0  [06] [  ] 
EXP0  [06] [ W] (PCI) @ 0x00:0x00:0x1C:0x00 [B] - 0x01 ppb0
EXP1  [06] [ W] (PCI) @ 0x00:0x00:0x1C:0x01 [B] - 0x02 ppb1
EXP2  [06] [ W] (PCI) @ 

Re: acpivga(4) v. MI display controls

2010-10-16 Thread Jukka Ruohonen
On Fri, Oct 15, 2010 at 08:29:57AM -0400, der Mouse wrote:
 ACPI may be the source of the information, but that doesn't mean it has
 to be how the autoconf tree is constructed.
 
 Compare and contrast with how NetBSD/sparc uses the OF (or is it OBP?
 I'm not sure) device tree to drive autoconf, but doesn't have a device
 node corresponding to OF that everything attaches under; it just uses
 the OF tree as the source of the data about what exists where.  (Well,
 much of it; autoconf doesn't totally mirror OF, eg, in SCSI device
 attachment.)

I do not know OF well, but my impression is that it is much, much less
invasive than what we have nowadays on x86 where close interaction between
the firmware and drivers is expected.

Several people seem to be under the false impression that this is something
only related to device attachment and autoconfiguration. It is not.

I tried to outline this in another mail, but frankly I think whether 'X
attaches to Y or Z' is just a little, largely irrelevant, detail in the face
of much bigger problems. In a nutshell: ACPI BIOS may access hardware directly,
with or without consent from the system. In an entirely x86 based codebase
this is hardly a problem, but in NetBSD this comes down to the question of
how to maintain the clean MD/MI separation in the future.

- Jukka.


Re: acpivga(4) v. MI display controls

2010-10-15 Thread Jukka Ruohonen
On Fri, Oct 15, 2010 at 08:26:34AM +0300, Jukka Ruohonen wrote:
 The task is not trivial. On modern x86, practically *everything* that
 attaches has an ACPI counterpart. In a way we are thinking this backwards:
 the attachment should perhaps be done via ACPI that has information about
 the natural device tree (I recommend to boot with ACPIVERBOSE option and
 observe the output). This is how it is supposedly done in Windows. And
 consequently, *most* (MI) drivers that work on x86 need to eventually call
 (MD) ACPI callbacks, and vice versa. Bringing this all together in a clean
 (MI) implementation is hard and requires substantial changes, to say the
 least.

As an addition, due to the reasons stated above, I object to anything that
tries to make a case for a single driver from acpi(4) -- be it acpivga(4),
acpicpu(4), or the ISA and PCI cases discussed previously. This should be
solved once and for all, for all acpi(4) and for all pci(4), isa(4), ...
Otherwise we end up with a god-awful mess.

If such a solution comes into existence, we are happy to refactor acpi(4).
During the ten years that ACPI has been in NetBSD, several people have tried
a solution without much success. I have personally tried twice, and failed
already at the self-criticism stage.

- Jukka.


Re: acpivga(4) v. MI display controls

2010-10-15 Thread Jukka Ruohonen
On Fri, Oct 15, 2010 at 10:10:18AM +0200, Martin Husemann wrote:
 On Fri, Oct 15, 2010 at 08:26:34AM +0300, Jukka Ruohonen wrote:
  This was discussed during the development process.
 
 Where?

Already when this was first presented in 2008:

http://mail-index.netbsd.org/tech-kern/2008/12/05/msg003744.html

The issues noted back then are still present.

- Jukka.


Re: acpivga(4) v. MI display controls

2010-10-14 Thread Jukka Ruohonen
On Thu, Oct 14, 2010 at 06:50:30PM -0500, David Young wrote:
 Rather than attaching new nodes at acpi0, the system should let ACPI
 BIOS inform the autoconfiguration process, which should attach one or
 more instances of a new, MI device, display(4).  For example:
 
 vga0 at pci0 device ... function ...
 display0 at vga0: Ext. Monitor, head 0, bios detect (ACPI CRT1)
 display1 at vga0: TV, head 0, bios detect (ACPI DTV1)
 display2 at vga0: Unknown Output Device, head 0, bios detect (ACPI LCD)

 In this way, no single device has two representations in the device tree
 (think about the consequences, they're not pretty), and every device
 appears in the most appropriate place in the device tree for the purpose
 of suspending, resuming, detaching and re-attaching it.

This was discussed during the development process. Sure, the above is the
ideal case. Yet once again I need to remind everyone that we cannot hold
back important acpi(4) work because the perfect abstraction has not arrived,
and no one seems to really know how it should be done.

The task is not trivial. On modern x86, practically *everything* that
attaches has an ACPI counterpart. In a way we are thinking this backwards:
the attachment should perhaps be done via ACPI that has information about
the natural device tree (I recommend to boot with ACPIVERBOSE option and
observe the output). This is how it is supposedly done in Windows. And
consequently, *most* (MI) drivers that work on x86 need to eventually call
(MD) ACPI callbacks, and vice versa. Bringing this all together in a clean
(MI) implementation is hard and requires substantial changes, to say the
least.

- Jukka.


Re: Capsicum: practical capabilities for UNIX

2010-09-26 Thread Jukka Ruohonen
On Sun, Sep 26, 2010 at 08:48:45PM -0400, Perry E. Metzger wrote:
 They did Chrome in the paper, and it required very few lines of code
 (under 100). They did other tests too. It appears that they've had
 quite a bit of success in creating a very usable API here. I'm not
 entirely surprised, given the nature of what they're doing.

Just a little historical remark.

I am a little puzzled why Watson et al. did not bother to mention Linux
capabilities, which have existed for a long time. The Linux API is almost
identical to the one proposed in the Capsicum paper. And yet, Linux
capabilities are seldom used.

Perhaps the general perception is that these capabilities were somehow
sidetracked from the very beginning. One probable cause is that the
vendor-independent committee that started the whole thing was unable to
provide something that could have become an actual standard across UNIX
platforms and their derivatives.  The result was only a draft POSIX
document, IEEE 1003.1e, released in 1997, which is considered a failure by
many.

Maybe there is something to learn from here.

- Jukka.


Re: 5.1_RC3 on Dell r610 fails

2010-08-30 Thread Jukka Ruohonen
On Tue, Aug 31, 2010 at 04:06:16PM +1200, Mark Davies wrote:
 Any suggestions on whats broke

This is again the so-called Enhanced SpeedStep (EST).

 how to fix?

Disable options(4) ENHANCED_SPEEDSTEP.

- Jukka.


Re: RFC: device flavours

2010-07-25 Thread Jukka Ruohonen
On Sun, Jul 25, 2010 at 09:22:53PM +, Quentin Garnier wrote:
 bridges (mostly on x86).  An even older idea of mine is to finally see
 legacy devices listed in the ACPI tables attached to the PCI-ISA bridge
 where they logically belong, and device flavours can be used for that,
 too.

I am not sure if I understand all of this, so bear with me.

While this is the direction we should go, it seems to me that the
long-standing issues in ACPI-PCI-ISA are not so much about where the legacy
drivers logically attach, but that these, like the majority of drivers on
modern x86, should utilize the information from ACPI.

Is this possible with flavours? Will the siblings still require a stub on
the ACPI side of things?

 pcib0 at pci0 dev 31 function 0: vendor 0x8086 product 0x27b9 (rev. 0x02)
 timecounter: Timecounter pcib0/ichlpc frequency 3579545 Hz quality 1000
 pcib0/ichlpc: 24-bit timer
 pcib0/ichlpc: TCO (watchdog) timer configured.
 gpio5 at pcib0: 64 pins
 pcib0/acpiib: ACPI node SBRG
 npx1 at pcib0 (COPR, PNP0C04): io 0xf0-0xff irq 13
 npx1: reported by CPUID; using exception 16
 SIOR (PNP0C02) at pcib0 not configured
 RMSC (PNP0C02) at pcib0 not configured
 OMSC (PNP0C02) at pcib0 not configured

In the above example it is known that the LPC bridge currently conflicts
with the ACPI PM registers.  So, to take this to its logical end, the
derivation using ACPI should start from there, and the pci_mapreg_map(9)
call therein should use the information supplied by ACPI.

 There are other situations in which I think device flavours could bring
 clarity and also better modularisation.  For instance, support for CPU
 features on x86 like EST or PowerNow, or even ACPI P-states could be
 done that way, and it is more module-friendly because it wouldn't
 require the main CPU driver to explicitely call those feature-drivers.

Here I can see a use. I was actually seeking this kind of granularity with
the ACPI CPU.

- Jukka.


Re: Modules loading modules?

2010-07-25 Thread Jukka Ruohonen
On Mon, Jul 26, 2010 at 06:41:11AM +1000, matthew green wrote:
 it seems to me the root problem is that module_mutex is held while
 calling into the module startup routines.

Here is one related question: is it ensured that the module lock is dropped
immediately after a modular device driver returns from its attachment
routine?  I am thinking of a case where a modular driver defers its
configuration by using config_interrupts(9) or config_finalize_register(9).

- Jukka.


Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-08 Thread Jukka Ruohonen
On Mon, Mar 08, 2010 at 10:54:13AM -0500, der Mouse wrote:
  Linux had a devfs and [dropped] it.  Now it has udevd(8).  Most
  likely the penguins had a reason for this.
 
 Surely there are mailing list messages or something that outline that
 reason?  (Not that I have any idea where they'd be, but don't we have
 at least a few people with feet in both camps?)

It is more like:

Linux had a devfs and [dropped] it. Now it has udevd(8). Most likely the
penguins had a reason for this. Linux had udevd(8) and reintroduced devfs.
Now it has udevd(8) and some kind of devfs. Most likely the penguins had a
reason for this.

- Jukka.

http://lwn.net/Articles/331818/


Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-07 Thread Jukka Ruohonen
On Sun, Mar 07, 2010 at 08:18:15PM +, Quentin Garnier wrote:
  As an example: one thing that holds back the ACPI CPU code I am working on
  is that I need to be sure that e.g. cpu3 that attaches to acpi0 is the same
  cpu3 that has attached to mainbus0. So:
 
 Well, the answer to that is simple:  there should only be one device.
 Anything design that doesn't produce that result can go to thrown out
 the window without further delay.

In the above example it would be acpicpu3 at acpi0 and cpu3 at mainbus0.

But as you know quite well what is involved, I am merely pointing out that
the current situation holds back many possibilities. And noting that I don't
have the competency to do anything about it.

- Jukka.


Re: CVS commit: src/sys/arch

2010-02-07 Thread Jukka Ruohonen
On Sat, Feb 06, 2010 at 01:07:08PM -0800, Paul Goyette wrote:
 If it matches a device, and there is also a native driver for the 
 underlying i2c controller, then there'll be two devices accessing the 
 same bus.  Bad things (tm) will happen.  This is noted in the BUGS 
 section of the acpismbus(4) man page.

On a related note, a similar warning should probably be added to aiboost(4).
At least on Linux it is known to cause weird problems and lockups if the
iic(4) is being accessed at the same time by a native driver (it87?).

It is also a reasonable assumption that things will get worse on this front.
The new ACPI 4.0 standard introduced a sensor framework of its own, and my
guess is that consumer PC manufacturers will jump on the bandwagon, trying
to hide these things in the abyss of ACPI.

- Jukka.


Re: regression (crash) in sysmon/acpiacad

2010-02-06 Thread Jukka Ruohonen
On Sun, Feb 07, 2010 at 08:30:27AM +0100, Joerg Sonnenberger wrote:
 On Sun, Feb 07, 2010 at 09:04:54AM +0200, Jukka Ruohonen wrote:
  * The following sensors should be removed: technology,
low capacity, and warning capacity. These are not really
something that should be sensed.
 
 Technology ok. I'm not too sure about low and warning, given that they
 normally can't be modified.

The idea here would be to use sme_get_limits() and possibly
sme_set_limits(). This is exactly the rationale behind those callbacks.

This would also result in nicer output in envstat(8).

 
  * The design capacity should be the maximum of the last known full
charge capacity, which is the maximum of the present capacity. 
This is useful for checking the overall health of deteriorating
(lithium-ion) batteries.
 
 I disagree. Both batteries for my laptop had initially a higher capacity
 than designed for -- e.g. last full and design cap don't necessarily
 agree with each other.

I noticed the same thing with voltages. Yet, what is wrong with envstat(8)
or some other tool reporting that the last full charge capacity is 123 % of
the design capacity?

  * Sensors that have a maximum should report also percentages in
relation to these maximums. From the usability point of view, this
is probably almost always the right choice.
 
 That should be a task for userland, not the kernel.

It already is; in acpibat(4) this just implies setting the ENVSYS_FPERCENT
flag, nothing more.
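
To make the arithmetic concrete, here is a rough user-space sketch of such a
percentage. It is not the envsys(4) or acpibat(4) code: percent_of() is a
made-up name, and any figures fed to it are invented for the example. The
point is only that a result above 100 is a perfectly valid value to report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: percentage of `value` relative to `max`, rounded
 * to the nearest integer. Results above 100 are allowed on purpose, so
 * that a last full charge capacity larger than the design capacity
 * shows up as such instead of being clamped.
 */
static unsigned int
percent_of(uint32_t value, uint32_t max)
{
	/* The caller is assumed to filter out unknown (zero) maximums. */
	assert(max != 0);

	/* Widen to 64 bits so that value * 100 cannot overflow. */
	return (unsigned int)(((uint64_t)value * 100 + max / 2) / max);
}
```

With an invented design capacity of 4500 and a last full charge capacity of
5535 (same unit), percent_of(5535, 4500) yields 123.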

- Jukka.


Re: regression (crash) in sysmon/acpiacad

2010-02-04 Thread Jukka Ruohonen
On Thu, Feb 04, 2010 at 10:15:03PM +0100, Matthias Drochner wrote:
 
 p...@whooppee.com said:
  Since the charge value was not updating, it  might be that the ACPI
  Notify isn't working here.

Since this involved running on battery power, I doubt it is about the
removal of the refresh routine in acpiacad(4). If the sensor value changes
when one plugs/unplugs the AC, it is easily verified to be working.

 For the critical shutdown, a call to _BTP might help.

The _BTP is just a custom warning trip-point that triggers a Notify once
reached. It is probably there to provide user space applications some
control over the limits, and to possibly avoid polling of the values.

Note though that nothing has changed in acpibat(4) with regard to the
refresh routine or the sensors generally.

 But anyway, from my limited experience with process control
 (SCADA) systems, it makes sense to maintain a timestamp
 for the last data value read (or delivered by asynchronous
 notification) and force a fresh read if it is older than
 a limit defined by the provider (and possibly overridden
 by the consumer).

Something like that is already done in acpibat(4).
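
For illustration, the timestamp scheme described in the quoted paragraph
could look roughly like the following user-space sketch. This is not the
acpibat(4) code; the structure, the function names and the demo_read()
stand-in for the hardware read are all made up for the example, and time is
simply passed in as seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached reading; these are not the acpibat(4) structures. */
struct cached_reading {
	int64_t value;   /* last value read or delivered by a notification */
	int64_t stamp;   /* time of that value, in seconds */
	int64_t max_age; /* limit defined by the provider, in seconds */
	bool    valid;   /* false until a first value has been stored */
};

/* Stand-in for the expensive hardware read; returns 100, 101, ... */
static int64_t demo_reads;

static int64_t
demo_read(void)
{
	return 100 + demo_reads++;
}

/* Asynchronous path: a notification delivers a fresh value. */
static void
reading_notify(struct cached_reading *c, int64_t now, int64_t value)
{
	assert(c != NULL);
	c->value = value;
	c->stamp = now;
	c->valid = true;
}

/*
 * Return the cached value, forcing a fresh read if it is older than the
 * limit. A positive consumer_max_age overrides the provider's limit.
 */
static int64_t
reading_get(struct cached_reading *c, int64_t now, int64_t consumer_max_age,
    int64_t (*fresh)(void))
{
	int64_t limit;

	assert(c != NULL && fresh != NULL);
	limit = consumer_max_age > 0 ? consumer_max_age : c->max_age;
	if (!c->valid || now - c->stamp > limit)
		reading_notify(c, now, fresh());
	return c->value;
}
```

Both paths update the same timestamp, so a value delivered by a
notification postpones the next forced read just like a polled one does.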

- Jukka.