Re: kcpuset(9) questions
On Fri, Feb 01, 2013 at 06:25:24PM -0600, David Young wrote: There was no use case, when I added it. Can you describe your use case? Usually we iterate all CPUs with CPU_INFO_FOREACH() anyway (which should also be replaced with a MI interface, but that requires non-trivial invasion into all ports). Another use case is iterating the processor sets for per-CPU-group power states. - Jukka.
Re: lua(4), non-invasive and invasive parts
On Fri, Dec 28, 2012 at 10:05:36AM +0100, Marc Balmer wrote: If, however, existing software is to use Lua as part of its implementation, it needs to be made lua(4) aware, because it is going to use the lua(4) API. If the existing software is a kernel module, it needs to record a dependency on the lua(4) kernel module. Why is this a problem? Most kernel modules have dependencies. The source code of software using lua(4) needs to be modified, which is why I called this scenario invasive. The lua(4) aware parts can be put into #ifdef LUA/#endif sections if a LUA configuration option is being used. This sounds like a moot point since there is no lua(4) yet nor any software using it... - Jukka.
Re: lua(4), non-invasive and invasive parts
On Fri, Dec 28, 2012 at 10:05:36AM +0100, Marc Balmer wrote: Using a kernel module is not possible in all cases.

By a closer look, you have:

+#ifdef LUA
+MODULE(MODULE_CLASS_DRIVER, gpiosim, "gpio,lua");
+#else
 MODULE(MODULE_CLASS_DRIVER, gpiosim, "gpio");
-
+#endif

 #ifdef _MODULE

What does this mean? Also the kernel modules using lua(4) will be conditionally compiled? I think this is fairly strongly against the design principles of module(7).

- Jukka.
Re: Path to kernel modules (second attempt)
On Sat, Jul 07, 2012 at 08:57:10PM +0100, Mindaugas Rasiukevicius wrote: Regarding the PR/38724, I propose to change the path to /kernel/. Can we reach some consensus quickly for netbsd-6? I'd vote for /lib/modules noted in the PR (or maybe under /libdata?) simply because in my opinion the root hierarchy has already been abused too much in NetBSD. On the other hand, I don't see anything wrong with /stand either. Two cents, - Jukka.
Re: link-sets in modules
On Tue, May 29, 2012 at 03:00:58PM -0700, Paul Goyette wrote: Well, at least for sysctl's SYSCTL_SETUP() stuff, you probably don't want to use the same initialization call for modules as is used for built-ins. The built-ins are initialized with an explicit NULL argument passed for the sysctl_clog argument, which makes it difficult for a module to do its clean-up. Modular code needs to (or at least, should?) pass a non-null module-specific clog so it can be used during an undo at MODULE_CMD_UNLOAD time. Indeed I consider it a bug if a module uses SYSCTL_SETUP() (and does not tear down the nodes after unload). This applies more generally to drivers too. - Jukka.
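The clean-up pattern discussed above can be modelled outside the kernel. The following is a minimal userland sketch, not kernel code: `struct node_log`, `log_create_node()` and `log_teardown()` are hypothetical names that only imitate the role of a module-specific clog, where every created node is recorded so that all of them can be undone at unload time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userland model; none of these names are kernel APIs. */
struct node_log {
	struct node_log *next;	/* previously created node, if any */
	int node_id;		/* identifier of the created node */
};

static int live_nodes;		/* number of nodes currently registered */

/* Create a node and remember it in the caller's private log. */
static int
log_create_node(struct node_log **logp, int node_id)
{
	struct node_log *e;

	assert(logp != NULL);
	if ((e = malloc(sizeof(*e))) == NULL)
		return -1;
	e->node_id = node_id;
	e->next = *logp;
	*logp = e;
	live_nodes++;
	return 0;
}

/* Undo everything recorded in the log, newest entry first. */
static void
log_teardown(struct node_log **logp)
{
	struct node_log *e;

	assert(logp != NULL);
	while ((e = *logp) != NULL) {
		*logp = e->next;
		free(e);
		live_nodes--;
	}
}
```

A module would call the teardown routine from its unload path. Code that never keeps such a log, as with the NULL argument used for built-ins, has nothing to walk when it is asked to undo its nodes.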
Re: introduce device_is_attached()
On Mon, Apr 16, 2012 at 07:49:28PM +0100, Iain Hibbert wrote: I'm kind of with David Young, surely this is what the softc is for.. so that the parent can keep track of which resources it has allocated (and by inference, not reallocate them to another device) I agree with this too; numerous drivers/frameworks use the above scheme. If you add this function, please refactor these to follow this new (superfluous) idiom. - Jukka.
Re: CVS commit: src/tests/modules
On Mon, Mar 26, 2012 at 12:10:30AM -0700, Matt Thomas wrote: doesn't modctl/modload return some error which indicate the reason of failure? EPERM which isn't really useful. Oddly enough, it actually fails with EPERM on Sparc, but with ENOSYS on Xen. Manuel pointed out that it might be kobj_load_vfs(), kobj_load_mem(), or kobj_stat() that returns ENOSYS. - Jukka.
Re: CVS commit: src
On Wed, Mar 14, 2012 at 09:55:21AM +, Martin Husemann wrote: This seems to cause deadlocks in the *fs_rename_dir tests. Also the page residency check written by thorpej years ago now fails for the first time. - Jukka.
Re: A simple cpufreq(9)
On Fri, Sep 30, 2011 at 11:27:46AM -0500, David Young wrote: I don't think that the division of responsibility for power management between kernel and userland is obvious.

It may not be, but the arguments against kernel-level implementation are largely practical. There is no one size fits all. The kernel can not decide when the screen is too dark to read or what the battery level should be. In short: I think there are too many variables to do it in the kernel. I think almost a consensus was reached about making an rc.conf(5)-like configuration file for powerd(8):

http://mail-index.netbsd.org/tech-userlevel/2011/05/06/msg005009.html

Some imaginary examples:

suspend_lid=YES                 # Suspend when the lid is closed
suspend_button=YES              # Activate suspend-button
battery_backlight=90            # Backlight (%) when AC is off
battery_stop_daemons=bluetooth  # Daemons to stop when AC is off

None of these really belong to the kernel.

How do you hope for cpufreq(9) to be used?

I have a patch ready that transforms the existing MD implementations to use it, so at first iteration, it will provide a consistent user interface via cpuctl(8). Later, different governors can be implemented.

While reading the API and discussion, it occurred to me that if cpufreq(9) is chiefly used for making power/performance trade-offs, maybe the API should be concerned with the goal (power savings) instead of an independent variable (frequency).

I think it is more about utilizing increase() and decrease() functions in well chosen sections of kernel code. Or rather, pass them via a governor that does the selection whether it is plausible to frequency--.

Then maybe you can use one API---cpupm(9)?---to set the objective, and let the implementation choose the variables (C-state, P-state, frequency) to tweak.

Obviously each one may exist regardless of the other. For instance, currently in NetBSD only P-states are (or can be) used. There are no C-states in arch/macppc.

Generally, for practical reasons, I'd vote for a bottom-up approach here. If someone later realizes that all this fits to a perfect single API, I am all for it. But to read between the lines, I think you are approaching what could be called power management quality of service. Shameless Linux-plug again, but the slides are worth a look: http://elinux.org/images/f/f9/Elc2008_pm_qos_slides.pdf

- Jukka.

PS. P-states == frequency.
Re: A simple cpufreq(9)
On Thu, Sep 29, 2011 at 03:36:03PM -0500, David Young wrote: What's the difference in power savings between changing C-state and changing frequency? Do the power savings from every change in C-state dominate the savings from any change in frequency? Depends on the machine. But generally on x86, C-states now appear to be the dominant form. But these go side by side. To cut the corners short: the general (hardware) idea is that while a few CPUs are in a deep C-state (i.e. idle), a group of other CPUs can enter a high-performance P-state. The net result should be an increase in performance, despite the power management. But obviously for instance ARM may do this all differently, using only frequency scaling. It seems that ultimately we need an API for telling a power-savings goal and constraints (latency, throughput, battery life, the screen isn't too dark to read) for the system to meet. Do you hope for someone to build that into the kernel on top of cpufreq(9)? Not really; as I've written before, my opinion is that most of this should be in the user space. The CPU PM is an exception for obvious reasons. There could be a more involved API for cpu_idle(9) though (cf [1]). - Jukka. [1] The Linux cpuidle-subsystem; http://lwn.net/Articles/384146/
Re: A simple cpufreq(9)
On Mon, Sep 26, 2011 at 10:03:06AM -0500, David Young wrote: Instead, provide an API routine for finding out the number of states (nstates) and a routine for selecting a state [0, nstates - 1].

The code is ready and it is available in [1]. However, I can not complete it because when trying to upgrade, I encounter PR kern/45361. All existing drivers were converted, except ichlpcib(4) and piixpcib(4) (for these, I think first the SpeedStep should be split as a child device of the bridge). This breaks COMPAT_50 of cpuctl(8). How to handle that? #ifdefs?

- Jukka.

[1] ftp://ftp.NetBSD.org/pub/NetBSD/misc/jruoho/codeanddiff

* * *

CPUFREQ(9)             NetBSD Kernel Developer's Manual             CPUFREQ(9)

NAME
     cpufreq, cpufreq_register, cpufreq_deregister, cpufreq_suspend,
     cpufreq_resume, cpufreq_get, cpufreq_set, cpufreq_set_all -- interface
     for CPU frequency scaling

SYNOPSIS
     #include <sys/cpufreq.h>

     int
     cpufreq_register(struct cpufreq_if *cif);

     void
     cpufreq_deregister(void);

     void
     cpufreq_suspend(struct cpu_info *ci);

     void
     cpufreq_resume(struct cpu_info *ci);

     void
     cpufreq_get(struct cpu_info *ci, uint16_t *freq);

     int
     cpufreq_get_if(struct cpufreq_if *cif);

     void
     cpufreq_set(struct cpu_info *ci, uint16_t freq);

     void
     cpufreq_set_all(uint16_t freq);

DESCRIPTION
     The machine-independent cpufreq interface provides a framework for CPU
     frequency scaling done by a machine-dependent backend implementation.
     User space control is available via cpuctl(8).

     The cpufreq interface is a per-CPU framework. It is implicitly assumed
     that the frequency can be set independently for all processors in the
     system. However, cpufreq does not imply any restrictions upon whether
     this information is utilized by the actual machine-dependent
     implementation. It is possible to use cpufreq with frequency scaling
     implemented via pci(4). In addition, it is assumed that the available
     frequency levels are shared uniformly by all processors in the system,
     even when it is possible to control the frequency of individual
     processors.

     It should be noted that the cpufreq interface is generally stateless.
     This implies for instance that possible caching should be done in the
     machine-dependent backend. The cpufreq_suspend() and cpufreq_resume()
     functions are exceptions. These can be integrated with pmf(9).

FUNCTIONS
     cpufreq_register(cif)
          The cpufreq_register() function initializes the interface by
          associating a machine-dependent backend with the framework. Only
          one backend can be registered. Upon successful completion,
          cpufreq_register() returns 0 and sets the frequency to the maximum
          available level.

          The following elements in the cpufreq_if structure should be
          filled prior to the call:

               char                    name[CPUFREQ_NAME_MAX];
               struct cpufreq_state    state[CPUFREQ_STATE_MAX];
               uint16_t                state_count;
               bool                    mp;
               void                   *cookie;
               xcfunc_t                get_freq;
               xcfunc_t                set_freq;

          ·   The name of the backend is required.

          ·   The cpufreq_state structure conveys descriptive information
              about the frequency states. The following fields can be used
              for the registration:

                   uint16_t            freq;
                   uint16_t            power;

              From these freq (the clock frequency in MHz) is mandatory,
              whereas the optional power can be filled to describe the
              power consumption (in mW) of each state.

          ·   The state_count defines the number of states that the backend
              has filled in the state array.

          ·   The mp boolean should be set to false if it is known that the
              backend can not handle per-CPU frequency states; changes
              should always be propagated to all processors in the system.

          ·   The cookie field is an opaque pointer passed to the backend
              when cpufreq_get(), cpufreq_set(), or cpufreq_set_all() is
              called.

          ·   The get_freq and set_freq are function pointers that should be
              associated with the machine-dependent functions to get and set
              a frequency, respectively. The xcfunc_t type conforms to
              xcall(9). When the function pointers are invoked by cpufreq,
              the first parameter is always the cookie and the second
              parameter is the frequency, defined as uint16_t *.
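The two views in this message, an index in [0, nstates - 1] versus a frequency in MHz, can be bridged with a small lookup over the state array. The sketch below is a simplified userland mirror of the structures described in the manual page draft, not the actual kernel code; `struct freq_backend`, `backend_nstates()`, `backend_nearest()` and the value of `STATE_MAX` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_MAX 32	/* assumed limit, stands in for CPUFREQ_STATE_MAX */

/* Simplified userland mirror of the state array described above. */
struct freq_state {
	uint16_t freq;		/* clock frequency in MHz (mandatory) */
	uint16_t power;		/* power in mW (optional, 0 if unknown) */
};

struct freq_backend {
	struct freq_state state[STATE_MAX];
	uint16_t state_count;
};

/* Return the number of states the backend has filled in. */
static uint16_t
backend_nstates(const struct freq_backend *be)
{
	assert(be != NULL);
	return be->state_count;
}

/*
 * Return the index of the state whose frequency is closest to the
 * requested one; the first such state wins on ties.
 */
static uint16_t
backend_nearest(const struct freq_backend *be, uint16_t freq)
{
	uint16_t i, best = 0;
	uint32_t d, dbest = UINT32_MAX;

	assert(be != NULL && be->state_count > 0);
	for (i = 0; i < be->state_count; i++) {
		d = be->state[i].freq > freq ?
		    (uint32_t)(be->state[i].freq - freq) :
		    (uint32_t)(freq - be->state[i].freq);
		if (d < dbest) {
			dbest = d;
			best = i;
		}
	}
	return best;
}
```

With a lookup like this, a caller may pass any frequency and the backend still ends up selecting one of its discrete states, which is one possible way to keep the interface stateless.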
Re: A simple cpufreq(9)
On Mon, Sep 26, 2011 at 05:51:13PM +, Christos Zoulas wrote: Why advertise uint16_t, are we trying to save memory? I would just do them uint32_t... While few things are certain in computing, I don't think we are going to see a 65535 MHz processor any time soon. But sure, uint32_t is fine too. - Jukka.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 07:53:47PM +0200, Joerg Sonnenberger wrote: I was listing possible decision making factors. Depending on your environment, you have all or none of them. The main point is that good decision making needs more than just "You can toggle this".

So here is a quick draft for the first iteration with the cpuctl(8). If there are issues, speak now, otherwise I'll proceed with something based on this.

- Jukka.

/* $NetBSD$ */

/*-
 * Copyright (c) 2011 Jukka Ruohonen <jruoho...@iki.fi>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
#include <sys/cdefs.h>
__KERNEL_RCSID(0, "$NetBSD: subr_cpufreq.c,v 1.15 2011/09/02 22:25:08 Exp $");

#include <sys/param.h>
#include <sys/cpu.h>
#include <sys/cpufreq.h>
#include <sys/kmem.h>
#include <sys/module.h>
#include <sys/mutex.h>
#include <sys/time.h>
#include <sys/xcall.h>

static struct cpufreq_if	*cpufreq_if = NULL;

static int	cpufreq_latency(void);

int
cpufreq_register(struct cpufreq_if *cif)
{
	size_t i, j;
	int rv;

	KASSERT(cif != NULL);
	KASSERT(cif->get_freq != NULL);
	KASSERT(cif->set_freq != NULL);
	KASSERT(cif->state_count > 0);
	KASSERT(cif->state_count < CPUFREQ_STATE_MAX);

	mutex_enter(&cpu_lock);

	if (cpufreq_if != NULL) {
		mutex_exit(&cpu_lock);
		return EALREADY;
	}

	mutex_exit(&cpu_lock);
	cpufreq_if = kmem_zalloc(sizeof(*cif), KM_SLEEP);

	if (cpufreq_if == NULL)
		return ENOMEM;

	mutex_enter(&cpu_lock);

	cpufreq_if->cookie = cif->cookie;
	cpufreq_if->get_freq = cif->get_freq;
	cpufreq_if->set_freq = cif->set_freq;

	for (i = j = 0; i < cif->state_count; i++) {

		if (cif->state[i].freq == 0)
			continue;
		else {
			j++;
		}

		cpufreq_if->state[i].freq = cif->state[i].freq;
		cpufreq_if->state[i].power = cif->state[i].power;
	}

	cpufreq_if->state_count = j;
	rv = cpufreq_latency();
	mutex_exit(&cpu_lock);

	return rv;
}

void
cpufreq_unregister(struct cpufreq_if *cif)
{

	mutex_enter(&cpu_lock);

	if (cpufreq_if == NULL) {
		mutex_exit(&cpu_lock);
		return;
	}

	mutex_exit(&cpu_lock);
	kmem_free(cpufreq_if, sizeof(*cif));
}

static int
cpufreq_latency(void)
{
	struct cpufreq_state *state;
	struct timespec nta, ntb;
	const size_t n = 10;
	uint64_t sample;
	size_t i, j;

	/*
	 * Sample the transition latency for each state.
	 */
	for (i = 0; i < cpufreq_if->state_count; i++) {

		state = &cpufreq_if->state[i];

		KASSERT(state->freq > 0 && state->freq < );

		for (j = 0, sample = 0; j < n; j++) {

			nta.tv_sec = nta.tv_nsec = 0;
			ntb.tv_sec = ntb.tv_nsec = 0;

			nanotime(&nta);
			mutex_exit(&cpu_lock);
			/* "state" already points at entry i. */
			cpufreq_set(curcpu(), state->freq);
			mutex_enter(&cpu_lock);
			nanotime(&ntb);
			timespecsub(&ntb, &nta, &ntb);

			/*
			 * If the transition latency is measured
			 * in seconds, the backend is not suitable.
			 */
			if (ntb.tv_sec != 0)
				continue;

			sample += ntb.tv_nsec;
		}

		if (sample == 0)
			return EMSGSIZE;

		state->latency = sample
Re: A simple cpufreq(9)
On Sun, Sep 25, 2011 at 10:50:44AM +0200, Alan Barrett wrote: On Sun, 25 Sep 2011, Jukka Ruohonen wrote: So here is a quick draft for the first iteration with the cpuctl(8). If there are issues, speak now, otherwise I'll proceed with something based on this. You forgot to include the documentation. It is not production code, but it should be pretty straightforward to see the design blocks without documentation. From the issues discussed, MHz, power reduction (if available), and estimated transition latency are exported to userland. - Jukka.
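An estimated transition latency of the kind mentioned above can be derived from paired timestamps taken before and after each switch. The following is a minimal userland sketch under stated assumptions: `latency_average_ns()` is a hypothetical name, and averaging over the usable samples is an assumption, since the draft's own final computation is not shown in full. As in the draft, samples that take a second or longer are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Average the sub-second differences between paired timestamps.
 * Samples that took a full second or more are skipped. Returns 0
 * when no sample qualifies.
 */
static uint64_t
latency_average_ns(const struct timespec *before,
    const struct timespec *after, size_t n)
{
	uint64_t sum = 0;
	size_t i, used = 0;
	time_t sec;
	long nsec;

	assert(n == 0 || (before != NULL && after != NULL));
	for (i = 0; i < n; i++) {
		sec = after[i].tv_sec - before[i].tv_sec;
		nsec = after[i].tv_nsec - before[i].tv_nsec;
		if (nsec < 0) {		/* borrow from the seconds field */
			sec--;
			nsec += 1000000000L;
		}
		if (sec != 0)
			continue;	/* too slow, or the clock stepped */
		sum += (uint64_t)nsec;
		used++;
	}
	return used == 0 ? 0 : sum / used;
}
```

A result of zero would then play the same role as the draft's `sample == 0` check: no usable measurement was obtained for that state.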
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 06:10:46AM +, Michael van Elst wrote: Pick one low and one high value and select these when the core/cpu/machine is idle vs. when the machine is busy. Selecting the low and the high value is much easier for a human when you label it in terms of the specific platform instead of a misleading percentage. I agree: this is naturally the simplest and most consistent approach. (And most of all, it is that especially for the kernel.) In my opinion this would satisfy well the requirements of CPU frequency scaling in NetBSD. But do we have strong cases in favor of the intermediate values? As I see it, stepping always to the highest value when under load would only trade some minor power consumption increase for simplicity. - Jukka.
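The two-value policy discussed above is small enough to state directly. The sketch below is only an illustration; `two_level_pick()` is a hypothetical name, and how "idle" is determined is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pick the low frequency when idle, the high one otherwise (in MHz). */
static uint16_t
two_level_pick(bool idle, uint16_t low_mhz, uint16_t high_mhz)
{
	assert(low_mhz <= high_mhz);
	return idle ? low_mhz : high_mhz;
}
```

The administrator supplies the two platform-specific values once, and any intermediate states the hardware may offer are simply never selected.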
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 04:33:07PM +1000, matthew green wrote: i think that joerg's point is that the kernel-user API is the wrong place to be making these sorts of humanisations. You all miss the point that these humanisations (a.k.a. abstractions) are primarily targeted for the use in sys/kern. The kernel-user API is just a convenient side product of these abstractions. - Jukka.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 10:14:13AM +0200, Joerg Sonnenberger wrote: For the in-kernel use it should be completely irrelevant what the property is. Sigh. Answer my question then. Any automatic mechanism should only care about the associated properties (latency, reduction in power consumption etc). I don't see any of that addressed at this point either. If you read the original email again, it was not even the goal to address this. But all specific things like latency and the specified reduction in power consumption are going to be MD-specific implementation details. An MI implementation should be designed so that it will work also when these are missing. A basic design principle. In order to be able to proceed incrementally, you need to start somewhere. - Jukka.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 08:35:11AM +, Michael van Elst wrote: What is wrong with the abstraction of having a number of ordered performance states? Hiding the states behind something pretending to be a continuum (whether this is a MHz value or a percentage doesn't matter) causes confusion. You never know what performance state you selected and you start to assume that 100% is twice as fast as 50% even when there is no correlation. Nothing. As I wrote, (an integer) percentage is an ordered scale. I agree: perhaps not the best one. But you have to satisfy: 1. Boolean scale (already: ichlpcib(4), piixpcib(4), PowerPC). 2. Interval scale (usually x86). 3. Interval scale with nonuniform intervals (can be on x86/ARM). If I'd have to pick, I would take the first one, as you originally wrote. - Jukka. PS. And this assertion holds: a machine-independent implementation that can not satisfy the six machine-dependent implementations currently in the tree is inherently and completely broken. Thus, mixing latencies and whatnot to the soup can be done later, preferably in the MD implementations.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 10:14:13AM +0200, Joerg Sonnenberger wrote: For the in-kernel use it should be completely irrelevant what the property is. Any automatic mechanism should only care about the associated properties (latency, reduction in power consumption etc). I don't see any of that addressed at this point either. So let's look at this from a different POV. Here is a datasheet for the reliable information: 1. Maximum and minimum (or on/off). 2. Idle times, etc. (MI stuff). 3. Transition latency (easily simulated). Here is a datasheet for unreliable and unknown data: 1. Actual clock rate. Unreliable. We all know what est(4) and powernow(4) are like; no policy should rely on any intermediate values that come out of these two. As usual, with ACPI there are a lot of known bugs about incorrect values. Unknown. This is the case with ichlpcib(4) and piixpcib(4). Might be the case on some embedded/exotic things as well. Maintenance. If the information is available from a datasheet, it requires tabulation for each CPU model. 2. Power per state. In the majority of cases this is entirely unknown, and if coming from the BIOS, too unreliable. I am quite familiar with the Linux cpufreq subsystem and I am not convinced at all that we want something like that. I vote for a simple on/off boolean. - Jukka.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 07:53:47PM +0200, Joerg Sonnenberger wrote: It's not relevant what the exact clock rate is. It's an approximation. Just like the TSC frequency won't be measured the same on every boot. It would be relevant if the interval would be guaranteed to be uniform. Ignoring intermediate values can be literally a lot of unwanted noise. On my old laptop, I couldn't play all medium quality H.264 streams at smallest CPU frequency. It worked with some of the intermediate levels and those create enough less heat that it makes a difference in terms of fan activity. My point is that not every load switches between idle and 100%. Naturally. But given the lack of proper information, you end up doing a crazy guessing game with the steppings in order to see whether the frequency is at a sustainable level w.r.t. the load. This is what the Linux governors are all about. And Linux people are rewriting the ondemand governor since it really doesn't work that well, even after all these years. What a boolean gives you is: simplicity and a bias towards performance (which I think should be the priority on NetBSD generally). This is traded for a minor power consumption increase and possibly heat. Should be fine for servers and most laptop users. And as far as x86 is concerned, the power savings from CPU are really coming from C-states today. So one can debate whether the 4000 LOC complexity of Linux's cpufreq subsystem is really worth the trouble. - Jukka.
Re: A simple cpufreq(9)
On Sun, Sep 25, 2011 at 07:06:35AM +1000, matthew green wrote: (FWIW, ultrasparcIIIi has cpufreq features, iirc, it allows the freq to run at 1/2 and 1/16th normal. i'm sure that the modern fujitsu SPARC64 also has it, but i don't know much about it.) When one thinks about the modern world and especially AMD CPUs that can already do per-CPU(group) states, setting the minimum and maximum sounds attractive and reasonable. For instance, if you set CPUs offline, also the frequency should scale down. i'd like to re-iterate what i said earlier though -- i'd really much rather this became real-code in the tree sooner than when it becomes a perfect API. Fine. Let's start by importing a proplist of MHzs to cpuctl(8). At least the current mess is solved. - Jukka.
A simple cpufreq(9)
Hello.

The kernel needs an MI interface for CPU frequency scaling. Below is a draft that is deliberately as simple as possible. This is NOT about frequency scaling done by the kernel as a governor (although the long-term goal should point to that direction). The present goal is just to add a simple MI interface (or rather, wrapper) and abstract away machine- and platform-dependent code and user interfaces.

As far as the implementation goes, this would add two simple MD callback functions to cpu_info. All ugly MD sysctls would be deprecated and setting the frequency would be done by cpuctl(8) via these callbacks.[1]

The only interesting detail is that the term CPU frequency is defined as a percentage value: 100 % implies full performance and 0 % denotes the lowest performance. This follows OpenBSD, being also one of the few possible ways to define CPU frequency levels independently from the machine. All details are left to the MD implementations, including how to translate the percentage to actual frequency (and/or voltage, etc.). For instance, for systems with only two states, all values higher than zero would set the high performance mode.[2]

Comments?

- Jukka.

[1] The currently known users would include x86 (with five (!) different implementations) and PowerPC, but frequency scaling is nowadays widely used also in the ARM realm.

[2] Note that even in the x86 land it is no longer necessarily known which is the exact MHz at which the CPU currently runs (cf. TurboBoost, etc.).

* * *

CPUFREQ(9)             NetBSD Kernel Developer's Manual             CPUFREQ(9)

NAME
     cpufreq -- interface for CPU frequency scaling

SYNOPSIS
     #include <sys/cpufreq.h>

     void
     cpufreq_register(cpufreq_get_cb *get, cpufreq_set_cb *set, void *aux);

     bool
     cpufreq_get(struct cpu_info *ci, uint8_t *valp);

     bool
     cpufreq_set(struct cpu_info *ci, const uint8_t val);

DESCRIPTION
     The machine-independent cpufreq interface provides a simple framework
     for CPU frequency scaling.

     1.   The cpufreq interface uses percentage values in place of actual
          frequencies. Thus, values 100 % and 0 % denote the highest and
          lowest frequency supported by the CPU. It is the responsibility
          of machine-dependent implementations to translate the percentage
          values to actual frequencies or other related performance levels.

     2.   The cpufreq interface is stateless and does no locking while
          calling the machine-dependent callbacks.

     3.   The cpufreq interface is a per-CPU framework. It is implicitly
          assumed that the frequency can be set independently for all
          processors in the system. However, cpufreq does not imply any
          restrictions upon whether this information is utilized by the
          actual machine-dependent implementation.

FUNCTIONS
     cpufreq_register(get, set, aux)
          The cpufreq_register() function initializes the subsystem by
          associating the machine-dependent callback functions get and set
          with the machine-independent cpufreq_get() and cpufreq_set(),
          respectively. The cpufreq_set_cb and cpufreq_get_cb types are
          function pointers defined as:

               bool (*get)(struct cpu_info *ci, void *aux, uint8_t *valp)
               bool (*set)(struct cpu_info *ci, void *aux, const uint8_t val)

          Note that cpufreq does not keep track of the registered
          callbacks. Each call to cpufreq_register() will override any
          existing callbacks.

     cpufreq_get(ci, valp)
          The cpufreq_get() function obtains the current frequency level of
          the CPU pointed to by ci in the parameter valp.

     cpufreq_set(ci, val)
          The cpufreq_set() function sets the performance level of ci to
          val. The value val is guaranteed to be in the range [0, 100].

CODE REFERENCES
     The cpufreq subsystem is implemented within sys/kern/subr_cpufreq.c.

SEE ALSO
     cpuctl(8)

HISTORY
     The cpufreq subsystem first appeared in NetBSD 6.0.
Re: core's decision on modular kernels
On Thu, Sep 22, 2011 at 08:35:02PM +0100, David Laight wrote: I think that by MODULAR with built-in modules, you mean a barebones kernel linked with some .kmod's? I would love to see that. What has to happen to make it so? Probably just some 'round tuits'. Mostly in the area of config() and the kernel makefile. First stage would be linking an existing kmod into the kernel and sorting out the required data area linkage to get it initialised. Wouldn't this also provide an answer to the difficult question of autoloading driver modules? Assuming a robust mechanism, link everything and then selectively unload what did not attach during autoconfiguration? - Jukka.
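The "link everything, then unload what did not attach" idea can be sketched as a single pass over a table of built-in modules. This is a userland illustration only; `struct builtin_mod` and `unload_unattached()` are hypothetical names and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a built-in driver module. */
struct builtin_mod {
	const char *name;
	bool loaded;	/* still linked in and active */
	bool attached;	/* at least one device instance attached */
};

/*
 * After autoconfiguration, drop every loaded module that has no
 * attached device. Returns the number of modules unloaded.
 */
static size_t
unload_unattached(struct builtin_mod *mods, size_t n)
{
	size_t i, dropped = 0;

	assert(mods != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (mods[i].loaded && !mods[i].attached) {
			mods[i].loaded = false;
			dropped++;
		}
	}
	return dropped;
}
```

The hard part, which this sketch assumes away, is the "robust mechanism" itself: knowing reliably that a driver did not attach and that nothing else depends on it.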
Re: A simple cpufreq(9)
On Fri, Sep 23, 2011 at 08:42:16PM +0200, Joerg Sonnenberger wrote: On Fri, Sep 23, 2011 at 01:02:52PM +0300, Jukka Ruohonen wrote: [2] Note that even in the x86 land it is no longer necessarily known which is the exact MHz at which the CPU currently runs (cf. TurboBoost, etc.). TurboBoost is a good reason for why percent is a bad measurement as well. In fact, I find it more confusing. If a tool reports to the user that NetBSD runs the CPU at 85% during a build.sh -j16, that's going to result in surprising questions... Well this is not really the case with TurboBoost; while there are a few somewhat vague means to know the turbo in the kernel (not as a MHz though), this would show in the userland as 100 %, like it would show for the users of the MI functions. Consider making the unit of scaling an additional attribute of the list and provide userland with: id, data, unit as list, get/set is using the id. I particularly wanted to avoid importing any lists to the MI kernel parts or to the userland. Using a percentage like done in OpenBSD would greatly simplify further uses in the kernel. But of course a unit of measurement (MHz, voltage, etc.) can be imported for heuristic purposes. Nor do I see any real difference whether a user sets the CPU frequency to { 16 %, 35 %, 100 % } MHz or to { 821, 922, 1657 } MHz, except that the former is clearer and more user friendly. Note also that for instance some ARM systems may use very fine grained lists. - Jukka.
Re: A simple cpufreq(9)
On Sat, Sep 24, 2011 at 07:20:16AM +0200, Joerg Sonnenberger wrote: You can not avoid providing a list of available states. Such an interface is inherently and completely broken.

Heh, right. Why?

Nor do I see any real difference whether a user sets the CPU frequency to { 16 %, 35 %, 100 % } MHz or to { 821, 922, 1657 } MHz, except that the former is clearer and more user friendly.

Sorry, it is not.

So you propose sailing to the dark waters of sysmon_envsys(9)? You need to export integers (e.g. MHz), booleans (on/off), triplets (low/medium/high), and so on, all depending on the machine and/or platform.

Removing data because it is more user friendly kind of misses the point that the kernel is not the UI.

The exported data is not reliable. On x86 it is typically rounded and approximated by the BIOS writers.

Note also that for instance some ARM systems may use very fine grained lists.

...and?

Try to think beyond cpuctl(8). This should be done so that there is a *simple* MI interface that can be extended in the future. If you want an MI interface, the first thing is to agree upon a common scale. Frequency scaling is a step function, so you could start from

1. { LOW, HIGH } or { DISABLED, ENABLED }.

Maybe you could then proceed with

2. { LOW, MEDIUM, HIGH } or { 800 MHz, 1200 MHz, 1600 MHz }.

But how about

3. { 800 MHz, 805 MHz, 810 MHz, 815 MHz, 820 MHz, 1300 MHz, 1800 MHz }?

4. And how about some ARM system that may export over 30 states, possibly with non-uniform intervals?

Can you outline a consistent algorithm for a user-space or in-kernel governor to choose a state from these four examples? As such, a percentage here is nothing more than a scale from zero to hundred. It would be the responsibility of the MD implementation to interpolate this to whatever scale it may be using.

But of course it is inherently and completely broken.

- Jukka.
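One way an MD implementation might interpolate a percentage onto its own table of states is sketched below. This is an assumption made for illustration, not something specified in the thread: `pct_to_index()` is a hypothetical helper, the table is assumed to be sorted from the lowest to the highest frequency, and rounding up is chosen so that on a two-state system every value above zero selects the high state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Map a percentage in [0, 100] to an index into a table of nstates
 * entries sorted from the lowest to the highest frequency. 0 % maps
 * to the lowest state and 100 % to the highest; values in between
 * are rounded up to the next state.
 */
static size_t
pct_to_index(uint8_t pct, size_t nstates)
{
	assert(pct <= 100 && nstates > 0);
	/* Integer form of ceil(pct * (nstates - 1) / 100). */
	return ((size_t)pct * (nstates - 1) + 99) / 100;
}
```

Applied to the third example above, 50 % selects the 815 MHz entry rather than anything near half of 1800 MHz, which also illustrates the objection raised elsewhere in the thread that a percentage says little about actual speed when the intervals are non-uniform.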
Re: strnlen(3) in kernel
On Tue, Sep 06, 2011 at 01:39:14PM +0200, Jean-Yves Migeon wrote: Is there a way to know what functions are available from libkern, and those only found in userland libs? Except by looking at libkern.h? No. While there are memcpy(9) etc., I think we should have a single libkern(3) with references to the section 3 (with some notes, if necessary). - Jukka.
Re: autoclean mode for tmpfs
On Sun, Aug 07, 2011 at 03:10:29AM +, David Holland wrote: So I just had an idea: since cleaning /tmp on a live system is very dangerous unless done so (and even then somewhat dangerous), plus there are other possible uses for automatically disappearing files: How hard would it be to add a mount option for tmpfs to automatically drop files after a given timeout? It seems to me that it shouldn't be very difficult, but I haven't looked at the tmpfs innards in a while. Anyone think this is worthwhile? Sounds like a job for the userland and cron(8). - Jukka.
Re: autoclean mode for tmpfs
On Sun, Aug 07, 2011 at 07:09:14AM +, David Holland wrote: Sounds like a job for the userland and cron(8). uh no. See: since cleaning /tmp on a live system is very dangerous So care to elaborate what is dangerous about it? I do clean /tmp daily, but it needs to be done selectively. - Jukka.
Re: pchb@acpi
On Mon, Aug 01, 2011 at 08:59:57PM +0200, Matthias Drochner wrote: I think it is OK to attach the PCI buses which are defined by ACPI at acpi. The attachment frontend can install hooks to get interrupt routing right. This would also help wakeup support for eg USB and ethernet devices. Indeed. We need this for all PCI buses and devices. That is why hacks like device_is_a() etc. won't do. And as you noted, there is an awful lot of ugly duplication because ACPI is already heavily required for x86 interrupt routing. That said, I don't think this kind of attachment is required for the IRQ setup per se (at least not in my branch). - Jukka.
Re: RFC: New security model secmodel_securechroot(9)
On Sat, Jul 23, 2011 at 09:35:43PM +0300, Aleksey Cheusov wrote: * Exec logging within chroot What's this? It has been quite a while since I used Grsecurity, but it logs a message every time a program is executed within a chroot. This may be useful to audit chroot'ed daemons, but if I remember correctly, this was a compile-time option in Linux. - Jukka.
Re: Dutch keymap not imported into NetBSD :p
On the X30 I got NetBSD-5.1 and there is no nl keymap. Google pointed me to NetBSD problem report number 35473. Could Spanny patch be included into NetBSD-current ? Yes, I will commit it shortly. - Jukka.
Re: RFC: New security model secmodel_securechroot(9)
On Thu, Jul 14, 2011 at 12:07:56AM +0300, Aleksey Cheusov wrote: So what is the security policy you mean to enforce by blocking paths into the kernel with kauth? For every `destructive modification' that can be done to the system, what is every path into the kernel that leads to that modification? Have you blocked all such paths in your kauth secmodel? I'm open for concrete ideas and references. I haven't followed the discussion that closely, but the following list appears in the chroot(2) restrictions of the PaX/Grsecurity (Linux) project:
* No attaching shared memory outside of chroot
* No kill outside of chroot
* No ptrace outside of chroot (architecture independent)
* No capget outside of chroot
* No setpgid outside of chroot
* No getpgid outside of chroot
* No getsid outside of chroot
* No sending of signals by fcntl outside of chroot
* No viewing of any process outside of chroot, even if /proc is mounted
* No mounting or remounting
* No pivot_root
* No double chroot
* No fchdir out of chroot
* Enforced chdir(/) upon chroot
* No (f)chmod +s
* No mknod
* No sysctl writes
* No raising of scheduler priority
* No connecting to abstract unix domain sockets outside of chroot
* Removal of harmful privileges via capabilities
* Exec logging within chroot
- Jukka.
Re: IOC_CPU_SETSTATE
On Sun, Jul 03, 2011 at 10:49:59PM +0100, Alexander Nasonov wrote: BTW, intr/nointr is not documented in cpuctl(8). One possible reason for this is that per-CPU intr/nointr is not yet supported on e.g. x86, AFAIK. - Jukka.
Re: add DIAGNOSTIC back to GENERIC/INSTALL
On Sun, Jul 03, 2011 at 07:27:00PM +0200, Manuel Bouyer wrote: it's not only about Xen, it's about all kernels for any port which already have DIAGNOSTIC and want to keep it even for release (e.g. i386 ALL). As far as I understand, i386/ALL is just for testing the compilation of various options and drivers. I doubt whether it even boots. - Jukka.
Re: uvm locking inconsistency
On Wed, Jun 15, 2011 at 09:30:17PM +0200, Manuel Bouyer wrote: I fear so, sadly. I think DIAGNOSTIC should be back in x86 GENERIC kernels on HEAD (this can be switched off in release branches) On the contrary, I think every viable debug option (DIAGNOSTIC + LOCKDEBUG at least) should be enabled in HEAD, but disabled in release kernels. It is an easy way to catch obvious regressions that should never enter a release kernel. The so-called HEAD is the main development branch, after all... - Jukka.
Re: Merge of rmind-uvmplock branch
On Tue, May 31, 2011 at 10:15:36PM +0100, Mindaugas Rasiukevicius wrote: Unless anyone objects, I will merge rmind-uvmplock branch. The technical objectives of the branch are described here: Indeed, and as usual, extraordinary work! - Jukka.
Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)
On Sat, May 07, 2011 at 09:03:42PM +0200, Jean-Yves Migeon wrote: - sysmon_pswitch(9) can still be used to register power switch events, these events being modeled following a switch functionality e.g. when a threshold is passed. Yes. Although I don't know what you mean by thresholds. - pmf(9) is focused on device states, so it's lower level than sysmon_pswitch(9) events. pmf(9) event injection is not supposed to be called directly, but rather through sysmon (for switch-like functionality), or within pmf(9) itself for inter-device signaling. No. Device drivers are calling pmf(9) event injections directly. I think Jared or Jörg should clarify this, but I think the pmf(9) calls you cited earlier were added to the sysmon routines for compatibility-like reasons. To be effective, there also needs to be a listener for the injected events. So, in the current form, power switches/buttons are not supposed to register as devices and implement their own hooks for registration with pmf(9)? I am not sure what you mean by this. For instance, a platform/laptop-specific driver registers naturally with pmf(9), but it may also use the sysmon routines for various tasks (e.g. some hotkeys are also handled by the sysmon routines). There is no grand scheme of things. It is just duplication. - Jukka.
Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)
On Fri, May 06, 2011 at 04:45:55PM +0100, Jean-Yves Migeon wrote: 1 - I shall patch sysmon_pswitch_event and add a callback for sleep that MD code can register, 2 - or register a pmf(9) event handler during hypervisor attachment, and just use pmf_event_inject() in the /* XXX */ sleep path that will trigger this handler. Either one is fine by me. Perhaps the latter approach sounds slightly better, as it uses the already existing KPI and avoids patching the already convoluted sysmon routines. - Jukka.
Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)
On Fri, May 06, 2011 at 10:35:30AM +0100, Jean-Yves Migeon wrote: Yes. However, in the Xen domU case, it is quite unacceptable. Anyone willing to suspend a domain would launch xm save from dom0. If powerd(8) is not running, the xm save will wait ~forever for the domU to signal it's ready for suspension. I'd like to have a shortcut that handles the powerd is not running step, even if that means that specific services have not been turned off cleanly via scripts/sleep_button. Speaking about normal x86 and other architectures, we should pick good defaults but not tie things to the kernel. Formulating a one-and-true policy or power-event state machine is not a goal that can even be reached. I want my laptop to suspend when the lid is closed, but someone else may not. It is more than natural that things like this are handled in user space. As is done currently with powerd(8), it is also a good idea to shut down other daemons before entering a suspended state. This situation also applies to the power button, but this case is already handled [1]. Albeit, not sleep, hence the XXX I believe. As I've written already, powerd(8) should be enabled by default on the stock rc.conf(5). This is again something that should not require manual tuning. I respectfully disagree. The PSWITCH_TYPE_LID event is first handled by sysmon(9), then injected in pmf(9). See [2]. [...] The sysmon_pswitch_register(9) is indeed a NOP (it is supposedly there to account some possible future use). But sysmon_pswitch_event() is not a NOP. It does not inject anything to pmf(9). It does. See [2]. Ah, right. Of course you should follow what those injections actually do and where the listeners are? The main function in sysmon_power.c is:

	if (sysmon_power_daemon != NULL) {
		/*
		 * Create a new dictionary for the event.
		 */
		ped = kmem_zalloc(sizeof(*ped), KM_NOSLEEP);
		if (!ped)
			return;
		ped->dict = prop_dictionary_create();

		if (sysmon_power_daemon_task(ped, smpsw, event) == 0)
			return;
	}

- Jukka.
[1] In lack of a better reference see e.g. http://lists.xensource.com/archives/html/xen-devel/2010-05/msg00115.html
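The disagreement in this thread is about what should happen to a sleep request when no power daemon is listening. The control flow can be modelled outside the kernel. The following is a minimal user-space sketch under stated assumptions, not the sysmon code: struct power_model, model_pswitch_sleep(), the md_sleep hook and model_md_sleep_ok() are all hypothetical names invented for the illustration, and the fallback hook merely stands in for whatever MD or pmf(9)-based handler would be registered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state relevant to a sleep request. */
struct power_model {
	bool daemon_running;	/* stands in for "sysmon_power_daemon != NULL" */
	int (*md_sleep)(void);	/* optional in-kernel fallback; may be NULL */
	int daemon_events;	/* number of events handed to the daemon */
};

enum dispatch_result { SENT_TO_DAEMON, HANDLED_IN_KERNEL, DROPPED };

/* Example fallback that always succeeds (returns 0). */
int
model_md_sleep_ok(void)
{
	return 0;
}

/*
 * Deliver a sleep request: prefer the user-space daemon, fall back to
 * the registered handler if there is one, and otherwise drop the event.
 */
enum dispatch_result
model_pswitch_sleep(struct power_model *pm)
{
	if (pm->daemon_running) {
		pm->daemon_events++;
		return SENT_TO_DAEMON;
	}
	if (pm->md_sleep != NULL && pm->md_sleep() == 0)
		return HANDLED_IN_KERNEL;

	return DROPPED;
}
```

In this model the Xen domU complaint corresponds to the DROPPED case: without a daemon and without a fallback, the request simply goes nowhere and the caller in dom0 keeps waiting.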
Re: pmf(9) vs sysmon for power events (especially sleep when powerd(8) is not running)
On Thu, May 05, 2011 at 05:56:43PM +0100, Jean-Yves Migeon wrote: I am experiencing some difficulties regarding the duplication of functionality provided by sysmon_*(9) and pmf(9) APIs, for everything that has to deal with power management event. The duplication is a known and unfortunate issue. Also many drivers suffer from this. My personal opinion is that we should either rework and cleanup sysmon's power-related KPI or slowly deprecate it. But, still, pmf(9) can not do the job alone (at least currently). Disclaimer: this is for suspend/save events, whatever you name them; each implementation has its own way of specifying them: Xen domU assume that sleep/suspension is a serialization of VM memory state to a disk file, while ACPI have different expectations depending on level (suspend to RAM, suspend to disk, states, etc.) So you take the stance that there will never be normal (APM/ACPI/XXX) suspend states in Xen? I think Linux supports this already. Thus, generally, any KPI should handle multiple backends with maybe slightly diverging conceptual definitions. Currently, we have two frameworks: pmf(9) and the different sysmon_*(9) routines. As I see them, pmf(9) is fairly lower level, and covers only device attach/detach/suspension (and inter driver signaling). sysmon_*(9) are userlevel oriented, and certain events can even be managed by userland through powerd(8) (please confirm about these goals/non goals). This is a quite adequate description. Note that it is still desirable to have some (but not necessarily all) events delivered to user space. This is the main task that is currently handled by the sysmon-routines + powerd(8). Except for specific situation, high level events (LID open/close, power button press) are first handled via sysmon, then injected to drivers via pmf. In most cases it is either, not both. 
Would the sysmon_power backends be a long term replacement for the various shutdown/reboot/sleep/power control (power-on scheduling, sleep states) hooks, or should it be just regarded as the registration of a sleep handler, and nothing more? As said, the first approach requires a major cleanup and rationalization of the sysmon_power backend. The second approach may sound reasonable as an intermediate or a temporary solution for the immediate requirements of 6.0. That is, I think no one expects you to write a full-blown KPI for this -- a task that is quite non-trivial, as is manifested by the current duplication. I am also having a hard time figuring out the difference between the goals of sysmon_pswitch_register(9) and pmf_device_register(9). Both are supposed to handle power events, but sysmon_pswitch_register(9) is now a NO-OP, with everything directly injected into pmf(9). The sysmon_pswitch_register(9) is indeed a NOP (it is supposedly there to account some possible future use). But sysmon_pswitch_event() is not a NOP. It does not inject anything to pmf(9). BTW, would the handler be supposed to be called only when powerd(8) is running (with the sleep_button script execing zzz(8)), or could it be used when it is not, including situation where there's no real thread context (on interrupts)? Do not confuse the sleep_button script with the issue at hand. As the name indicates, it delivers events from buttons that are physically present on a computer. I think there should be no requirements for this to work in interrupt context (if there is, the drivers should do something about it). - Jukka.
Re: kernel bitreverse function
On Sun, Apr 03, 2011 at 05:09:55PM +0200, Frank Wille wrote: Did somebody already try to implement it? If not, I would suggest the following code for src/sys/lib/libkern: [...] Any comments? Then please speak now. :) Just a footnote: wouldn't sys/bitops.h be a better place logically? - Jukka.
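The code proposed in the original message is elided above ("[...]"), so the following is not Frank's implementation. It is only a generic sketch of what a bit-reverse helper of this kind usually looks like, using the common swap-adjacent-groups technique; the names bitreverse32() and bitreverse8() are assumptions chosen for the illustration.

```c
#include <stdint.h>

/*
 * Reverse the bit order of a 32-bit word by swapping progressively
 * larger groups: single bits, pairs, nibbles, bytes, then half-words.
 */
uint32_t
bitreverse32(uint32_t x)
{
	x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
	x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
	x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
	x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
	return (x >> 16) | (x << 16);
}

/* Reverse the bit order of a single byte via the 32-bit helper. */
uint8_t
bitreverse8(uint8_t x)
{
	return (uint8_t)(bitreverse32(x) >> 24);
}
```

Being a pure function of its argument with no kernel dependencies, a helper like this would sit as comfortably in a header such as sys/bitops.h as in libkern.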
Re: kernel bitreverse function
On Sun, Apr 03, 2011 at 06:12:03PM +0200, Frank Wille wrote: Don't know about others, but my goal was to eliminate double code from the kernel. The use of the new functions should also be restricted to the kernel. While I have no real opinion for or against, I can certainly imagine finding use for a well-defined bit function like this also in user space. - Jukka.
Re: sysmon_pswitch_event(): provide a sleep routine when powerd(8) is not running
On Mon, Mar 28, 2011 at 01:33:45PM +0100, Jean-Yves Migeon wrote:
1 - modify sysmon_pswitch_event prototype so it can return an error (therefore leaving the possibility for the caller to fix the event by itself), OR
2 - add a MD system_suspend() routine, define it to NULL by default, and which can be overridden by MD should there be a need to call the suspend code without going through powerd(8) via sysmon_pswitch_event(), OR
3 - alternatively, add a RB_SLEEP flag to cpu_reboot(), which will basically do the same as the above, except that we could reuse part of the cpu_reboot function.
I would go for (3), perhaps with a -s flag to halt(8). This would also solve the user interface issue that remains unresolved in options (1) and (2). Extending halt(8) has been discussed also previously (cf. e.g. [1]). - Jukka.
[1] http://www.netbsd.org/contrib/projects.html#shutdowntime
Re: high sys time, very very slow builds on new 24-core system
On Wed, Mar 23, 2011 at 05:24:12PM -0400, Thor Lancelot Simon wrote: All cores spend well over 50% time in 'sys', even when all or almost all are running cc1 processes. The kernel is amd64 -current GENERIC from about 1 week ago -- no DIAGNOSTIC, DEBUG, KMEMSTATS, LOCKDEBUG, etc. Does anyone have any idea what might be wrong here? Another shot in the dark: AMD's so-called C1E is known to cause issues like this (in which case you might want to enable acpicpu(4)). - Jukka.
Re: BIOS/ACPI interrupt conflict
On Wed, Feb 09, 2011 at 04:47:12PM -0800, Cliff Wright wrote: Bios is correct, and ACPI wrong, I have seen this on other machines. And as I said in the 2007 email, even if ACPI had been the correct one, it still was not going to setup the interrupt. In this area, and with the current code base, it is very difficult to say who is wrong... Note that in theory the PCI interrupt link devices may contain different IRQ sets depending on whether PIC or I/O APIC is used, but I don't know how well the current code handles this. It occurred to me that maybe a test for an apic needs to be done. In my case where I have no apic, then the BIOS data has to be accepted because nothing else sets up the interrupt. Yes, if only 8259A PICs are used, probably no calls should be even made to mess with the (ACPI) PCI interrupt link devices. I am slowly working with an entirely new implementation, so while the patch looks reasonable enough, I think it might be best to generally leave the current regression-prone code intact. - Jukka.
Re: BIOS/ACPI interrupt conflict
On Wed, Feb 09, 2011 at 10:54:08PM -0800, Brian Buhrow wrote: I note that at the time, I received strong objections to my patch on the grounds that it didn't account for bioses which didn't setup the interrupts and reported that they had. That's true, but in my patch, you had to build a custom kernel and add the option ACPI_BELIEVE_BIOS to turn it on. In general, and in my opinion, we definitely do not want such tunable options, especially for something as essential as this. I have already cleaned most of these options from the acpi(4) stack, and in the long-run the remaining ones should be removed as well. - Jukka.
Re: Capsicum: practical capabilities for UNIX
On Mon, Oct 25, 2010 at 07:28:56PM -0500, David Young wrote: The chief difference I see between a process limited by Capsicum and a process limited by Systrace is that the Capsicum-limited process has only the privileges that the parent process grants it, while the Systrace-limited process has a system-call firewall applied. It's easier with the Capsicum-limited process than with the Systrace-limited process to reason about what the process can do, and to adjust the process privileges, because it's easier to name and count capabilities than to read, interpret, and re-write systrace rules. Does this mean that every program that wants to use Capsicum needs to be patched to use Capsicum? This is the main problem I have with MACs and related frameworks; to gain full advantage from these, you need the resources of Red Hat. Are we going to patch third-party software to use Capsicum? Who knows what should be allowed or disallowed in a monster like Firefox? Apache? X.org? Bind? Who would be maintaining these patches? - Jukka.
Re: acpivga(4) v. MI display controls
On Sat, Oct 16, 2010 at 05:45:51PM -0500, David Young wrote: Another thing is the actual device tree. For instance, currently, even with the fine work done with pmf(9), in some corner cases we may power off a device before its children are turned off because the device tree is partially arbitrary. What devices do you have in mind? The canonical example is perhaps the LPC bridge. This is also the case brought up by Quentin in an earlier revision of this discussion. The following takes a very specific point of view to demonstrate the issue. Now raise the abstraction so that we do not talk about any specific chip. The so-called power resource, if it exists, is shared by all devices under the bridge. The concept of a power resource may itself be just a bad abstraction used in the ACPI code, but there are no guarantees that manipulating it won't turn off the chip (or stop processing in the chip or whatever this may mean). (Actually, I have seen several systems where turning power resources on/off actually turns hardware on/off.) The power resource code implements several sanity checks, namely (a) a parent can not be turned on/off if its children are not on/off and (b) reference counting prevents turning anything off if something else is using the power resource. Neither (a) nor (b) really works in NetBSD due to the reasons mentioned. Because the ACPI tree is not synchronized with the real tree, none of the devices under the bridge claim the power resource when they attach. But the real trick is that the firmware may turn a power resource off for instance when we enter a sleep state. Upon resume, we need to turn it back on, but we can not do it blindly. Another question is whether we have sufficient abstractions for device power state in the real tree. 
For example, most of the devices are incorrectly attached (to acpi0) here:

LPC  [06] [ ] (PCI) @ 0x00:0x00:0x1F:0x00 ichlpcib0
SIO  [06] [ ]
PIC  [06] [ ]
TIMR [06] [ ] attimer1
HPET [06] [ ] hpet0
DMAC [06] [ ]
SPKR [06] [ ] pcppi1
FPU  [06] [ ] npx1
RTC  [06] [ ]
KBD  [06] [ ] pckbc1
MOU  [06] [ ] pckbc2
DURT [06] [ W]
DLPT [06] [ ]
DECP [06] [ ]
FIR  [06] [ ]
TPM  [06] [ ]
EC   [06] [ ] acpiec0
PUBS [11] [ ]
BAT0 [06] [ ] acpibat0
BAT1 [06] [ ]
BAT2 [06] [ ]
AC   [06] [ ] acpiacad0
HKEY [06] [ ] thinkpad0

The above example also reveals the devices (in this machine) that reference the ACPI embedded controller's operation regions. Thus, the three children should be attached under acpiec(4), or more conservatively, these should at least never be attached before acpiec(4). Hope the above made some sense, Jukka.
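The two sanity checks named in this message, (a) parent/child ordering and (b) reference counting on a shared power resource, can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the acpi(4) power resource code: the structures and function names are hypothetical, and check (a) is read here as "a child can only be powered on while its parent is on, and a parent can only be powered off once all of its children are off".

```c
#include <stdbool.h>
#include <stddef.h>

/* (b) A power resource shared by several devices, reference counted. */
struct power_res {
	unsigned int refs;	/* devices currently needing the resource */
	bool on;
};

/* Take a reference; the first reference turns the resource on. */
void
power_res_ref(struct power_res *pr)
{
	if (pr->refs++ == 0)
		pr->on = true;
}

/*
 * Drop a reference. Returns true only if this call actually turned the
 * resource off, that is, if nobody else is using it any more.
 */
bool
power_res_unref(struct power_res *pr)
{
	if (pr->refs == 0)
		return false;	/* unbalanced call; nothing to do */
	if (--pr->refs > 0)
		return false;	/* still in use by another device */
	pr->on = false;
	return true;
}

/* (a) A device node that knows its parent and its powered-on children. */
struct dev {
	struct dev *parent;
	unsigned int children_on;
	bool on;
};

/* Power a device on; refused while the parent is off. */
bool
dev_power_on(struct dev *d)
{
	if (d->on)
		return true;
	if (d->parent != NULL && !d->parent->on)
		return false;
	d->on = true;
	if (d->parent != NULL)
		d->parent->children_on++;
	return true;
}

/* Power a device off; refused while any child is still on. */
bool
dev_power_off(struct dev *d)
{
	if (!d->on)
		return true;
	if (d->children_on > 0)
		return false;
	d->on = false;
	if (d->parent != NULL)
		d->parent->children_on--;
	return true;
}
```

Both checks depend on every device under the bridge registering itself, which is the point made above: if the devices never claim the resource when they attach, the reference count stays at zero and neither check protects anything.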
Re: acpivga(4) v. MI display controls
On Fri, Oct 15, 2010 at 07:53:53PM -0500, David Young wrote: OK, what this code is doing is essentially attach a device to the acpi tree that really refers to a PCI device. Can we please get this to attach as child of vga0 by checking for a device matching the PCI address of vga0, that also provides _DOD and _DOS. This would prevent accessing vga0 on resume before it has been reset. Joerg calls attention in that last sentence to the possibility of defects in suspend/resume that arise when a device is represented twice in the device tree. Sounds familiar. :-) The above scheme is easily achieved if we start dropping #ifdefs to the device tree. (Hopefully everyone can agree that this is out of the question.) As I wrote, if we start to implement hacks specific to one acpi(4) driver, we end up with a big mess. It is much better to have the whole acpi(4) uniformly at 'acpinodebus' even with the risks involved, so that once we have a solution, everything can be transformed in a single sweep. You do realize that our suspend/resume paths are full of defects due to the reasons I outlined? For instance, because drivers do not inform the firmware upon suspend(), we have several cases where devices resume in a powered-off state (cf. PR #37891). Complaining about a single driver prevents one from seeing the forest. ISTM that more than one developer can, and has, described in a broad outline how it should be done. For example, I can outline how device_register() can be used to put ACPI information into MI device properties for device-attachment hooks to read back out. I'm happy to give more detailed suggestions, too. I think everyone groks this. Opening up an editor and doing the work is another thing. I emphasize that this is not entirely about autoconfiguration. I'm not sure I understand what you mean by the 'natural' device tree. I think you may have drawn a line between virtual and real device hierarchies and assigned ACPI to a different category than I would. 
Again, I'm not sure I've taken your meaning right. By natural I refer to the discussion on this list about (semi-random) thoughts on device tree structure (and the several inconsistencies in it). See appendix. It's just occurred to me that it may help to form a group to discuss how BIOS information should be encoded and conveyed from MD code to MI drivers in NetBSD. By setting standards, we may help developers on every port leverage others' knowledge and work. What do you think? Sounds good, albeit talk tends to be cheap. I take the above quote to clear some misunderstandings: (a) This is not about passing something from MD to MI -- it goes to the other direction also. (b) This is not only about passing information, but applies to controls (callbacks, etc.) also. (c) This is not only about autoconfiguration, but (a) and (b) are present dynamically at runtime. When a driver writes to a register, it may need to inform the firmware. When the firmware writes to a register, it may need to inform the driver. - Jukka. Appendix: the natural device tree on a ThinkPad. 
\    [06] [ ]
CPU0 [12] [ ]
CPU1 [12] [ ]
_SB  [06] [ ]
LNKA [06] [ ]
LNKB [06] [ ]
LNKC [06] [ ]
LNKD [06] [ ]
LNKE [06] [ ]
LNKF [06] [ ]
LNKG [06] [ ]
LNKH [06] [ ]
MEM  [06] [ ]
LID  [06] [ W] acpilid0
SLPB [06] [ W] acpibut0
PCI0 [06] [ ] (PCI) @ 0x00:0x00:0x00:0x00 [R] [B] - 0x00 pchb0
LPC  [06] [ ] (PCI) @ 0x00:0x00:0x1F:0x00 ichlpcib0
SIO  [06] [ ]
PIC  [06] [ ]
TIMR [06] [ ] attimer1
HPET [06] [ ] hpet0
DMAC [06] [ ]
SPKR [06] [ ] pcppi1
FPU  [06] [ ] npx1
RTC  [06] [ ]
KBD  [06] [ ] pckbc1
MOU  [06] [ ] pckbc2
DURT [06] [ W]
DLPT [06] [ ]
DECP [06] [ ]
FIR  [06] [ ]
TPM  [06] [ ]
EC   [06] [ ] acpiec0
PUBS [11] [ ]
BAT0 [06] [ ] acpibat0
BAT1 [06] [ ]
BAT2 [06] [ ]
AC   [06] [ ] acpiacad0
HKEY [06] [ ] thinkpad0
VID  [06] [ ] (PCI) @ 0x00:0x00:0x02:0x00 vga1
LCD0 [06] [ ]
CRT0 [06] [ ]
AGP  [06] [ ] (PCI) @ 0x00:0x00:0x01:0x00
VID  [06] [ ]
LCD0 [06] [ ]
CRT0 [06] [ ]
EXP0 [06] [ W] (PCI) @ 0x00:0x00:0x1C:0x00 [B] - 0x01 ppb0
EXP1 [06] [ W] (PCI) @ 0x00:0x00:0x1C:0x01 [B] - 0x02 ppb1
EXP2 [06] [ W] (PCI) @
Re: acpivga(4) v. MI display controls
On Fri, Oct 15, 2010 at 08:29:57AM -0400, der Mouse wrote: ACPI may be the source of the information, but that doesn't mean it has to be how the autoconf tree is constructed. Compare and contrast with how NetBSD/sparc uses the OF (or is it OBP? I'm not sure) device tree to drive autoconf, but doesn't have a device node corresponding to OF that everything attaches under; it just uses the OF tree as the source of the data about what exists where. (Well, much of it; autoconf doesn't totally mirror OF, eg, in SCSI device attachment.) I do not know OF well, but my impression is that it is much, much less invasive than what we have nowadays on x86 where close interaction between the firmware and drivers is expected. Several people seem to be under the false impression that this is something only related to device attachment and autoconfiguration. It is not. I tried to outline this in another mail, but frankly I think whether 'X attaches to Y or Z' is just a little, largely irrelevant, detail in the face of much bigger problems. In a nutshell: ACPI BIOS may access hardware directly, with or without the consent from the system. In an entirely x86 based codebase this is hardly a problem, but in NetBSD this comes down to the question of how to maintain the clean MD/MI separation in the future. - Jukka.
Re: acpivga(4) v. MI display controls
On Fri, Oct 15, 2010 at 08:26:34AM +0300, Jukka Ruohonen wrote: The task is not trivial. On modern x86, practically *everything* that attaches has an ACPI counterpart. In a way we are thinking this backwards: the attachment should perhaps be done via ACPI that has information about the natural device tree (I recommend to boot with ACPIVERBOSE option and observe the output). This is how it is supposedly done in Windows. And consequently, *most* (MI) drivers that work on x86 need to eventually call (MD) ACPI callbacks, and vice versa. Bringing this all together in a clean (MI) implementation is hard and requires substantial changes, to say the least. As an addition, due to the reasons stated above, I object to anything that tries to make a case for a single driver from acpi(4) -- be it acpivga(4), acpicpu(4), or the ISA and PCI cases discussed previously. This should be solved once and for all, for all acpi(4) and for all pci(4), isa(4), ... Otherwise we end up with a god-awful mess. If such a solution comes to existence, we are happy to refactor acpi(4). During the ten years that ACPI has been in NetBSD, several people have tried a solution without much success. I have personally tried twice, and failed already at the self-criticism stage. - Jukka.
Re: acpivga(4) v. MI display controls
On Fri, Oct 15, 2010 at 10:10:18AM +0200, Martin Husemann wrote: On Fri, Oct 15, 2010 at 08:26:34AM +0300, Jukka Ruohonen wrote: This was discussed during the development process. Where? Already when this was first presented in 2008: http://mail-index.netbsd.org/tech-kern/2008/12/05/msg003744.html The issues noted back then are still present. - Jukka.
Re: acpivga(4) v. MI display controls
On Thu, Oct 14, 2010 at 06:50:30PM -0500, David Young wrote: Rather than attaching new nodes at acpi0, the system should let ACPI BIOS inform the autoconfiguration process, which should attach one or more instances of a new, MI device, display(4). For example:

vga0 at pci0 device ... function ...
display0 at vga0: Ext. Monitor, head 0, bios detect (ACPI CRT1)
display1 at vga0: TV, head 0, bios detect (ACPI DTV1)
display2 at vga0: Unknown Output Device, head 0, bios detect (ACPI LCD)

In this way, no single device has two representations in the device tree (think about the consequences, they're not pretty), and every device appears in the most appropriate place in the device tree for the purpose of suspending, resuming, detaching and re-attaching it. This was discussed during the development process. Sure, the above is the ideal case. Yet once again I need to remind that we can not hold back important acpi(4) work because the perfect abstraction has not arrived, and no one seems to really know how it should be done. The task is not trivial. On modern x86, practically *everything* that attaches has an ACPI counterpart. In a way we are thinking this backwards: the attachment should perhaps be done via ACPI that has information about the natural device tree (I recommend to boot with ACPIVERBOSE option and observe the output). This is how it is supposedly done in Windows. And consequently, *most* (MI) drivers that work on x86 need to eventually call (MD) ACPI callbacks, and vice versa. Bringing this all together in a clean (MI) implementation is hard and requires substantial changes, to say the least. - Jukka.
Re: Capsicum: practical capabilities for UNIX
On Sun, Sep 26, 2010 at 08:48:45PM -0400, Perry E. Metzger wrote: They did Chrome in the paper, and it required very few lines of code (under 100). They did other tests too. It appears that they've had quite a bit of success in creating a very usable API here. I'm not entirely surprised, given the nature of what they're doing. Just a little historical remark. I am a little puzzled why Watson et al. did not bother to mention Linux capabilities that have existed for a long time. The Linux API is almost identical to the one proposed in the capsicum paper. And yet, Linux capabilities are seldom used. Perhaps the general perception is that these capabilities were somehow sidetracked from the very beginning. One probable cause for this was that the vendor-independent committee that started the whole thing was unable to provide something that could have become an actual standard across UNIX platforms and their derivatives. The result was only a draft POSIX document, IEEE 1003.1e, released in 1997, which is considered a failure by many. Maybe there is something to learn from here. - Jukka.
Re: 5.1_RC3 on Dell r610 fails
On Tue, Aug 31, 2010 at 04:06:16PM +1200, Mark Davies wrote: Any suggestions on whats broke
This is again the so-called Enhanced SpeedStep (EST).
how to fix?
Disable options(4) ENHANCED_SPEEDSTEP. - Jukka.
Re: RFC: device flavours
On Sun, Jul 25, 2010 at 09:22:53PM +, Quentin Garnier wrote: bridges (mostly on x86). An even older idea of mine is to finally see legacy devices listed in the ACPI tables attached to the PCI-ISA bridge where they logically belong, and device flavours can be used for that, too. I am not sure if I understand all of this, so bear with me. While this is the direction we should go, it seems to me that the long-standing issues in ACPI-PCI-ISA are not so much where the legacy drivers actually logically attach, but that these, like the majority of drivers on modern x86, should utilize the information from ACPI. Is this possible with flavours? Will the siblings still require a stub on the ACPI side of things?

pcib0 at pci0 dev 31 function 0: vendor 0x8086 product 0x27b9 (rev. 0x02)
timecounter: Timecounter pcib0/ichlpc frequency 3579545 Hz quality 1000
pcib0/ichlpc: 24-bit timer
pcib0/ichlpc: TCO (watchdog) timer configured.
gpio5 at pcib0: 64 pins
pcib0/acpiib: ACPI node SBRG
npx1 at pcib0 (COPR, PNP0C04): io 0xf0-0xff irq 13
npx1: reported by CPUID; using exception 16
SIOR (PNP0C02) at pcib0 not configured
RMSC (PNP0C02) at pcib0 not configured
OMSC (PNP0C02) at pcib0 not configured

In the above example it is known that the LPC bridge currently conflicts with the ACPI PM registers. So to put this to the logical end, the derivation using ACPI should start from there, and the pci_mapreg_map(9) call therein should use the information supplied by ACPI. There are other situations in which I think device flavours could bring clarity and also better modularisation. For instance, support for CPU features on x86 like EST or PowerNow, or even ACPI P-states could be done that way, and it is more module-friendly because it wouldn't require the main CPU driver to explicitly call those feature-drivers. Here I can see use. I was actually seeking this kind of granularity with the ACPI CPU. - Jukka.
Re: Modules loading modules?
On Mon, Jul 26, 2010 at 06:41:11AM +1000, matthew green wrote: it seems to me the root problem is that module_mutex is held while calling into the module startup routines. Here is one related question: is it ensured that the module lock is dropped immediately after a modular device driver returns from its attachment routine? I am thinking of a case where a modular driver defers its configuration by using config_interrupts(9) or config_finalize_register(9). - Jukka.
Re: (Semi-random) thoughts on device tree structure and devfs
On Mon, Mar 08, 2010 at 10:54:13AM -0500, der Mouse wrote:

Linux had a devfs and [dropped] it. Now it has udevd(8). Most likely the penguins had a reason for this.

Surely there are mailing list messages or something that outline that reason? (Not that I have any idea where they'd be, but don't we have at least a few people with feet in both camps?)

It is more like:

Linux had a devfs and [dropped] it. Now it has udevd(8). Most likely the penguins had a reason for this. Linux had udevd(8) and reintroduced devfs. Now it has udevd(8) and some kind of devfs. Most likely the penguins had a reason for this.

- Jukka.

http://lwn.net/Articles/331818/
Re: (Semi-random) thoughts on device tree structure and devfs
On Sun, Mar 07, 2010 at 08:18:15PM +, Quentin Garnier wrote:

As an example: one thing that holds back the ACPI CPU code I am working on is that I need to be sure that e.g. cpu3 that attaches to acpi0 is the same cpu3 that has attached to mainbus0. So:

Well, the answer to that is simple: there should only be one device. Any design that doesn't produce that result can be thrown out the window without further delay.

In the above example it would be acpicpu3 at acpi0 and cpu3 at mainbus0. But as you know quite well what is involved, I am merely pointing out that the current situation holds back many possibilities. And noting that I don't have the competency to do anything about it.

- Jukka.
Re: CVS commit: src/sys/arch
On Sat, Feb 06, 2010 at 01:07:08PM -0800, Paul Goyette wrote:

If it matches a device, and there is also a native driver for the underlying i2c controller, then there'll be two devices accessing the same bus. Bad things (tm) will happen. This is noted in the BUGS section of the acpismbus(4) man page.

On a related note, a similar warning should probably be added to aiboost(4). At least on Linux it is known to cause weird problems and lockups if the iic(4) is being accessed at the same time by a native driver (it87?).

It is also a reasonable assumption that things will get worse on this front. The new ACPI 4.0 standard introduced a sensor framework of its own, and my guess is that consumer PC manufacturers will jump on the bandwagon, trying to hide these things in the abyss of ACPI.

- Jukka.
Re: regression (crash) in sysmon/acpiacad
On Sun, Feb 07, 2010 at 08:30:27AM +0100, Joerg Sonnenberger wrote:

On Sun, Feb 07, 2010 at 09:04:54AM +0200, Jukka Ruohonen wrote:

* The following sensors should be removed: technology, low capacity, and warning capacity. These are not really something that should be sensed.

Technology ok. I'm not too sure about low and warning, given that they normally can't be modified.

The idea here would be to use the sme_get_limits() and possibly sme_set_limits(). This is exactly the rationale behind those callbacks. This would also result in nicer output in envstat(8).

* The design capacity should be the maximum of the last known full charge capacity, which is the maximum of the present capacity. This is useful for checking the overall health of deteriorating (lithium-ion) batteries.

I disagree. Both batteries for my laptop had initially a higher capacity than designed for -- e.g. last full and design cap don't necessarily agree with each other.

I noticed the same thing with voltages. Yet, what is wrong with envstat(8) or some other tool reporting that the last full charge capacity is 123 % of the design capacity?

* Sensors that have a maximum should report also percentages in relation to these maximums. From the usability point of view, this is probably almost always the right choice.

That should be a task for userland, not the kernel.

It already is; in acpibat(4) this just implies setting the ENVSYS_FPERCENT flag, nothing more.

- Jukka.
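The "123 % of the design capacity" figure discussed in this message is plain arithmetic on two capacity readings. A minimal sketch in C, assuming both values are reported in the same unit (for example mWh); `battery_health_percent` is a hypothetical helper for illustration and is not part of acpibat(4), envsys, or envstat(8):

```c
#include <stdint.h>

/*
 * Hypothetical helper (not an existing NetBSD interface): express the
 * last full charge capacity as a percentage of the design capacity.
 * Both arguments must use the same unit. Returns -1 when the design
 * capacity is unknown (zero). The result may exceed 100, since a
 * battery can initially hold more than its design capacity.
 */
static int64_t
battery_health_percent(uint32_t last_full_cap, uint32_t design_cap)
{
	if (design_cap == 0)
		return -1;

	/* Widen to 64 bits before multiplying; the division truncates. */
	return (int64_t)(((uint64_t)last_full_cap * 100) / design_cap);
}
```

With a last full charge of 6150 and a design capacity of 5000, the helper yields 123, which is the kind of value a userland tool could print without the kernel treating it as an error.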
Re: regression (crash) in sysmon/acpiacad
On Thu, Feb 04, 2010 at 10:15:03PM +0100, Matthias Drochner wrote:

p...@whooppee.com said:

Since the charge value was not updating, it might be that the ACPI Notify isn't working here.

Since this involved running on battery power, I doubt it is about the removal of the refresh routine in acpiacad(4). If the sensor value changes when one plugs/unplugs the AC, it is easily verified to be working.

For the critical shutdown, a call to _BTP might help.

The _BTP is just a custom warning trip-point that triggers a Notify once reached. It is probably there to provide user space applications some control over the limits, and to possibly avoid polling of the values. Note though that nothing has changed in acpibat(4) with regard to the refresh routine or the sensors generally.

But anyway, from my limited experience with process control (SCADA) systems, it makes sense to maintain a timestamp for the last data value read (or delivered by asynchronous notification) and force a fresh read if it is older than a limit defined by the provider (and possibly overridden by the consumer).

Something like this is already done in acpibat(4).

- Jukka.
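The timestamp scheme described in this message (keep the time of the last read or notification, and force a fresh read once the value is older than a provider limit that the consumer may override) can be sketched as follows. This is an illustration under stated assumptions, not the acpibat(4) code: `struct cached_sensor`, `sensor_notify`, `sensor_get`, and `counting_read` are all hypothetical names, and the current time is passed in explicitly so the logic is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical cache around one sensor value; not a NetBSD structure. */
struct cached_sensor {
	int32_t	value;			/* last value read or notified */
	time_t	stamp;			/* when that value was obtained */
	bool	valid;			/* false until the first value arrives */
	time_t	max_age;		/* provider-defined limit, in seconds */
	int32_t	(*read_hw)(void *);	/* stand-in for the real hardware read */
	void	*cookie;		/* argument passed to read_hw */
};

/* Record a value delivered by an asynchronous notification. */
static void
sensor_notify(struct cached_sensor *s, int32_t value, time_t now)
{
	s->value = value;
	s->stamp = now;
	s->valid = true;
}

/*
 * Return the cached value, forcing a fresh read if it is older than the
 * limit. A positive consumer_max_age overrides the provider's limit.
 */
static int32_t
sensor_get(struct cached_sensor *s, time_t now, time_t consumer_max_age)
{
	time_t limit = consumer_max_age > 0 ? consumer_max_age : s->max_age;

	if (!s->valid || now - s->stamp > limit)
		sensor_notify(s, s->read_hw(s->cookie), now);

	return s->value;
}

/* Stand-in reader for demonstration: returns 100, 101, 102, ... */
static int32_t
counting_read(void *cookie)
{
	int *calls = cookie;

	return 100 + (*calls)++;
}
```

A consumer that calls `sensor_get()` twice within the limit sees the same value and triggers only one hardware read; a notification refreshes the timestamp, so polling is avoided for as long as notifications keep arriving.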