Re: [RFC/RFT] calloutng
The patch has been splitted in smaller logical chunks in order to improve readability and facilitate review. The whole code can be found here: http://people.freebsd.org/~davide/calloutng_split/ In particular: http://people.freebsd.org/~davide/calloutng_split/sbintime.diff brings the new type 32.32 fixed point used in lieu of 'int ticks' for the new callout mechanism and relative functions (conversion from to {timeval,bintime,timespec}, sbinuptime(), getsbinuptime()) This change is preliminary to http://people.freebsd.org/~davide/calloutng_split/callout_internals.diff which contains all the callout changes internally (introduction of direct dispatching, conversion to sbintime, new precision mechanism) http://people.freebsd.org/~davide/calloutng_split/cpuidle.diff (see r242905) introduce some optimization in the sleep time calculation and choose of the optimal sleep state, when CPU is idle. http://people.freebsd.org/~mav/calloutng-20130221/libprocstatfix.diff is a workaround in libprocstat in order to allow world to build after changes. About this one, I'm not quite sure is the best solution, so if there are other opinions, I'd be more than happy to hear. Other patches available in the directory converts some kernel consumers of callout(9) to the new interface (nanosleep, poll, select, kqueue, syscons, etc...). My plan is to commit the whole patchset March 4th, unless there are valid reasons to not do so. Any cooment is (as always) appreciated. Thanks, Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Thu, Feb 14, 2013 at 11:56:54PM -0800, Alfred Perlstein wrote: [ added Luigi Rizzo to thread ] On 2/11/13 12:26 PM, Davide Italiano wrote: [trimmed old mails] Here's a new version of the patch: http://people.freebsd.org/~davide/patches/calloutng-11022012.diff Significant bits changed (after wider discussion and suggestion by phk@): - Introduction of the new sbintime_t type (32.32 fixed point) with the respective conversion (sbintime2bintime, sbintime2timeval etc...) - switch from 64.64 (struct bintime) format to measure time to sbintime_t - Use sbintime_t to represent expected sleep time instead of measuring it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...). Luigi and I discussed this at BAFUG tonight and he expressed an interest in shepherding the code in at this point. Luigi, can you reiterate any points that still remain before we integrate this code? I am mostly ok with the patch in the current state. I have few comments/suggestions below but they are mostly on documenting/style/trivial bugfixes. So now I would really like to see this code go in HEAD, to give people a chance to see how it works and possibly figure out whether the new API needs modifications. To recall, my major concerns were the following: - API explosion, with multiple versions of the callout routines. This seems to be solved now, with only one version (the *_sbt() functions) and macros remapping the old functions to the new ones. - the use of struct bintime for timeouts and precision. This is also solved thanks to the introduction of sbintime_t values (fixed-point 32.32 times) - Some measurements also indicated that the default precision could give large (absolute) errors on the timeouts, especially with large timeouts. I am not sure what is the situation with this new version, but i believe this a relatively minor issue because it surely simple to customize, e.g. having a couple of sysctl setting the default precision (percentage) and maximum error (absolute time). So users could e.g. set a 5% precision and a maximum error of 100us on a desktop, and 10% and 10ms on a laptop. - Another thing that we should do (but after the patch is in) is to see if any existing code is converting times to ticks just to call the timeout routines -- right now the macros convert back to times, clearly it would be best to avoid the double conversion. comments inline below, search for XXX-lr thanks again for your work on this, and for following up with changes after the previous discussion cheers luigi Index: sys/dev/acpica/acpi_cpu.c === --- sys/dev/acpica/acpi_cpu.c (.../head) (revision 246685) +++ sys/dev/acpica/acpi_cpu.c (.../projects/calloutng)(revision 246685) @@ -980,13 +980,16 @@ } /* Find the lowest state that has small enough latency. */ +us = sc-cpu_prev_sleep; +if (sbt = 0 us sbt / SBT_1US)XXX-lr can we have us2sbt() , ms2sbt() and the like ? + us = sbt / SBT_1US; cx_next_idx = 0; if (cpu_disable_deep_sleep) i = min(sc-cpu_cx_lowest, sc-cpu_non_c3); else i = sc-cpu_cx_lowest; for (; i = 0; i--) { - if (sc-cpu_cx_states[i].trans_lat * 3 = sc-cpu_prev_sleep) { + if (sc-cpu_cx_states[i].trans_lat * 3 = us) { cx_next_idx = i; break; } Index: sys/kern/kern_synch.c === --- sys/kern/kern_synch.c (.../head) (revision 246685) +++ sys/kern/kern_synch.c (.../projects/calloutng)(revision 246685) @@ -146,12 +146,12 @@ */ int _sleep(void *ident, struct lock_object *lock, int priority, -const char *wmesg, int timo) +const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags) { struct thread *td; struct proc *p; struct lock_class *class; - int catch, flags, lock_state, pri, rval; + int catch, sleepq_flags, lock_state, pri, rval; XXX-lr keep flags, use _sleep_flags as function argument ? WITNESS_SAVE_DECL(lock_witness); td = curthread; @@ -348,28 +349,30 @@ * to a timo value of one. */ int -pause(const char *wmesg, int timo) +pause_sbt(const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags) { - KASSERT(timo = 0, (pause: timo must be = 0)); + int sbt_sec;XXX-lr localize to the cold block + sbt_sec = sbintime_getsec(sbt); + KASSERT(sbt_sec = 0, (pause: timo must be = 0)); + /* silently convert invalid timeouts */ - if (timo 1) - timo = 1; + if (sbt == 0) + sbt = tick_sbt; if (cold) { /* -* We delay one HZ at a time to avoid overflowing the +* We delay one second at a time to avoid overflowing the *
Re: [RFC/RFT] calloutng
[ added Luigi Rizzo to thread ] On 2/11/13 12:26 PM, Davide Italiano wrote: [trimmed old mails] Here's a new version of the patch: http://people.freebsd.org/~davide/patches/calloutng-11022012.diff Significant bits changed (after wider discussion and suggestion by phk@): - Introduction of the new sbintime_t type (32.32 fixed point) with the respective conversion (sbintime2bintime, sbintime2timeval etc...) - switch from 64.64 (struct bintime) format to measure time to sbintime_t - Use sbintime_t to represent expected sleep time instead of measuring it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...). Luigi and I discussed this at BAFUG tonight and he expressed an interest in shepherding the code in at this point. Luigi, can you reiterate any points that still remain before we integrate this code? thank you, -Alfred ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
[trimmed old mails] Here's a new version of the patch: http://people.freebsd.org/~davide/patches/calloutng-11022012.diff Significant bits changed (after wider discussion and suggestion by phk@): - Introduction of the new sbintime_t type (32.32 fixed point) with the respective conversion (sbintime2bintime, sbintime2timeval etc...) - switch from 64.64 (struct bintime) format to measure time to sbintime_t - Use sbintime_t to represent expected sleep time instead of measuring it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...). Thanks, Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Sun, Jan 13, 2013 at 09:36:11PM +0200, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: On 06.01.2013 17:23, Marius Strobl wrote: I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using spinlock, but in waiting for other CPU while spinlock is held. Other CPU may also hold spinlock and wait for something, causing deadlock. i8254 code uses spinlock just to atomically access hardware registers, so it causes no problems. Okay, but wouldn't that be a general problem then? Pretty much anything triggering an IPI holds smp_ipi_mtx while doing so and the lower level IPI stuff waits for other CPU(s), including on x86. The problem is general. But now it works because single smp_ipi_mtx is used in all cases where IPI result is waited. As soon as spinning happens with interrupts still enabled, there is no deadlocks. But problem reappears if any different lock is used, or locks are nested. I'm having a hard time getting an alternate time counter device to work. The crystal required for the counters in the south bridge just doesn't seem to be mounted any where near it (I've not looked at the bottom of the PCB though). While the time counter part of the on- board bge(4) driven chips basically work, they don't seem to like concurrent accesses caused by the rest of bge(4). I.e. although the counter is just read, sooner or later this causes a fatal bus error. I haven't tried serializing accesses to the chip, but getting to such a complexity for just reading a non-indexed register at least doesn't feel good ... However, AFAICT the scenario you describe can't happen. On sparc64, spinlock_enter() only raises the processor interrupt level, which doesn't block the direct cross traps I've implemented remote reading of (S)TICK as (which also means that the actions such traps may perform are very limitted and must occur in interrupt context, but which are sufficient for this purpose and in turn makes them very fast). I.e. although the AP holds smp_ipi_mtx or any amount of nested spin locks, this will not deadlock in case the BSP also holds any spin lock when reading (S)TICK from it. Marius ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Thu, 17 Jan 2013, Ian Lepore wrote: On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote: Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other lock-free accesses to the timecounter state. This still works in -current (at least there is still code for it). Unfortunately, calling pps_capture() in the primary interrupt context is no longer an option with the stock pps driver. Ever since the ppbus rewrite all ppbus children must use threaded handlers. I tried to fix that a couple different ways, and both ended up with crazy-complex code Hmm, I didn't notice that ppc supported pps (I try not to look at it since it is ugly :-), and don't know of any version of it that uses non-threaded handlers (except in FreeBSD-4 before, where normal interrupt handlers were non-threaded, so ppc had their high latency but not the even higher latency and overheads of threaded handlers). OTOH, my x86 RTC interrupt handler is threaded and supports pps, and I haven't noticed any latency problems with this. It just can't possibly give the ~1 usec jitter that FreeBSD-[3-4] could give ~15 years ago using a fast interrupt handler (there must be only 1 device using a fast interrupt handler, with this dedicated to pps, else the multiple fast interrupt handlers will give latency much larger than ~1 usec to each other. I don't actually use this for anything except testing whether the RTC can be used for a poor man's pps. scattered around the ppbus family just to support the rarely-used pps capture. It would have been easier to do if filter and threaded interrupt handlers had the same function signature. I ended up writting a separate driver that can be used instead of ppc + ppbus + pps, since anyone who cares about precise pps capture is unlikely to be sharing the port with a printer or plip device or some such. Probably all pps handlers should be special. On x86 with reasonable timecounter hardware, say a TSC, it takes about 10 instructions for an entire pps interrupt handler: XintrN: pushl %eax pushl %edx rdtsc # Need some ugliness for EIO here or later. ss:movl %eax,ppscap # Hopefully lock-free via time-domain locking. ss:movl %edx,ppscap+4 popl%edx popl%eax iret After capturing the timecounter hardware value here, you convert it to a pps event at leisure. But since this only happens once per second, it wouldn't be very inefficient to turn the interrupt handler into a slow high-latency one, even a threaded one, to handle the pps event and/or other devices attached to the interrupt. OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to aquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. In my rewritten dedicated pps driver I call pps_capture() from the filter handler and pps_event() from the threaded handler. I never found That seems right. any good documentation on the low-level details of this stuff, and there isn't enough good example code to work from. My hazy memory is that I THere seem to be no good examples. ended up studying the pps_capture() and pps_event() code enough to infer that their design intent seems to be to allow you to capture with no locking and do the event processing later in some sort of deferred or threaded context. That seems to be the design, but there are no examples of separating the event from the capture. I think the correct locking is: - capture in a fast interrupt handler, into a per-device state that is locked by whatever locks all of the state accessed by the fast interrupt handler - switch to a less critical context later: - lock this step
Re: [RFC/RFT] calloutng
On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote: On Sun, 13 Jan 2013, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: [...] In existing code in HEAD and 9 timecounters are never called with spin mutex held. I intentionally tried to avoid that in existing eventtimers code. Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other lock-free accesses to the timecounter state. This still works in -current (at least there is still code for it). Unfortunately, calling pps_capture() in the primary interrupt context is no longer an option with the stock pps driver. Ever since the ppbus rewrite all ppbus children must use threaded handlers. I tried to fix that a couple different ways, and both ended up with crazy-complex code scattered around the ppbus family just to support the rarely-used pps capture. It would have been easier to do if filter and threaded interrupt handlers had the same function signature. I ended up writting a separate driver that can be used instead of ppc + ppbus + pps, since anyone who cares about precise pps capture is unlikely to be sharing the port with a printer or plip device or some such. OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to aquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. In my rewritten dedicated pps driver I call pps_capture() from the filter handler and pps_event() from the threaded handler. I never found any good documentation on the low-level details of this stuff, and there isn't enough good example code to work from. My hazy memory is that I ended up studying the pps_capture() and pps_event() code enough to infer that their design intent seems to be to allow you to capture with no locking and do the event processing later in some sort of deferred or threaded context. Callout code same time can be called in any environment with any locks held. And new callout code may need to know precise current time in any of those conditions. Attempt to use an IPI and wait there can be fatal. Callout code can't be called from such a general any environment as timecounter code. Not from a fast interrupt handler. Not from an NMI or IPI handler. I hope. But timecounter code has a good chance of working even for the last 2 environments, due to its design requirement of working in the first. The spinlock in the i8254 timecounter certainly breaks some cases. For example, suppose the lock is held for a timecounter read from normal context. It masks hardware interrupts on the current CPU (except in my version). It doesn't mask NMIs or other traps. So if the NMI or other trap handler does a timecounter hardware call, there is deadlock in at least the !SMP case. In my version, it blocks normal interrupts later if they occur, but doesn't block fast interrupts, so the pps_capture() call would deadlock if it occurs, like a timecounter call from an NMI. I avoid this by not using pps in any fast interrupt handler, and by only using the i8254 timecounter for testing. I do use pps in a (nonstandard) x86 RTC clock interrupt handler. My clock interrupt handlers are all non-fast to avoid this and other locking problems. Hrm, now you've got me a bit worried about capturing in the primary context. Not that I have much option, on a 300mhz Geode and similar wimpy embedded processors there's enough latency on a theaded handler that the pps signal can be de-asserted by time the handler runs (precision timing gear often outputs a very narrow pps pulse, 1 - 10uS isn't uncommon). I know I don't have
Re: [RFC/RFT] calloutng
On Sun, 2012-12-30 at 16:13 -0700, Ian Lepore wrote: On Wed, 2012-12-26 at 21:24 +0200, Alexander Motin wrote: [...] I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. -- Ian plain text document attachment (calloutng_test.txt) for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done With calloutng_12_26.patch... HZ=100 HZ=250 HZ=1000 -- select 1 55.79 1 50.96 1 61.32 poll1 1109.46 1 1107.86 1 1114.38 usleep 1 56.33 1 72.90 1 62.78 nanosleep 1 52.66 1 55.23 1 64.23 kqueue 1 1114.23 1 1113.81 1 1121.21 kqueueto1 65.44 1 71.00 1 75.01 syscall 1 4.70 1 4.45 1 4.55 select300355.79300357.76300362.35 poll 300 1107.85300 1122.55300 1115.62 usleep300355.28300357.28300360.79 nanosleep 300354.49300355.82300360.62 kqueue300 1112.57300 1118.13300 1117.16 kqueueto 300375.98300378.62300395.61 syscall 300 4.41300 4.45300 4.54 select 3000 3246.75 3000 3246.74 3000 3252.72 poll 3000 3238.10 3000 3229.12 3000 3250.10 usleep 3000 3242.47 3000 3237.06 3000 3249.61 nanosleep3000 3238.79 3000 3231.55 3000 3248.11 kqueue 3000 3240.01 3000 3236.07 3000 3247.60 kqueueto 3000 3265.36 3000 3267.22 3000 3274.96 syscall 3000 4.69 3000 4.44 3000 4.50 select 3 31714.60 3 31941.17 3 32467.69 poll3 31522.76 3 31983.00 3 32497.81 usleep 3 31459.67 3 31980.76 3 32458.71 nanosleep 3 31431.02 3 31982.22 3 32525.20 kqueue 3 31466.75 3 31873.90 3 31973.54 kqueueto3 31564.67 3 32522.35 3 32475.59 syscall 3 4.70 3 4.73 3 4.89 select 30 319133.02 30 311562.33 30 309918.62 poll 30 319604.27 30 311422.94 30 31.76 usleep 30 319314.60 30 311269.69 30 309996.34 nanosleep 30 319497.58 30 311425.40 30 309997.13 kqueue 30 309995.55 30 303980.27 30 309908.82 kqueueto 30 319505.88 30 311424.97 30 309996.16 syscall30 4.41 30 4.45 30 4.89 With no patches... HZ=100 HZ=250 HZ=1000 -- select 1 19941.70 1 7989.10 1 1999.16 poll1 19904.61 1 7987.32 1 1999.78 usleep 1 19904.95 1 7993.30 1 1999.96 nanosleep 1 19905.64 1 7993.71 1 1999.72 kqueue 1 10001.61 1 4004.00 1 1000.27 kqueueto1 19904.00 1 7993.03 1 1999.54 syscall 1 4.04 1 4.05 1 4.75 select300 19904.66300 7998.39300 2000.27 poll 300 19904.35300 7993.47300 1999.86 usleep300 19903.96300 7994.11300 1999.81 nanosleep 300 19904.48300 7993.77300 1999.80 kqueue300 10001.68300 4004.18300 1000.31 kqueueto 300 19997.86300 7993.37300 1999.59 syscall 300 4.01300 4.00300 4.32 select 3000 19904.80 3000 7998.85 3000 3998.43 poll 3000 19904.92 3000 8005.93 3000 3999.39 usleep 3000 19904.50 3000 7992.88 3000 3999.44 nanosleep3000 19904.84 3000
Re: [RFC/RFT] calloutng
On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: On 06.01.2013 17:23, Marius Strobl wrote: On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote: On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Well, I've just learnt two things the hard way: a) We really need the mutex in that path. b) Assuming that the initial synchronization of the counters is good enough and they won't drift considerably accross the CPUs so we can always use the local one makes things go south pretty soon after boot. At least with your calloutng_12_26.patch applied. Do you think it means they are not really synchronized for some reason? There's definitely no hardware in place which would synchronize them. I've no idea how to properly measure the difference between two tick counters, but I think it's rarther their drift and not the software synchronization we do when starting APs that is causing problems. Mainly, because I can't really think of a better algorithm for doing the latter when startiing the APs. The symptoms are that bout 30 to 60 seconds after that I start to see weird timeouts from device drivers. I'd need to check how long these timeouts actually are to see whether it could be a problem right from the start though. In any case, it seems foolish to think that synchronizing them once would be sufficient and they won't drift until shutdown. Linux probably also doesn't keep re-synchronize them without a reason. Just using a single timecounter source simply appears to be the better choice. I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using spinlock, but in waiting for other CPU while spinlock is held. Other CPU may also hold spinlock and wait for something, causing deadlock. i8254 code uses spinlock just to atomically access hardware registers, so it causes no problems. Okay, but wouldn't that be a general problem then? Pretty much anything triggering an IPI holds smp_ipi_mtx while doing so and the lower level IPI stuff waits for other CPU(s), including on x86. The affected machines are equipped with a x86-style south bridge which exposes a powermanagment unit (intended to be used as a SMBus bridge only in these machines) on the PCI bus. Actually, this device also includes an ACPI power management timer. However, I've just spent a day trying to get that one working without success - it just doesn't increment. Probably its clock input isn't connected as it's
Re: [RFC/RFT] calloutng
Ok, So everyone - what _could_ be brought into -HEAD right now, without any actual change in code behaviour? eg, what kind of refactoring could be done to reduce the amount of diffs between the branch and -HEAD? Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 13.01.2013 20:09, Marius Strobl wrote: On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: On 06.01.2013 17:23, Marius Strobl wrote: I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using spinlock, but in waiting for other CPU while spinlock is held. Other CPU may also hold spinlock and wait for something, causing deadlock. i8254 code uses spinlock just to atomically access hardware registers, so it causes no problems. Okay, but wouldn't that be a general problem then? Pretty much anything triggering an IPI holds smp_ipi_mtx while doing so and the lower level IPI stuff waits for other CPU(s), including on x86. The problem is general. But now it works because single smp_ipi_mtx is used in all cases where IPI result is waited. As soon as spinning happens with interrupts still enabled, there is no deadlocks. But problem reappears if any different lock is used, or locks are nested. In existing code in HEAD and 9 timecounters are never called with spin mutex held. I intentionally tried to avoid that in existing eventtimers code. Callout code same time can be called in any environment with any locks held. And new callout code may need to know precise current time in any of those conditions. Attempt to use an IPI and wait there can be fatal. The affected machines are equipped with a x86-style south bridge which exposes a powermanagment unit (intended to be used as a SMBus bridge only in these machines) on the PCI bus. Actually, this device also includes an ACPI power management timer. However, I've just spent a day trying to get that one working without success - it just doesn't increment. Probably its clock input isn't connected as it's not intended to be used in these machines. That south bridge also includes 8254 compatible timers on the ISA/ LPC side, but are hidden from the OFW device tree. I can hack these devices into existence and give it a try, but even if that works this likely would use the same code as the x86 i8254_get_timecount() so I don't see what would be gained with that. The last thing in order to avoid using the tick counter as timecounter in the MP case I can think of is that the Broadcom MACs in the affected machines also provide a counter driven by a 1 MHz clock. If that's good enough for a timecounter I can hook these up (in case these work ...) and hack bge(4) to not detach in that case (given that we can't detach timecounters ...). i8254 on x86 is also just a bit above 1MHz. Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I've run Ian's set of tests on a v215 with and without your calloutng_12_26.patch and on a v210 (these uses the IPI approach) with the latter also applied. I'm not really sure what to make out of the numbers. v215 w/o v215 w/ v210 w/ -- select 1 1999.61 1 23.87 1 29.97 poll1 1999.70 1 1069.61 1 1075.24 usleep 1 1999.86 1 23.43 1 28.99 nanosleep 1999.92 1 23.28 1 28.66 kqueue 1 1000.12 1 1071.13 1 1076.35 kqueueto1999.56 1 26.33 1 31.34 syscall 1 1.89 1 1.92 1 2.88 FYI, these are the results of the v215 (btw., these (ab)use a bus cycle counter of the host-PCI-bridge as timecounter) with your calloutng_12_17.patch and kern.timecounter.alloweddeviation=0: select 1 23.82 poll 1 1008.23 usleep 1 23.31 nanosleep 1 23.17 kqueue 1 1010.35 kqueueto 1 26.26 syscall1 1.91 select 300307.72 poll 300 1008.23 usleep 300307.64 nanosleep300 23.21 kqueue 300 1010.49 kqueueto 300 26.27 syscall 300 1.92 select 3000 3009.95 poll3000 3013.33 usleep 3000 3013.56 nanosleep 3000 23.17 kqueue 3000 3011.09 kqueueto3000 26.24 syscall 3000 1.91 select 3 30013.51 poll 3 30010.63 usleep 3 30010.64 nanosleep 3 36.91 kqueue 3 30012.38 kqueueto 3 39.90 syscall3 1.90 select30 300017.52 poll 30 300013.00 usleep30 300012.64 nanosleep 30307.59 kqueue30 300017.07 kqueueto 30310.24
Re: [RFC/RFT] calloutng
On Sun, 2013-01-13 at 21:36 +0200, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: [...] Uhm, there are no NMIs on sparc64. Does it make sense to bypass this adjustment on sparc64? If it is not possible or not good to to stop timer during programming, there will always be some race window between code execution and timer ticking. So some minimal safety range should be reserved. Though it probably can be significantly reduced. In case of x86/HPET there is additional factor of NMI, that extends race to unpredictable level and so makes additional post-read almost mandatory. May be with rereading counter after programming comparator (same as done for HPET, reading which is probably much more expensive) this value could be reduced. I see. There are some bigger fish to fry at the moment though :) Speaking of the HPET code, it seems to me that its restart logic can fire the same event twice. Is that harmless? -- Ian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 14.01.2013 01:10, Ian Lepore wrote: On Sun, 2013-01-13 at 21:36 +0200, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: [...] Uhm, there are no NMIs on sparc64. Does it make sense to bypass this adjustment on sparc64? If it is not possible or not good to to stop timer during programming, there will always be some race window between code execution and timer ticking. So some minimal safety range should be reserved. Though it probably can be significantly reduced. In case of x86/HPET there is additional factor of NMI, that extends race to unpredictable level and so makes additional post-read almost mandatory. May be with rereading counter after programming comparator (same as done for HPET, reading which is probably much more expensive) this value could be reduced. I see. There are some bigger fish to fry at the moment though :) Speaking of the HPET code, it seems to me that its restart logic can fire the same event twice. Is that harmless? It is much less harmful then not fire and loose interrupt completely, and as result stuck for indefinite time. I have no better ideas, while this seems works well. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Sun, 13 Jan 2013, Alexander Motin wrote: On 13.01.2013 20:09, Marius Strobl wrote: On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: On 06.01.2013 17:23, Marius Strobl wrote: I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using spinlock, but in waiting for other CPU while spinlock is held. Other CPU may also hold spinlock and wait for something, causing deadlock. i8254 code uses spinlock just to atomically access hardware registers, so it causes no problems. Okay, but wouldn't that be a general problem then? Pretty much anything triggering an IPI holds smp_ipi_mtx while doing so and the lower level IPI stuff waits for other CPU(s), including on x86. The problem is general. But now it works because single smp_ipi_mtx is used in all cases where IPI result is waited. As soon as spinning happens with interrupts still enabled, there is no deadlocks. But problem reappears if any different lock is used, or locks are nested. In existing code in HEAD and 9 timecounters are never called with spin mutex held. I intentionally tried to avoid that in existing eventtimers code. Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other lock-free accesses to the timecounter state. This still works in -current (at least there is still code for it). OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to aquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. Callout code same time can be called in any environment with any locks held. And new callout code may need to know precise current time in any of those conditions. Attempt to use an IPI and wait there can be fatal. Callout code can't be called from such a general any environment as timecounter code. Not from a fast interrupt handler. Not from an NMI or IPI handler. I hope. But timecounter code has a good chance of working even for the last 2 environments, due to its design requirement of working in the first. The spinlock in the i8254 timecounter certainly breaks some cases. For example, suppose the lock is held for a timecounter read from normal context. It masks hardware interrupts on the current CPU (except in my version). It doesn't mask NMIs or other traps. So if the NMI or other trap handler does a timecounter hardware call, there is deadlock in at least the !SMP case. In my version, it blocks normal interrupts later if they occur, but doesn't block fast interrupts, so the pps_capture() call would deadlock if it occurs, like a timecounter call from an NMI. I avoid this by not using pps in any fast interrupt handler, and by only using the i8254 timecounter for testing. I do use pps in a (nonstandard) x86 RTC clock interrupt handler. My clock interrupt handlers are all non-fast to avoid this and other locking problems. FYI, these are the results of the v215 (btw., these (ab)use a bus cycle counter of the host-PCI-bridge as timecounter) with your calloutng_12_17.patch and kern.timecounter.alloweddeviation=0: select 1 23.82 poll 1 1008.23 usleep 1 23.31 nanosleep 1 23.17 kqueue 1 1010.35 kqueueto 1 26.26 syscall1 1.91 select 300307.72 poll 300 1008.23 usleep 300307.64 nanosleep300 23.21 Please fix the tv_nsec initialization so that we can see if nanosleep() and
Re: [RFC/RFT] calloutng
On 06.01.2013 17:23, Marius Strobl wrote: On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote: On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Well, I've just learnt two things the hard way: a) We really need the mutex in that path. b) Assuming that the initial synchronization of the counters is good enough and they won't drift considerably accross the CPUs so we can always use the local one makes things go south pretty soon after boot. At least with your calloutng_12_26.patch applied. Do you think it means they are not really synchronized for some reason? I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The problem is not in using spinlock, but in waiting for other CPU while spinlock is held. Other CPU may also hold spinlock and wait for something, causing deadlock. i8254 code uses spinlock just to atomically access hardware registers, so it causes no problems. The affected machines are equipped with a x86-style south bridge which exposes a powermanagment unit (intended to be used as a SMBus bridge only in these machines) on the PCI bus. Actually, this device also includes an ACPI power management timer. However, I've just spent a day trying to get that one working without success - it just doesn't increment. Probably its clock input isn't connected as it's not intended to be used in these machines. That south bridge also includes 8254 compatible timers on the ISA/ LPC side, but are hidden from the OFW device tree. I can hack these devices into existence and give it a try, but even if that works this likely would use the same code as the x86 i8254_get_timecount() so I don't see what would be gained with that. The last thing in order to avoid using the tick counter as timecounter in the MP case I can think of is that the Broadcom MACs in the affected machines also provide a counter driven by a 1 MHz clock. If that's good enough for a timecounter I can hook these up (in case these work ...) and hack bge(4) to not detach in that case (given that we can't detach timecounters ...). i8254 on x86 is also just a bit above 1MHz. Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I've run Ian's set of tests on a v215 with and without your calloutng_12_26.patch and on a v210 (these uses the IPI approach) with the latter also applied. I'm not really sure what to make out of the numbers. v215 w/o v215 w/
Re: [RFC/RFT] calloutng
On 06.01.2013 18:20, Luigi Rizzo wrote: On Sun, Jan 06, 2013 at 04:23:13PM +0100, Marius Strobl wrote: On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote: Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I've run Ian's set of tests on a v215 with and without your calloutng_12_26.patch and on a v210 (these uses the IPI approach) with the latter also applied. I'm not really sure what to make out of the numbers. v215 w/o v215 w/ v210 w/ -- select 1 1999.61 1 23.87 1 29.97 poll1 1999.70 1 1069.61 1 1075.24 usleep 1 1999.86 1 23.43 1 28.99 nanosleep 1999.92 1 23.28 1 28.66 kqueue 1 1000.12 1 1071.13 1 1076.35 kqueueto1999.56 1 26.33 1 31.34 syscall 1 1.89 1 1.92 1 2.88 select300 1999.72300326.08300332.24 poll 300 1999.12300 1069.78300 1075.82 usleep300 1999.91300325.63300330.94 nanosleep 300999.82300 23.25300 26.76 kqueue300 1000.14300 1071.06300 1075.96 kqueueto 300999.53300 26.32300 31.42 syscall 300 1.90300 1.93300 2.89 select 3000 3998.18 3000 3176.51 3000 3193.86 poll 3000 3999.29 3000 3182.21 3000 3193.12 usleep 3000 3998.46 3000 3191.60 3000 3192.50 nanosleep3000 1999.79 3000 23.21 3000 27.02 kqueue 3000 3000.12 3000 3189.13 3000 3191.96 kqueueto 3000 1999.99 3000 26.28 3000 31.91 syscall 3000 1.91 3000 1.91 3000 2.90 select 3 30990.85 3 31489.18 3 31548.77 poll3 30995.25 3 31518.80 3 31487.92 usleep 3 30992.00 3 31510.42 3 31475.50 nanosleep 3 1999.46 3 38.67 3 41.95 kqueue 3 30006.49 3 30991.86 3 30996.77 kqueueto3 1999.09 3 41.67 3 46.36 syscall 3 1.91 3 1.91 3 2.88 select 30 300990.65 30 301864.98 30 301787.01 poll 30 300998.09 30 301831.36 30 301741.62 usleep 30 300990.80 30 301824.67 30 301770.10 nanosleep 30 1999.15 30325.74 30331.01 kqueue 30 30.87 30 301031.11 30 300992.28 kqueueto 30 1999.39 30328.77 30333.45 syscall30 1.91 30 1.91 30 2.88 the nanosleep and kqueueto tests are probably passing the wrong argument to the syscall (meant to be microseconds, but nanosleep takes nanosecond so it should probably be multiplied by 1000). Yes, you are right. I've used it in such way to see what will happen if I request sub-microsecond interval. In this test these rows are definitely incorrect. I think that for the time being it would be useful to run at least one set of tests with kern.timecounter.alloweddeviation=0 so we can tell how close we get to the required timeouts May be just to be sure, because it should not significantly affect results of the 1us tests, as 5% of 1us is much less then numbers we see there. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Tue, Jan 08, 2013 at 12:55:27PM +0200, Alexander Motin wrote: On 06.01.2013 18:20, Luigi Rizzo wrote: ... I think that for the time being it would be useful to run at least one set of tests with kern.timecounter.alloweddeviation=0 so we can tell how close we get to the required timeouts May be just to be sure, because it should not significantly affect results of the 1us tests, as 5% of 1us is much less then numbers we see there. to clarify - i don't mind if we are 50-100us (absolute error) off the requested timeout for short intervals, but i want to be sure that this error can be achieved also for large requests. cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote: On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Well, I've just learnt two things the hard way: a) We really need the mutex in that path. b) Assuming that the initial synchronization of the counters is good enough and they won't drift considerably accross the CPUs so we can always use the local one makes things go south pretty soon after boot. At least with your calloutng_12_26.patch applied. I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The affected machines are equipped with a x86-style south bridge which exposes a powermanagment unit (intended to be used as a SMBus bridge only in these machines) on the PCI bus. Actually, this device also includes an ACPI power management timer. However, I've just spent a day trying to get that one working without success - it just doesn't increment. Probably its clock input isn't connected as it's not intended to be used in these machines. That south bridge also includes 8254 compatible timers on the ISA/ LPC side, but are hidden from the OFW device tree. I can hack these devices into existence and give it a try, but even if that works this likely would use the same code as the x86 i8254_get_timecount() so I don't see what would be gained with that. The last thing in order to avoid using the tick counter as timecounter in the MP case I can think of is that the Broadcom MACs in the affected machines also provide a counter driven by a 1 MHz clock. If that's good enough for a timecounter I can hook these up (in case these work ...) and hack bge(4) to not detach in that case (given that we can't detach timecounters ...). Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I've run Ian's set of tests on a v215 with and without your calloutng_12_26.patch and on a v210 (these uses the IPI approach) with the latter also applied. I'm not really sure what to make out of the numbers. v215 w/o v215 w/ v210 w/ -- select 1 1999.61 1 23.87 1 29.97 poll1 1999.70 1 1069.61 1 1075.24 usleep 1 1999.86 1 23.43 1 28.99 nanosleep 1999.92 1 23.28 1 28.66 kqueue 1 1000.12 1 1071.13 1 1076.35
Re: [RFC/RFT] calloutng
On Sun, Jan 06, 2013 at 04:23:13PM +0100, Marius Strobl wrote: On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote: On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Well, I've just learnt two things the hard way: a) We really need the mutex in that path. b) Assuming that the initial synchronization of the counters is good enough and they won't drift considerably accross the CPUs so we can always use the local one makes things go south pretty soon after boot. At least with your calloutng_12_26.patch applied. I'm not really sure what to do about that. Earlier you already said that sched_bind(9) also isn't an option in case if td_critnest 1. To be honest, I don't really unerstand why using a spin lock in the timecounter path makes sparc64 the only problematic architecture for your changes. The x86 i8254_get_timecount() also uses a spin lock so it should be in the same boat. The affected machines are equipped with a x86-style south bridge which exposes a powermanagment unit (intended to be used as a SMBus bridge only in these machines) on the PCI bus. Actually, this device also includes an ACPI power management timer. However, I've just spent a day trying to get that one working without success - it just doesn't increment. Probably its clock input isn't connected as it's not intended to be used in these machines. That south bridge also includes 8254 compatible timers on the ISA/ LPC side, but are hidden from the OFW device tree. I can hack these devices into existence and give it a try, but even if that works this likely would use the same code as the x86 i8254_get_timecount() so I don't see what would be gained with that. The last thing in order to avoid using the tick counter as timecounter in the MP case I can think of is that the Broadcom MACs in the affected machines also provide a counter driven by a 1 MHz clock. If that's good enough for a timecounter I can hook these up (in case these work ...) and hack bge(4) to not detach in that case (given that we can't detach timecounters ...). Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I've run Ian's set of tests on a v215 with and without your calloutng_12_26.patch and on a v210 (these uses the IPI approach) with the latter also applied. I'm not really sure what to make out of the numbers. v215 w/o v215 w/ v210 w/ -- select 1 1999.61 1 23.87 1 29.97 poll1 1999.70 1 1069.61 1 1075.24 usleep 1 1999.86 1 23.43
Re: [RFC/RFT] calloutng
On Wed, Jan 02, 2013 at 09:52:37PM -0800, Kevin Oberman wrote: On Wed, Jan 2, 2013 at 2:06 PM, Alexander Motin m...@freebsd.org wrote: On 02.01.2013 18:08, Adrian Chadd wrote: .. I'm pretty damned sure we're going to need to enforce a never earlier than X latency. Do you mean here that we should never wake up before specified time (just as specified by the most of existing APIs), or that we should not allow sleep shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't allow to block the system. Also there is already present mechanism for specifying minimum timer programming interval in eventtimers(9) KPI. I can see serious performance issues with some hardware (wireless comes to mind) if things happen too quickly. Intuition is that it could also play hob with VMs. I believe that the proper way is to wake between T_X and T_X + D. This assumes that D is max_wake_delay, not deviation, which leaves us at the original of (T_X) = event_time = (T_X + D). i think max delay was the intended meaning of the D parameter. We picked bad names (tolerance, deviation,...) for it. cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Wed, 2 Jan 2013, Alexander Motin wrote: On 02.01.2013 19:09, Konstantin Belousov wrote: On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote: Probably one way to close this discussion would be to provide a sysctl so the sysadmin can decide which point in the interval to pick when there is no suitable callout already scheduled. Isn't trying to synchronize to the external events in this way unsafe ? I remember, but cannot find the reference right now, a scheduler exploit(s) which completely hide malicious thread from the time accounting, by making it voluntary yielding right before statclock should fire. If statistic gathering could be piggy-backed on the external interrupt, and attacker can control the source of the external events, wouldn't this give her a handle ? Fine-grained timeouts complete fully opening this security hole. Synchronization without fine-grained timeouts might allow the same, but is harder to exploit since you can't control the yielding points directly. With fine-grained timeouts, you just have to predict the statclock firing points. Use one timeout to arrange to yield just before statclock fires and another to regain control just after it has fired. If the timeout resolution is say 50 usec, then this can hope to run for all except 100 usec out of every 1/stathz seconds. With stathz = 128, 1/stathz is 7812 usec, so this gives 7712/7812 of the CPU with 0 statclock ticks. Since the scheduler never sees you running, your priority remains minimal, so the scheduler should prefer to run you whenever a timeout expires, with only round-robin with other minimal-priority threads preventing you getting 7712/7812 of the (user non-rtprio) CPU. The previous stage of fully opening this security hole was changing (the default) HZ from 100 to 1000. HZ must not be much smaller than stathz, else the security hole is almost fully open. With HZ = 100 being less than stathz and timeout granularity limiting the fine control to 2/HZ = 20 msec (except you can use a periodic itimer to get a 1/HZ granularity at a minor cost of getting more SIGALRMs), it is impossible to get near 100% of the CPU with 0 statclock ticks. After yielding, you can't get control for another 100 or 200 msec. Since this exceeds 1/stathz = 78.12 usec, you can only hide from statclock ticks by not running very often or for very long. Limited hiding is possible by wasting even more CPU to determine when to hide: since the timeout granularity is large, it is also ineffective for determining when to yield. So when running, you must poll the current time a lot to determine when to yield. Yield just before statclock fires, as above. (Do it 50 usec early, as above, to avoid most races involving polling the time.) This actually has good chances of not limiting the hiding too much, depending on the details of the scheduling. It yields just before a statclock tick. After this tick fires, if the scheduler reschedules for any reason, then the hiding process would most likely be run again, since its priority is minimal. But at least the old 4BSD scheduler doesn't reschedule after _every_ statclock tick. This depends on the bugfeature that the priority is not checked on _every_ return to user mode (sched_clock() does change the priority, but this is not acted on until much later). Without this bugfeature, there would be excessive context switches. OTOH, with timeouts, at least old non-fine-grained ones, you can force a rescheduling that is acted on soon enough simply by using timeouts (since timeouts give a context switch to the softclock thread, the scheduler has no option to skip checking the priority on return to user mode). After the previous stage of changing HZ to 1000, the granuarity is fine enough for using timeouts to hide from the scheduler. Using a periodic itimer to get a granularity of 1000 usec, start hiding 50-1000 usec before each statclock tick and regain control 1000 usec later. With stathz = 128, 6812/7812 of the CPU with 0 statclock ticks. Not much worse (for the hider) than 7712/7812. Statclock was supposed to be aperiodic to avoid hiding (see statclk-usenix93.ps), but this was never implemented in FreeBSD. With fine-grained timeouts, it would have to be very aperiodic, to the point of giving large inaccuracies, to limit the hiding very much. For example, suppose that it has an average period of 7812 usec with +-50% jitter. You would try to hide from it most of the time by running for a bit less than 7812/2 usec before yielding in most cases. If too much scheduling is done on each statclock tick, then you are likely to regain control after each one (as above) and then know that there is almost a full minimal period until the next one. Otherwise, it seems to be necessary to determine when the previous statclock tick occurred, so as to determine the minimum time until the next one. There are many different kinds of accounting with different characteristics. Run time for each thread
Re: [RFC/RFT] calloutng
On 03.01.2013 16:45, Bruce Evans wrote: On Wed, 2 Jan 2013, Alexander Motin wrote: More important for scheduling fairness thread's CPU percentage is also based on hardclock() and hiding from it was trivial before, since all sleep primitives were strictly aligned to hardclock(). Now it is slightly less trivial, since this alignment was removed and user-level APIs provide no easy way to enforce it. %cpu is actually based on statclock(), and not even used for scheduling. May be for SCHED_4BSD, but not for SCHED_ULE. In SCHED_ULE both %cpu and thread priority based on the same ts_ticks counter, that is based on hardclock() as time source. Interactivity calculation uses alike logic and uses the same time source. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Thu, 3 Jan 2013, Alexander Motin wrote: On 03.01.2013 16:45, Bruce Evans wrote: On Wed, 2 Jan 2013, Alexander Motin wrote: More important for scheduling fairness thread's CPU percentage is also based on hardclock() and hiding from it was trivial before, since all sleep primitives were strictly aligned to hardclock(). Now it is slightly less trivial, since this alignment was removed and user-level APIs provide no easy way to enforce it. %cpu is actually based on statclock(), and not even used for scheduling. May be for SCHED_4BSD, but not for SCHED_ULE. In SCHED_ULE both %cpu and thread priority based on the same ts_ticks counter, that is based on hardclock() as time source. Interactivity calculation uses alike logic and uses the same time source. Hmm. I missed this because it hacks on the 'ticks' global. It is clearer in intermediate versions which use the scheduler API sched_tick(), which is the hardclock analogue of sched_clock() for statclock. sched_tick() is now bogus since it is null for all schedulers. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote: On 31.12.2012 08:17, Luigi Rizzo wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. Thank you for testing, Ian. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300usprobably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? All mentioned effects could be explained with implemented logic. 50us at 1us is probably sum of minimal latency of the hardware eventtimer on the specific platform and some software processing overhead (syscall, callout, timecouters, scheduler, etc). At later points system starts to noticeably use precision specified by kern.timecounter.alloweddeviation sysctl. It affects results from two sides: 1) extending intervals for specified percent of time to allow event aggregation, and 2) choosing time base between fast getbinuptime() and precise binuptime(). Extending interval is needed to aggregate not only callouts with each other, but also callouts with other system events, which are impossible to schedule in advance. It gives specified relative error, but no more then one CPU wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) it is 1/hz, for completely idle one it can be up to 0.5s. Second point allows to reduce processing overhead by the cost of error up to 1/hz for long periods ((100/allowed)*(1/hz)), when it is used. i am not sure what you mean by extending interval, but i believe the logic should be the following: - say user requests a timeout after X seconds and with a tolerance of D second (both X and D are fractional, so they can be short). Interpret this as the system should do its best to generate an event between X and X+D seconds - convert X to an absolute time, T_X - if there are any pending events already scheduled between T_X and T_X+D, then by definition they are acceptable. Attach the requested timeout to the earliest of these events. - otherwise, schedule an event at time T_X (because there is no valid reason to generate a late event, and it makes no sense from an energy saving standpoint, either -- see below). It seems to me that you are instead extending the requested interval upfront, which causes some gratuitous pessimizations in scheduling the callout. Re. energy savings: the gain in extending the timeout cannot exceed the value D/X. So while it may make sense to extend a 1us request to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s, it is completely pointless from an energy saving standpoint to introduce a 10ms error on a 300ms request. (even though i hate the idea that a 1us request defaults to a 50us delay; but that is hopefully something that can be tuned in a
Re: [RFC/RFT] calloutng
On 02.01.2013 12:57, Luigi Rizzo wrote: On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote: On 31.12.2012 08:17, Luigi Rizzo wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. Thank you for testing, Ian. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? All mentioned effects could be explained with implemented logic. 50us at 1us is probably sum of minimal latency of the hardware eventtimer on the specific platform and some software processing overhead (syscall, callout, timecouters, scheduler, etc). At later points system starts to noticeably use precision specified by kern.timecounter.alloweddeviation sysctl. It affects results from two sides: 1) extending intervals for specified percent of time to allow event aggregation, and 2) choosing time base between fast getbinuptime() and precise binuptime(). Extending interval is needed to aggregate not only callouts with each other, but also callouts with other system events, which are impossible to schedule in advance. It gives specified relative error, but no more then one CPU wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) it is 1/hz, for completely idle one it can be up to 0.5s. Second point allows to reduce processing overhead by the cost of error up to 1/hz for long periods ((100/allowed)*(1/hz)), when it is used. i am not sure what you mean by extending interval, but i believe the logic should be the following: - say user requests a timeout after X seconds and with a tolerance of D second (both X and D are fractional, so they can be short). Interpret this as the system should do its best to generate an event between X and X+D seconds - convert X to an absolute time, T_X - if there are any pending events already scheduled between T_X and T_X+D, then by definition they are acceptable. Attach the requested timeout to the earliest of these events. All above is true, but not following. - otherwise, schedule an event at time T_X (because there is no valid reason to generate a late event, and it makes no sense from an energy saving standpoint, either -- see below). System may have many interrupts except timer: network, disk, ... WiFi cards generate interrupts with AP beacon rate -- dozens times per second. It is not very efficient to wake up CPU precisely at T_X time, that may be just 100us earlier then next hardware interrupt. That's why timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next statclock, ...). As result, event will be handled within allowed range, but real delay will depends on current environment conditions. It seems to me that you are instead extending the requested interval
Re: [RFC/RFT] calloutng
On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote: On 02.01.2013 12:57, Luigi Rizzo wrote: ... i am not sure what you mean by extending interval, but i believe the logic should be the following: - say user requests a timeout after X seconds and with a tolerance of D second (both X and D are fractional, so they can be short). Interpret this as the system should do its best to generate an event between X and X+D seconds - convert X to an absolute time, T_X - if there are any pending events already scheduled between T_X and T_X+D, then by definition they are acceptable. Attach the requested timeout to the earliest of these events. All above is true, but not following. - otherwise, schedule an event at time T_X (because there is no valid reason to generate a late event, and it makes no sense from an energy saving standpoint, either -- see below). System may have many interrupts except timer: network, disk, ... WiFi cards generate interrupts with AP beacon rate -- dozens times per second. It is not very efficient to wake up CPU precisely at T_X time, that may be just 100us earlier then next hardware interrupt. That's why timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next statclock, ...). As result, event will be handled within allowed range, but real delay will depends on current environment conditions. I don't see why system events (hardclock, statclock, 0.5s,...) need to be treated specially -- and i am saying this also in the interest of simplifying the logic of the code. First of all, if you know that there is already a hardclock/statclock/* scheduled in [T_X, T_X+D] you just reuse that. This particular bullet was no event scheduled in [T_X, T_X+D] so you need to generate a new one. Surely scheduling the event at T_X+D instead of T_X increases the chance of merging events. But the saving are smaller and smaller as the value X increases. This particular client will only change its request rate from 1/X to 1/(X+D) so in relative terms the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X Example: if X = 300ms, and D = 10ms (as in the test case) you just save one interrupt every 30seconds by scheduling at T_X+D instead of T_X. Are we actually able to measure the difference ? Even at high interrupt rates (e.g. X = 1ms) you are not going to save a lot unless the tolerance D is very large, which is generally undesirable for other reasons (presumably, applications are not going to be happy if you artificially double their timeouts). Now, say your application requests timeouts every X = 300ms. It seems to me that you are instead extending the requested interval upfront, which causes some gratuitous pessimizations in scheduling the callout. Re. energy savings: the gain in extending the timeout cannot exceed the value D/X. So while it may make sense to extend a 1us request to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s, it is completely pointless from an energy saving standpoint to introduce a 10ms error on a 300ms request. I am not so sure in this. When CPU package is in C7 sleep state with all buses and caches shut down and memory set to self refresh, it consumes very few (some milli-Watts) of power. Wake up from that state takes 100us or even more with power consumption much higher then normal operational one. Sure, if we compare it with power consumption of 100% CPU load, difference between 10 and 100 wakeups per second may be small, but when comparing to each other in some low-power environment for mostly idle system it may be much more significant. see above -- at low rates the difference is not measurable, at high rates thCe only obvious answer is do not use C7 unless if the next interrupt is due in less than 2..5 milliseconds (even though i hate the idea that a 1us request defaults to a 50us delay; but that is hopefully something that can be tuned in a platform-specific way and perhaps at runtime). It is 50us on this ARM. On SandyBridge Core i7 it is only about 2us. very good, i suspected something similar, just wanted to be sure :) cheers luigi -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал: On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote: On 02.01.2013 12:57, Luigi Rizzo wrote: ... i am not sure what you mean by extending interval, but i believe the logic should be the following: - say user requests a timeout after X seconds and with a tolerance of D second (both X and D are fractional, so they can be short). Interpret this as the system should do its best to generate an event between X and X+D seconds - convert X to an absolute time, T_X - if there are any pending events already scheduled between T_X and T_X+D, then by definition they are acceptable. Attach the requested timeout to the earliest of these events. All above is true, but not following. - otherwise, schedule an event at time T_X (because there is no valid reason to generate a late event, and it makes no sense from an energy saving standpoint, either -- see below). System may have many interrupts except timer: network, disk, ... WiFi cards generate interrupts with AP beacon rate -- dozens times per second. It is not very efficient to wake up CPU precisely at T_X time, that may be just 100us earlier then next hardware interrupt. That's why timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next statclock, ...). As result, event will be handled within allowed range, but real delay will depends on current environment conditions. I don't see why system events (hardclock, statclock, 0.5s,...) need to be treated specially -- and i am saying this also in the interest of simplifying the logic of the code. Sure. That is mostly for historical reasons. At some point they should disappear, just not now, as patch is already quite big. First of all, if you know that there is already a hardclock/statclock/* scheduled in [T_X, T_X+D] you just reuse that. This particular bullet was no event scheduled in [T_X, T_X+D] so you need to generate a new one. That is true, but my main point was about merging with external events, which I can't predict and the only way to merge is increase sleep period, hoping for better. Surely scheduling the event at T_X+D instead of T_X increases the chance of merging events. But the saving are smaller and smaller as the value X increases. This particular client will only change its request rate from 1/X to 1/(X+D) so in relative terms the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X Example: if X = 300ms, and D = 10ms (as in the test case) you just save one interrupt every 30seconds by scheduling at T_X+D instead of T_X. Are we actually able to measure the difference ? Even at high interrupt rates (e.g. X = 1ms) you are not going to save a lot unless the tolerance D is very large, which is generally undesirable for other reasons (presumably, applications are not going to be happy if you artificially double their timeouts). Now, say your application requests timeouts every X = 300ms. With default precision set to 5% it will be only 5% save from periods increase. But that is absolutely not my goal! Imagine different case: you have NIC interrupts at 1000Hz. Also you have 100 callouts with 100ms period each. If we program timer with absolute precision, you will get about 2000Hz of total interrupt rate. But if we allow just 2% deviation, most of callouts will be grouped with NIC interrupts and total rate will be 1000Hz. Loosing _less_ then 2% of precision we are reducing interrupt rate in _half_! It seems to me that you are instead extending the requested interval upfront, which causes some gratuitous pessimizations in scheduling the callout. Re. energy savings: the gain in extending the timeout cannot exceed the value D/X. So while it may make sense to extend a 1us request to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s, it is completely pointless from an energy saving standpoint to introduce a 10ms error on a 300ms request. I am not so sure in this. When CPU package is in C7 sleep state with all buses and caches shut down and memory set to self refresh, it consumes very few (some milli-Watts) of power. Wake up from that state takes 100us or even more with power consumption much higher then normal operational one. Sure, if we compare it with power consumption of 100% CPU load, difference between 10 and 100 wakeups per second may be small, but when comparing to each other in some low-power environment for mostly idle system it may be much more significant. see above -- at low rates the difference is not measurable, at high rates thCe only obvious answer is do not use C7 unless if the next interrupt is due in less than 2..5 milliseconds (even though i hate the idea that a 1us request defaults to a 50us delay; but that is hopefully something that can be tuned in a platform-specific way and perhaps at runtime). It is 50us on this ARM.
Re: [RFC/RFT] calloutng
On Wed, 2013-01-02 at 15:11 +0200, Alexander Motin wrote: 02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал: On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote: On 02.01.2013 12:57, Luigi Rizzo wrote: First of all, if you know that there is already a hardclock/statclock/* scheduled in [T_X, T_X+D] you just reuse that. This particular bullet was no event scheduled in [T_X, T_X+D] so you need to generate a new one. That is true, but my main point was about merging with external events, which I can't predict and the only way to merge is increase sleep period, hoping for better. This really is the crux of the problem, because you can't *by default* dispatch an event earlier than requested because that's just a violation of the usual rules of precision timing (where you expect to be late but never early). Sometimes there is no need for such precision, and an early wakeup is no more or less detrimental than a late wakeup. In fact, that may even be the majority case. I wonder if it might make sense to allow the precision specification to indicate whether it needs traditional never early behavior or whether it can be interpretted as plus or minus this amount is fine. Like maybe negative precision is interpretted as plus or minus abs(precision) or something like that. Or maybe even the other way around... you get plus or minus precision by default and the few things that really care about precision timing have a way of indicating that. (But in that case the userland sleeps would have to assume the traditional behavior because that's how they've always been documented.) -- Ian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote: ... First of all, if you know that there is already a hardclock/statclock/* scheduled in [T_X, T_X+D] you just reuse that. This particular bullet was no event scheduled in [T_X, T_X+D] so you need to generate a new one. That is true, but my main point was about merging with external events, which I can't predict and the only way to merge is increase sleep period, hoping for better. ok, now i understand why you want to schedule for T_X+D. Probably one way to close this discussion would be to provide a sysctl so the sysadmin can decide which point in the interval to pick when there is no suitable callout already scheduled. cheers luigi -+--- Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione http://www.iet.unipi.it/~luigi/. Universita` di Pisa TEL +39-050-2211611 . via Diotisalvi 2 Mobile +39-338-6809875 . 56122 PISA (Italy) -+--- ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote: On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote: ... First of all, if you know that there is already a hardclock/statclock/* scheduled in [T_X, T_X+D] you just reuse that. This particular bullet was no event scheduled in [T_X, T_X+D] so you need to generate a new one. That is true, but my main point was about merging with external events, which I can't predict and the only way to merge is increase sleep period, hoping for better. ok, now i understand why you want to schedule for T_X+D. Probably one way to close this discussion would be to provide a sysctl so the sysadmin can decide which point in the interval to pick when there is no suitable callout already scheduled. Isn't trying to synchronize to the external events in this way unsafe ? I remember, but cannot find the reference right now, a scheduler exploit(s) which completely hide malicious thread from the time accounting, by making it voluntary yielding right before statclock should fire. If statistic gathering could be piggy-backed on the external interrupt, and attacker can control the source of the external events, wouldn't this give her a handle ? pgpVV7IP9DZbJ.pgp Description: PGP signature
Re: [RFC/RFT] calloutng
.. I'm pretty damned sure we're going to need to enforce a never earlier than X latency. Is there a more detailed writeup of calloutng somewhere, besides David's slides? The wiki page is rather empty. Eg - I think this work does coalesce wakeups, right? Or it can? So when in low-power scenarios you can end up with lower-resolution callout periods, but many less CPU wakeups a second? (Do we actually _expose_ wakeups-per-second somewhere?) Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 02.01.2013 19:09, Konstantin Belousov wrote: On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote: Probably one way to close this discussion would be to provide a sysctl so the sysadmin can decide which point in the interval to pick when there is no suitable callout already scheduled. Isn't trying to synchronize to the external events in this way unsafe ? I remember, but cannot find the reference right now, a scheduler exploit(s) which completely hide malicious thread from the time accounting, by making it voluntary yielding right before statclock should fire. If statistic gathering could be piggy-backed on the external interrupt, and attacker can control the source of the external events, wouldn't this give her a handle ? There are many different kinds of accounting with different characteristics. Run time for each thread calculated using high resolution per-CPU clocks on each context switch. It is impossible to hide from it. System load average updated using callout and aligned with hardclock(). Hiding from it was easy before, but it can be made more complicated (asynchronous) now. Per-CPU SYSTEM/INTERRUPT/USER/NICE/IDLE counters are updated by statcklock(), that is asynchronous to hardclock() and hiding from it supposed to be more complicated. But even before it was possible, since hardclock() frequency is 8 times higher then statclock() one. More important for scheduling fairness thread's CPU percentage is also based on hardclock() and hiding from it was trivial before, since all sleep primitives were strictly aligned to hardclock(). Now it is slightly less trivial, since this alignment was removed and user-level APIs provide no easy way to enforce it. The only way to get really safe accounting is forget about sampling and use potentially more expensive other ways. It was always stopped by lack of cheap and reliable clocks. But since TSC is P-state invariant on most of CPUs present now it could probably be reconsidered. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 02.01.2013 18:08, Adrian Chadd wrote: .. I'm pretty damned sure we're going to need to enforce a never earlier than X latency. Do you mean here that we should never wake up before specified time (just as specified by the most of existing APIs), or that we should not allow sleep shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't allow to block the system. Also there is already present mechanism for specifying minimum timer programming interval in eventtimers(9) KPI. Is there a more detailed writeup of calloutng somewhere, besides David's slides? The wiki page is rather empty. There are updated manual pages in the patch. Also Davide written some blog during GSoC. Now we are working on papers for the AsiaBSDCon. Eg - I think this work does coalesce wakeups, right? Or it can? So when in low-power scenarios you can end up with lower-resolution callout periods, but many less CPU wakeups a second? This work does coalesce wakeups out of the box, but also provide ways to improve it further, where possible. With additional tuning of some kernel subsystems and drivers I was able to drop total idle interrupt rate down to 10-15Hz on arm and 20-30Hz on x86. (Do we actually _expose_ wakeups-per-second somewhere?) On systems with ACPI there are average per-CPU sleep times exposed via sysctls dev.cpu.X.cx_usage. Also cpu_idle() call rate calculated by both schedulers for purposes of idle loop optimizations, but it is not exposed outside now. Also for idle SMP system enabling COUNT_IPIS should give number of interrupts in systat comparable to number of wakeups. I am mostly using the last way. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Wed, Jan 2, 2013 at 2:06 PM, Alexander Motin m...@freebsd.org wrote: On 02.01.2013 18:08, Adrian Chadd wrote: .. I'm pretty damned sure we're going to need to enforce a never earlier than X latency. Do you mean here that we should never wake up before specified time (just as specified by the most of existing APIs), or that we should not allow sleep shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't allow to block the system. Also there is already present mechanism for specifying minimum timer programming interval in eventtimers(9) KPI. I can see serious performance issues with some hardware (wireless comes to mind) if things happen too quickly. Intuition is that it could also play hob with VMs. I believe that the proper way is to wake between T_X and T_X + D. This assumes that D is max_wake_delay, not deviation, which leaves us at the original of (T_X) = event_time = (T_X + D). -- R. Kevin Oberman, Network Engineer E-mail: kob6...@gmail.com ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Sun, Dec 30, 2012 at 10:17 PM, Luigi Rizzo ri...@iet.unipi.it wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? The first question is what couple timecounter/eventtimer is used. Some hardware is slower than other, and this may affect numbers. Thanks for testing, BTW. Davide cheers luigi for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done With calloutng_12_26.patch... HZ=100 HZ=250 HZ=1000 -- select 1 55.79 1 50.96 1 61.32 poll1 1109.46 1 1107.86 1 1114.38 usleep 1 56.33 1 72.90 1 62.78 nanosleep 1 52.66 1 55.23 1 64.23 kqueue 1 1114.23 1 1113.81 1 1121.21 kqueueto1 65.44 1 71.00 1 75.01 syscall 1 4.70 1 4.45 1 4.55 select300355.79300357.76300362.35 poll 300 1107.85300 1122.55300 1115.62 usleep300355.28300357.28300360.79 nanosleep 300354.49300355.82300360.62 kqueue300 1112.57300 1118.13300 1117.16 kqueueto 300375.98300378.62300395.61 syscall 300 4.41300 4.45300 4.54 select 3000 3246.75 3000 3246.74 3000 3252.72 poll 3000 3238.10 3000 3229.12 3000 3250.10 usleep 3000 3242.47 3000 3237.06 3000 3249.61 nanosleep3000 3238.79 3000 3231.55 3000 3248.11 kqueue 3000 3240.01 3000 3236.07 3000 3247.60 kqueueto 3000 3265.36 3000 3267.22 3000 3274.96 syscall 3000 4.69 3000 4.44 3000 4.50 select 3 31714.60 3 31941.17 3 32467.69 poll3 31522.76 3 31983.00 3 32497.81 usleep 3
Re: [RFC/RFT] calloutng
On 31.12.2012 08:17, Luigi Rizzo wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. I've rewritten kqueue timeouts at the calloutng_12_26.patch. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. Thank you for testing, Ian. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? All mentioned effects could be explained with implemented logic. 50us at 1us is probably sum of minimal latency of the hardware eventtimer on the specific platform and some software processing overhead (syscall, callout, timecouters, scheduler, etc). At later points system starts to noticeably use precision specified by kern.timecounter.alloweddeviation sysctl. It affects results from two sides: 1) extending intervals for specified percent of time to allow event aggregation, and 2) choosing time base between fast getbinuptime() and precise binuptime(). Extending interval is needed to aggregate not only callouts with each other, but also callouts with other system events, which are impossible to schedule in advance. It gives specified relative error, but no more then one CPU wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) it is 1/hz, for completely idle one it can be up to 0.5s. Second point allows to reduce processing overhead by the cost of error up to 1/hz for long periods ((100/allowed)*(1/hz)), when it is used. To get best possible precision kern.timecounter.alloweddeviation sysctl can be set to smaller value. Setting it to 0 will effectively disable all optimizations, but should give 50us precision in all cases. for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done With calloutng_12_26.patch... HZ=100 HZ=250 HZ=1000 -- select 1 55.79 1 50.96 1 61.32 poll1 1109.46 1 1107.86 1 1114.38 usleep 1 56.33 1 72.90 1 62.78 nanosleep 1 52.66 1 55.23 1 64.23 kqueue 1 1114.23 1 1113.81 1 1121.21 kqueueto1 65.44 1 71.00 1 75.01 syscall 1 4.70 1 4.45 1 4.55 select300355.79300357.76
Re: [RFC/RFT] calloutng
On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote: On 31.12.2012 08:17, Luigi Rizzo wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. I've rewritten kqueue timeouts at the calloutng_12_26.patch. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. Thank you for testing, Ian. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? All mentioned effects could be explained with implemented logic. 50us at 1us is probably sum of minimal latency of the hardware eventtimer on the specific platform and some software processing overhead (syscall, callout, timecouters, scheduler, etc). At later points system starts to noticeably use precision specified by kern.timecounter.alloweddeviation sysctl. It affects results from two sides: 1) extending intervals for specified percent of time to allow event aggregation, and 2) choosing time base between fast getbinuptime() and precise binuptime(). Extending interval is needed to aggregate not only callouts with each other, but also callouts with other system events, which are impossible to schedule in advance. It gives specified relative error, but no more then one CPU wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) it is 1/hz, for completely idle one it can be up to 0.5s. Second point allows to reduce processing overhead by the cost of error up to 1/hz for long periods ((100/allowed)*(1/hz)), when it is used. To get best possible precision kern.timecounter.alloweddeviation sysctl can be set to smaller value. Setting it to 0 will effectively disable all optimizations, but should give 50us precision in all cases. for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done [test results snipped] I should have posted some information about the test platform... It's a single-processor 700mhz arm chip. There was essentially nothing else running during the tests other than mostly-idle daemons (sshd, ntpd, the usual suspects). Kernel debugging options off (INVARIANTS[_SUPPORT], DIAGNOSTIC, and WITNESS). Some sysctl values of interest... rpi# sysctl kern.timecounter kern.timecounter.fast_gettime: 1 kern.timecounter.tick: 1 kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100)
Re: [RFC/RFT] calloutng
On 31.12.2012 17:02, Ian Lepore wrote: On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote: On 31.12.2012 08:17, Luigi Rizzo wrote: On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. I've rewritten kqueue timeouts at the calloutng_12_26.patch. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. Thank you for testing, Ian. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? All mentioned effects could be explained with implemented logic. 50us at 1us is probably sum of minimal latency of the hardware eventtimer on the specific platform and some software processing overhead (syscall, callout, timecouters, scheduler, etc). At later points system starts to noticeably use precision specified by kern.timecounter.alloweddeviation sysctl. It affects results from two sides: 1) extending intervals for specified percent of time to allow event aggregation, and 2) choosing time base between fast getbinuptime() and precise binuptime(). Extending interval is needed to aggregate not only callouts with each other, but also callouts with other system events, which are impossible to schedule in advance. It gives specified relative error, but no more then one CPU wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) it is 1/hz, for completely idle one it can be up to 0.5s. Second point allows to reduce processing overhead by the cost of error up to 1/hz for long periods ((100/allowed)*(1/hz)), when it is used. To get best possible precision kern.timecounter.alloweddeviation sysctl can be set to smaller value. Setting it to 0 will effectively disable all optimizations, but should give 50us precision in all cases. for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done [test results snipped] I should have posted some information about the test platform... It's a single-processor 700mhz arm chip. There was essentially nothing else running during the tests other than mostly-idle daemons (sshd, ntpd, the usual suspects). Kernel debugging options off (INVARIANTS[_SUPPORT], DIAGNOSTIC, and WITNESS). Some sysctl values of interest... rpi# sysctl kern.timecounter kern.timecounter.fast_gettime: 1 kern.timecounter.tick: 1 kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100) kern.timecounter.hardware: BCM2835 Timecounter kern.timecounter.alloweddeviation: 5
Re: [RFC/RFT] calloutng
On Wed, 2012-12-26 at 21:24 +0200, Alexander Motin wrote: On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. -- Ian for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done With calloutng_12_26.patch... HZ=100 HZ=250 HZ=1000 -- select 1 55.79 1 50.96 1 61.32 poll1 1109.46 1 1107.86 1 1114.38 usleep 1 56.33 1 72.90 1 62.78 nanosleep 1 52.66 1 55.23 1 64.23 kqueue 1 1114.23 1 1113.81 1 1121.21 kqueueto1 65.44 1 71.00 1 75.01 syscall 1 4.70 1 4.45 1 4.55 select300355.79300357.76300362.35 poll 300 1107.85300 1122.55300 1115.62 usleep300355.28300357.28300360.79 nanosleep 300354.49300355.82300360.62 kqueue300 1112.57300 1118.13300 1117.16 kqueueto 300375.98300378.62300395.61 syscall 300 4.41300 4.45300 4.54 select 3000 3246.75 3000 3246.74 3000 3252.72 poll 3000 3238.10 3000 3229.12 3000 3250.10 usleep 3000 3242.47 3000 3237.06 3000 3249.61 nanosleep3000 3238.79 3000 3231.55 3000 3248.11 kqueue 3000 3240.01 3000 3236.07 3000 3247.60 kqueueto 3000 3265.36 3000 3267.22 3000 3274.96 syscall 3000 4.69 3000 4.44 3000 4.50 select 3 31714.60 3 31941.17 3 32467.69 poll3 31522.76 3 31983.00 3 32497.81 usleep 3 31459.67 3 31980.76 3 32458.71 nanosleep
Re: [RFC/RFT] calloutng
On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote: ... I grabbed testsleep.c to test an arm event timer implementation, and had to fix a couple nits... kqueueto was missing from the names[] array, and I had to add a * 1000 to a couple places where usec was stuffed into a timespec's tv_nsec. I also tested the calloutng_12_17 patches and the kqueue stuff behaved very strangely. Then I noticed you had a 12_26 patchset so I tested that (after crudely fixing a couple uninitialized var warnings), and it all looks good on this arm (Raspberry Pi). I'll attach the results. It's so sweet to be able to do precision sleeps. interesting numbers, but there seems to be some problem in computing the exact interval; delays are much larger than expected. In this test, the original timer code used to round to the next multiple of 1 tick and then add another tick (except for the kqueue case), which is exactly what you see in the second set of measurements. The calloutng code however seems to do something odd: in addition to fixed overhead (some 50us, which you can see in the tests for 1us and 300us), all delay seem to be ~10% larger than what is requested, upper bounded to 10ms (note, the numbers are averages so i cannot tell whether all samples are the same or there is some distribution of values). I am not sure if this error is peculiar of the ARM version or also appears on x86/amd64 but I believe it should be fixed. If you look at the results below: 1us possily ok: for very short intervals i would expect some kind of 'reschedule' without actually firing a timer; maybe 50us are what it takes to do a round through the scheduler ? 300us probably ok i guess the extra 50-90us are what it takes to do a round through the scheduler 1000us borderline (this is the case for poll and kqueue, which are rounded to 1ms) here intervals seem to be increased by 10%, and i cannot see a good reason for this (more below). 3000us and above: wrong here again, the intervals seem to be 10% larger than what is requested, perhaps limiting the error to 10-20ms. Maybe the 10% extension results from creating a default 'precision' for legacy calls, but i do not think this is done correctly. First of all, if users do not specify a precision themselves, the automatically generated value should never exceed one tick. Second, the only point of a 'precision' parameter is to merge requests that may be close in time, so if there is already a timer scheduled within [Treq, Treq+precision] i will get it; but if there no pending timer, then one should schedule it for the requested interval. Davide/Alexander, any ideas ? cheers luigi for t in 1 300 3000 3 30 ; do for m in select poll usleep nanosleep kqueue kqueueto syscall ; do ./testsleep $t $m done done With calloutng_12_26.patch... HZ=100 HZ=250 HZ=1000 -- select 1 55.79 1 50.96 1 61.32 poll1 1109.46 1 1107.86 1 1114.38 usleep 1 56.33 1 72.90 1 62.78 nanosleep 1 52.66 1 55.23 1 64.23 kqueue 1 1114.23 1 1113.81 1 1121.21 kqueueto1 65.44 1 71.00 1 75.01 syscall 1 4.70 1 4.45 1 4.55 select300355.79300357.76300362.35 poll 300 1107.85300 1122.55300 1115.62 usleep300355.28300357.28300360.79 nanosleep 300354.49300355.82300360.62 kqueue300 1112.57300 1118.13300 1117.16 kqueueto 300375.98300378.62300395.61 syscall 300 4.41300 4.45300 4.54 select 3000 3246.75 3000 3246.74 3000 3252.72 poll 3000 3238.10 3000 3229.12 3000 3250.10 usleep 3000 3242.47 3000 3237.06 3000 3249.61 nanosleep3000 3238.79 3000 3231.55 3000 3248.11 kqueue 3000 3240.01 3000 3236.07 3000 3247.60 kqueueto 3000 3265.36 3000 3267.22 3000 3274.96 syscall 3000 4.69 3000 4.44 3000 4.50 select 3 31714.60 3 31941.17 3 32467.69 poll3 31522.76 3 31983.00 3 32497.81 usleep 3 31459.67 3 31980.76 3 32458.71 nanosleep 3 31431.02 3 31982.22 3 32525.20 kqueue 3 31466.75 3 31873.90 3 31973.54 kqueueto3 31564.67 3 32522.35 3 32475.59 syscall
Re: [RFC/RFT] calloutng
On 26.12.2012 01:21, Marius Strobl wrote: On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? I don't think build process uses significant number of callouts to affect results directly. I think this additional time could be result of the deeper next event look up, done by the new code, that is practically useless for sparc64, which effectively has no cpu_idle() routine. It wouldn't affect system time and wouldn't show up in any statistics (except PMC or something alike) because it is executed inside timer hardware interrupt handler. If my guess is right, that is a part that probably still could be optimized. I'll look on it. Thanks. Is there anything specific to test? Since the most of code is MI, for sparc64 I would mostly look on related MD parts (eventtimers and timecounters) to make sure they are working reliably in more stressful conditions. I still have some worries about possible deadlock on hardware where IPIs are used to fetch present time from other CPU. Here is small tool we are using for test correctness and performance of different user-level APIs: http://people.freebsd.org/~mav/testsleep.c -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a try on sparc64 and it at least survives a buildworld there. However, with the patched kernels, buildworld times seem to increase slightly but reproducible by 1-2% (I only did four runs but typically buildworld times are rather stable and don't vary more than a minute for the same kernel and source here). Is this an expected trade-off (system time as such doesn't seem to increase)? Is there anything specific to test? Marius ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, Dec 19, 2012 at 11:00:06AM +, Poul-Henning Kamp wrote: In message 50d192e8.3020...@freebsd.org, Alexander Motin writes: Linux uses 32.32 format in their eventtimers code. (And that is no accident, I know who they got the number from :-) But if at some point we want to be able to handle absolute wall time, [...] Then you have other problems, including but not limited to clock being stepped, leap-seconds, suspend/resume and frequency stability. If you want to support callouts of the type (At 14:00 UTC tomorrow) (disregarding the time-zone issue), you need to catch all significant changes to our UTC estimate and recalibrate your callout based on that. It is not obvious that we have applications for such an API that warrant the complexity. Either way, such a facility should be layered on top of the callout facility, which should always run in elapsed time[1] with no attention paid to what NTPD might do to the UTC estimate. POSIX specifies functions that assume such a facility exists, although applications may not care much if we implement them incorrectly. For example, pthread_mutex_timedlock() and sem_timedwait() shall time out when the CLOCK_REALTIME clock reaches the given value, and pthread_cond_timedwait() and clock_nanosleep() (with TIMER_ABSTIME) shall time out when the specified clock reaches the given value. So summary: 32.32 is the right format. Poul-Henning [1] Notice that elapsed time needs a firm definition with respect to suspend/resume, and that this decision has big implications for the API use and code duplication. I think it prudent to specify a flag to callouts, to tell what should happen on suspend/resume, something like: SR_CANCEL /* Cancel the callout on S/R */ /* no flag* /* Toll this callout only when system is running */ SR_IGNORE /* Toll suspended time from callout */ If you get this right, callouts from device drivers will just DTRT, if you get it wrong, all device drivers will need boilerplate code to handle S/R Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC vs CLOCK_UPTIME. -- Jilles Tjoelker ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 2012125025.ga46...@stack.nl, Jilles Tjoelker writes: Either way, such a facility should be layered on top of the callout facility, which should always run in elapsed time[1] with no attention paid to what NTPD might do to the UTC estimate. POSIX specifies functions that assume such a facility exists, although applications may not care much if we implement them incorrectly. It should still be implemented op top of callouts, not as part of: it is an entirely different thing to try to do right. I think it prudent to specify a flag to callouts, to tell what should happen on suspend/resume, something like: SR_CANCEL /* Cancel the callout on S/R */ /* no flag* /* Toll this callout only when system is running */ SR_IGNORE /* Toll suspended time from callout */ If you get this right, callouts from device drivers will just DTRT, if you get it wrong, all device drivers will need boilerplate code to handle S/R Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC vs CLOCK_UPTIME. I have _no_ idea what you are trying to say here... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Fabian Keil freebsd-lis...@fabiankeil.de wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 15:26, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. I reverted the changes to sys/dev/atkbdc/* about an hour ago and so far it's looking good. I'll report back tomorrow after some more testing. Still looking good. Fabian signature.asc Description: PGP signature
Re: [RFC/RFT] calloutng
On 21.12.2012 14:16, Fabian Keil wrote: Fabian Keil freebsd-lis...@fabiankeil.de wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 15:26, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. I reverted the changes to sys/dev/atkbdc/* about an hour ago and so far it's looking good. I'll report back tomorrow after some more testing. Still looking good. Thank you. I think it may be some locking issue in atkbdc code. I'll revert that part of the patch until somebody rewrite it properly. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Every now and then the window manager (i3-wm) changes window focus which could be explained by either phantom keyboard or mouse input, or terminal lines are marked as if the cursor was moved with the left button pressed. The problems happen a couple of times per hour but I haven't been able to intentionally reproduce them. They only seem to occur while I'm moving the cursor, but of course I wouldn't otherwise notice a unresponsive cursor anyway. While the cursor is unresponsive, keyboard input and the rest of the system works as expected as far as I can tell. If I set debug.psm.loglevel=4 I get a psm0: lost interrupt? message once per second when not moving the mouse, however that also happens without the patch and thus might be unrelated. I'm using moused. I'm not sure what additional information is necessary to debug this, so here a bunch of sysctl values that may or may not be relevant: fk@r500 ~ $sysctl kern.eventtimer kern.timecounter kern.eventtimer.et.i8254.flags: 1 kern.eventtimer.et.i8254.frequency: 1193182 kern.eventtimer.et.i8254.quality: 100 kern.eventtimer.et.HPET.flags: 3 kern.eventtimer.et.HPET.frequency: 14318180 kern.eventtimer.et.HPET.quality: 450 kern.eventtimer.et.HPET1.flags: 3 kern.eventtimer.et.HPET1.frequency: 14318180 kern.eventtimer.et.HPET1.quality: 440 kern.eventtimer.et.HPET2.flags: 3 kern.eventtimer.et.HPET2.frequency: 14318180 kern.eventtimer.et.HPET2.quality: 440 kern.eventtimer.et.HPET3.flags: 3 kern.eventtimer.et.HPET3.frequency: 14318180 kern.eventtimer.et.HPET3.quality: 440 kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) HPET3(440) i8254(100) kern.eventtimer.singlemul: 2 kern.eventtimer.idletick: 0 kern.eventtimer.activetick: 1 kern.eventtimer.timer: HPET kern.eventtimer.periodic: 0 kern.timecounter.tc.i8254.mask: 65535 kern.timecounter.tc.i8254.counter: 25970 kern.timecounter.tc.i8254.frequency: 1193182 kern.timecounter.tc.i8254.quality: 0 kern.timecounter.tc.HPET.mask: 4294967295 kern.timecounter.tc.HPET.counter: 3963519587 kern.timecounter.tc.HPET.frequency: 14318180 kern.timecounter.tc.HPET.quality: 950 kern.timecounter.tc.ACPI-fast.mask: 16777215 kern.timecounter.tc.ACPI-fast.counter: 7323739 kern.timecounter.tc.ACPI-fast.frequency: 3579545 kern.timecounter.tc.ACPI-fast.quality: 900 kern.timecounter.tc.TSC.mask: 4294967295 kern.timecounter.tc.TSC.counter: 454465294 kern.timecounter.tc.TSC.frequency: 1995040520 kern.timecounter.tc.TSC.quality: -1000 kern.timecounter.stepwarnings: 0 kern.timecounter.hardware: HPET kern.timecounter.choice: TSC(-1000) ACPI-fast(900) HPET(950) i8254(0) dummy(-100) kern.timecounter.tick: 1 kern.timecounter.fast_gettime: 1 kern.timecounter.invariant_tsc: 1 kern.timecounter.smp_tsc: 0 The system is a Lenovo R500 with a Intel(R) Core(TM)2 Duo CPU T5870 @ 2.00GHz (1995.04-MHz K8-class CPU) Fabian signature.asc Description: PGP signature
Re: [RFC/RFT] calloutng
On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Every now and then the window manager (i3-wm) changes window focus which could be explained by either phantom keyboard or mouse input, or terminal lines are marked as if the cursor was moved with the left button pressed. The problems happen a couple of times per hour but I haven't been able to intentionally reproduce them. They only seem to occur while I'm moving the cursor, but of course I wouldn't otherwise notice a unresponsive cursor anyway. While the cursor is unresponsive, keyboard input and the rest of the system works as expected as far as I can tell. If I set debug.psm.loglevel=4 I get a psm0: lost interrupt? message once per second when not moving the mouse, however that also happens without the patch and thus might be unrelated. I'm using moused. Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Alexander Motin m...@freebsd.org wrote: On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Every now and then the window manager (i3-wm) changes window focus which could be explained by either phantom keyboard or mouse input, or terminal lines are marked as if the cursor was moved with the left button pressed. The problems happen a couple of times per hour but I haven't been able to intentionally reproduce them. They only seem to occur while I'm moving the cursor, but of course I wouldn't otherwise notice a unresponsive cursor anyway. While the cursor is unresponsive, keyboard input and the rest of the system works as expected as far as I can tell. If I set debug.psm.loglevel=4 I get a psm0: lost interrupt? message once per second when not moving the mouse, however that also happens without the patch and thus might be unrelated. I'm using moused. Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. I reverted the changes to sys/dev/atkbdc/* about an hour ago and so far it's looking good. I'll report back tomorrow after some more testing. Fabian signature.asc Description: PGP signature
Re: [RFC/RFT] calloutng
On 20.12.2012 15:26, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Every now and then the window manager (i3-wm) changes window focus which could be explained by either phantom keyboard or mouse input, or terminal lines are marked as if the cursor was moved with the left button pressed. The problems happen a couple of times per hour but I haven't been able to intentionally reproduce them. They only seem to occur while I'm moving the cursor, but of course I wouldn't otherwise notice a unresponsive cursor anyway. While the cursor is unresponsive, keyboard input and the rest of the system works as expected as far as I can tell. If I set debug.psm.loglevel=4 I get a psm0: lost interrupt? message once per second when not moving the mouse, however that also happens without the patch and thus might be unrelated. I'm using moused. Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. I reverted the changes to sys/dev/atkbdc/* about an hour ago and so far it's looking good. I'll report back tomorrow after some more testing. Thank you for the report. If it will be fine. you can try to reapply that part of the patch, just changing line: callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0)); to the: callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0)); It should about to restore original polling interval, but still make it more flexible then original. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Alexander Motin m...@freebsd.org wrote: On 20.12.2012 15:26, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: On 20.12.2012 12:56, Fabian Keil wrote: Alexander Motin m...@freebsd.org wrote: Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With this patch (and the previous one, I didn't test the others) my mouse cursor is occasionally not reacting for short amounts of time (less than a second, but long enough to be noticeable). Could you try to revert part of the patch, related to dev/atkbdc? I am not strong in details of that hardware, but in comments there mention that they are related. May be lost keyboard interrupts (which polling rate was increased to 1 second) cause PS/2 mouse delays. I reverted the changes to sys/dev/atkbdc/* about an hour ago and so far it's looking good. I'll report back tomorrow after some more testing. Thank you for the report. If it will be fine. you can try to reapply that part of the patch, just changing line: callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0)); to the: callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0)); It should about to restore original polling interval, but still make it more flexible then original. With this change the mouse stops working completely after moving the cursor a couple of pixels, at which point dmesg shows: kbdc: TEST_AUX_PORT status: kbdc: RESET_AUX return code:00fa kbdc: RESET_AUX status:00aa kbdc: RESET_AUX ID: Using the keyboard frequently causes duplicated characters. I'm aware of a problem in vanilla CURRENT where the mouse stops working with similar looking messages when the cursor is being moved at the wrong moment, but this happens less than once a week and I haven't been able to track it down yet. It's possible that it's the same bug and your patch just triggers it more reliable (two times in a row). I haven't seen the duplicated-characters issue before, though. Just for kicks I also tried: callout_reset_flags(sc-callout, hz/100, atkbdtimeout, dev, C_PRELSET(0)); but with this modification I can't even enter my login password and thus can't tell if the mouse would work. Fabian signature.asc Description: PGP signature
Re: API explosion (Re: [RFC/RFT] calloutng)
On 19.12.2012 01:37, Ian Lepore wrote: On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. ... I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It would typical to want a 500uS timeout but be willing to late by up to 250uS if i said k is signed so negative values represent fractions of a second. 2^-128 is pretty short :) Ahh, I missed that. Good enough then! Hmmm, if that ideas survives further review, then could precision be encoded in 8 bits of the flags, eliminating another parm? Those who tracked the branch could see that actually was our first approach to handle precision. Unfortunately, it appeared not very convenient when you need to convert relative precision in percents or fraction of interval to the absolute precision in power of 2. We were worried that using some ffsll() for it can be inconvenient or expensive. But since we are now talking about passing relative bintime as an argument, that may be more viable option. I'll make another try. Thanks for the input. Pity it didn't happen couple of months ago. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes : On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It is a bad idea, and it is physically pointless, given the stabilities of the timebases available for computers in general. Please just take my word as a time-nut, and use a 32.32 binary format in seconds (see previous email) and be done with it. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote: In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes : On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It is a bad idea, and it is physically pointless, given the stabilities of the timebases available for computers in general. Please just take my word as a time-nut, and use a 32.32 binary format in seconds (see previous email) and be done with it. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code duplication for all the basic conversion/math operation. On the other hand, 32.32 may not be enough in the long future. What do you think about that? Thanks, Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On 19.12.2012 12:03, Davide Italiano wrote: On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote: In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes : On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It is a bad idea, and it is physically pointless, given the stabilities of the timebases available for computers in general. Please just take my word as a time-nut, and use a 32.32 binary format in seconds (see previous email) and be done with it. Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code duplication for all the basic conversion/math operation. On the other hand, 32.32 may not be enough in the long future. What do you think about that? Linux uses 32.32 format in their eventtimers code. Respecting that now we have no any timer hardware with frequency about 4GHz, that precision is probably sufficient. But if at some point we want to be able to handle absolute wall time, then 32bit integer part may be not enough. Then we return to the question: how many different data types do we want to see in one subsystem? Sure, using single 64bit value would be much easier then struct bintime from many perspectives, but what's about the edge cases? -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com , Davide Italiano writes: Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. As I said in my previous email: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) (Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 d2) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 50d192e8.3020...@freebsd.org, Alexander Motin writes: Linux uses 32.32 format in their eventtimers code. (And that is no accident, I know who they got the number from :-) But if at some point we want to be able to handle absolute wall time, [...] Then you have other problems, including but not limited to clock being stepped, leap-seconds, suspend/resume and frequency stability. If you want to support callouts of the type (At 14:00 UTC tomorrow) (disregarding the time-zone issue), you need to catch all significant changes to our UTC estimate and recalibrate your callout based on that. It is not obvious that we have applications for such an API that warrant the complexity. Either way, such a facility should be layered on top of the callout facility, which should always run in elapsed time[1] with no attention paid to what NTPD might do to the UTC estimate. So summary: 32.32 is the right format. Poul-Henning [1] Notice that elapsed time needs a firm definition with respect to suspend/resume, and that this decision has big implications for the API use and code duplication. I think it prudent to specify a flag to callouts, to tell what should happen on suspend/resume, something like: SR_CANCEL /* Cancel the callout on S/R */ /* no flag* /* Toll this callout only when system is running */ SR_IGNORE /* Toll suspended time from callout */ If you get this right, callouts from device drivers will just DTRT, if you get it wrong, all device drivers will need boilerplate code to handle S/R -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com , Davide Italiano writes: Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. So you always agreed with me that bintimes are unsuitable for almost everything, and especially unsuitable for timeouts? :-) (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) bintimes are not unsuitable for timekeeping, but they a painful to use for other APIs. You have to either put bintimes in layers in the other APIs, or convert them to a more suitable format, and there is a problem placing the conversion at points where it is efficient. This thread seems to be mostly about putting the conversion in wrong places. My original objection was about using bintimes for almost everything at the implementation level. I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks = 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. As I said in my previous email: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) (Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) Constructing the cookie for my special case would not be so easy. With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 d2) This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes: With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 d2) This would make a better general format than timevals, timespecs and of course bintimes :-). Except that for absolute timescales, we're running out of the 32 bits integer part. Bintimes is a necessary superset of the 32.32 which tries to work around the necessary but missing int96_t or int128_t[1]. Poul-Henning [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote: On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com , Davide Italiano writes: Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. So you always agreed with me that bintimes are unsuitable for almost everything, and especially unsuitable for timeouts? :-) (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) bintimes are not unsuitable for timekeeping, but they a painful to use for other APIs. You have to either put bintimes in layers in the other APIs, or convert them to a more suitable format, and there is a problem placing the conversion at points where it is efficient. This thread seems to be mostly about putting the conversion in wrong places. My original objection was about using bintimes for almost everything at the implementation level. I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks = 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. Bruce, I don't really agree with this. The data addressed by cookie should be still stored somewhere, and KBI will result broken. This, indeed, is not real problem as long as current calloutng code heavily breaks KBI, but if that was your point, I don't see how your proposed change could help. As I said in my previous email: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) (Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) Constructing the cookie for my special case would not be so easy. With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 d2) This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. Bruce I think the phk's proposal is better. About your overflow objection, I think is really unlikely to happen, but better safe than sorry. Thanks Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes: On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: Except that for absolute timescales, we're running out of the 32 bits integer part. Except 32 bit time_t works until 2106 if it is unsigned. That's sort of not an option. The real problem was that time_t was not defined as a floating point number. [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. You can say that about anything but CPU-native operations, and I doubt it would be as inefficient as struct bintime, which does not have access to the carry bit. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes: With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with if (d1 d2) This would make a better general format than timevals, timespecs and of course bintimes :-). Except that for absolute timescales, we're running out of the 32 bits integer part. Except 32 bit time_t works until 2106 if it is unsigned. Bintimes is a necessary superset of the 32.32 which tries to work around the necessary but missing int96_t or int128_t[1]. [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote: In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com , Davide Italiano writes: Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) I do not really think it worth to create another structure for handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. As I said in my previous email: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) (Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) only thing, we must be careful with the parentheses For instance, in your macro, DURNSEC evaluates to 0 and so does any multiple of it. We should define them as #define DURNSEC DURSEC / 100 ... so DURNSEC is still 0 and 500*DURNSEC gives 214 I am curious that Bruce did not mention this :) (btw the typedef is swapped, should be typedef int64_t dur_t) cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 20121219150809.ga98...@onelab2.iet.unipi.it, Luigi Rizzo writes: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) only thing, we must be careful with the parentheses Actually, it's more impportant to be careful with zeros, if you adjust the above to the correct number of zeros, DURNSEC is 4, which is within seven percent of the correct value. (btw the typedef is swapped, should be typedef int64_t dur_t) Yes, I'm trying to find out of people even listen to me :-) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On 19.12.2012 16:20, Bruce Evans wrote: On Wed, 19 Dec 2012, Davide Italiano wrote: On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote: I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks = 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. Bruce, I don't really agree with this. The data addressed by cookie should be still stored somewhere, and KBI will result broken. This, indeed, is not real problem as long as current calloutng code heavily breaks KBI, but if that was your point, I don't see how your proposed change could help. In the old API, it is an error to pass ticks 0, so only broken old callers are affected. Of course, if there are any then it would be hard to detect their garbage cookies. Anywy, it's too later to change to this, and maybe also to a 32.32 format. It would be late to change this after committing. I would definitely like it to be done earlier to not redo all the tests, but I think we could convert callout and eventtimers code to 32.32 format in several days. The only two questions are: do we really want it (won't there be any reasons to regret about it) and how do we want it to look? For the first question my personal showstopper since eventtimers creation always was the wish to keep consistency. But benefits of 32.32 format are clear, and if there are more votes for it, let's consider. If now it will be decided that full range will never be useful for callout subsystem, then it is obviously not needed for eventtimers also. About the second question, what do you think about such prototypes: typedef int64_t sbintime_t static __inline sbintime_t bintime_shrink(struct bintime *bt) {} static __inline struct bintime bintime_expand(sbintime_t sbt) {} ... int callout_reset_bt(struct callout *, sbintime_t sbt, sbintime_t pr, void (*fn)(void *), void *arg, int flags); , where pr used only for absolute precision, and flags includes: direct execution, absolute/relative time in argument, relative precision in case of relative sbt, flag for aligning to hardclock() to emulate legacy behavior, and potentially flags for reaction on suspend/resume. Another option is to move absolute precision also to flags, using log2() representation, as we tried and as was proposed before. With possibility to use precise relative time there will be few cases requiring absolute value of precision, that should depend on period. Then there will be no extra arguments in the most usual cases. Wrapper for existing API compatibility will look just like this: #define callout_reset(c, ticks, fn, arg)\ callout_reset_bt(c, ticks2sbintime(ticks), -1, \ (fn), (arg), C_HARDCLOCK) -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
dropping phk _AT_ onelab2 _DOT_ something from CC as long as it doesn't seem a valid mail address and I'm annoyed mails bounce back. On Wed, Dec 19, 2012 at 6:20 AM, Bruce Evans b...@optusnet.com.au wrote: On Wed, 19 Dec 2012, Davide Italiano wrote: On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote: I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks = 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. Bruce, I don't really agree with this. The data addressed by cookie should be still stored somewhere, and KBI will result broken. This, indeed, is not real problem as long as current calloutng code heavily breaks KBI, but if that was your point, I don't see how your proposed change could help. In the old API, it is an error to pass ticks 0, so only broken old callers are affected. Of course, if there are any then it would be hard to detect their garbage cookies. Anywy, it's too later to change to this, and maybe also to a 32.32 format. [32.32 format] It's not too late. What I'd like to do, right now people got interested in the problem is agreeing on the interface used. Following this thread, as I've already discussed to mav@, we would like to decide what of the two is better: - specify precision as additional argument (as we're doing right now) - use 'flags' argument If we allow time argument to be relative and not absolute, as suggested by luigi@, we can easily use relative precision where we had to use ffl() before. This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. I think the phk's proposal is better. About your overflow objection, I think is really unlikely to happen, but better safe than sorry. It's very easy for applications to cause kernel overflow using valid syscall args like tv_sec = TIME_T_MAX for a relative time in nanosleep(). Adding TIME_T_MAX to the current time in seconds overflow for all current times except for the first second after the Epoch. There is no difference between the overflow for 32-bit and 64-bit time_t's for this. This is now mostly handled so that the behaviour is harmless although wrong. E.g., the timeout might become negative, and then since it is not a cookie it is silently replaced by a timeout of 1 tick. In nanosleep(), IIRC there are further overflows that result in returning early instead of retrying the 1-tick timeouts endlessly. Bruce Thanks, Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Davide Italiano wrote: On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote: I would have tried a 32 bit format with a variable named 'ticks'. Something like: - ticks = 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days. - ticks 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used. I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used. Bruce, I don't really agree with this. The data addressed by cookie should be still stored somewhere, and KBI will result broken. This, indeed, is not real problem as long as current calloutng code heavily breaks KBI, but if that was your point, I don't see how your proposed change could help. In the old API, it is an error to pass ticks 0, so only broken old callers are affected. Of course, if there are any then it would be hard to detect their garbage cookies. Anywy, it's too later to change to this, and maybe also to a 32.32 format. [32.32 format] This would make a better general format than timevals, timespecs and of course bintimes :-). It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. I think the phk's proposal is better. About your overflow objection, I think is really unlikely to happen, but better safe than sorry. It's very easy for applications to cause kernel overflow using valid syscall args like tv_sec = TIME_T_MAX for a relative time in nanosleep(). Adding TIME_T_MAX to the current time in seconds overflow for all current times except for the first second after the Epoch. There is no difference between the overflow for 32-bit and 64-bit time_t's for this. This is now mostly handled so that the behaviour is harmless although wrong. E.g., the timeout might become negative, and then since it is not a cookie it is silently replaced by a timeout of 1 tick. In nanosleep(), IIRC there are further overflows that result in returning early instead of retrying the 1-tick timeouts endlessly. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes: On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: Except that for absolute timescales, we're running out of the 32 bits integer part. Except 32 bit time_t works until 2106 if it is unsigned. That's sort of not an option. I think it is. It is just probably not necessary since 32-bit systems will go away before 2038. The real problem was that time_t was not defined as a floating point number. That would be convenient too, but bad for efficiency on some systems. Kernels might not be able to use it, and then would have to use an alternative representation, which they should have done all along. [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. You can say that about anything but CPU-native operations, and I doubt it would be as inefficient as struct bintime, which does not have access to the carry bit. Yes, I would say that about non-native. It goes against the spirit of C. OTOH, compilers are getting closer to giving full access to the carry bit. I just checked what clang does in a home-made 128-bit add function: % static void __noinline % uadd(struct u *xup, struct u *yup) % { % unsigned long long t; % % t = xup-w[0] + yup-w[0]; % if (t xup-w[0]) % xup-w[1]++; % xup-w[0] = t; % xup-w[1] += yup-w[1]; % } % % .align 16, 0x90 % .type uadd,@function % uadd: # @uadd % .cfi_startproc % # BB#0: # %entry % movq(%rdi), %rcx % movq8(%rdi), %rax % addq(%rsi), %rcx gcc generates an additional cmpq instruction here. % jae .LBB2_2 clang uses the carry bit set by the first addition to avoid the comparison, but still branches. % # BB#1: # %if.then % incq%rax % movq%rax, 8(%rdi) This adds 1 explicitly instead of using adcq, but this is the slow path. % .LBB2_2:# %if.end % movq%rcx, (%rdi) % addq8(%rsi), %rax This is as efficient as possible except for the extra branch, and the branch is almost perfectly predictable. % movq%rax, 8(%rdi) % ret % .Ltmp22: % .size uadd, .Ltmp22-uadd % .cfi_endproc Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
In message 50d1e0d8.9070...@freebsd.org, Alexander Motin writes: It would be late to change this after committing. I would definitely like it to be done earlier to not redo all the tests, but I think we could convert callout and eventtimers code to 32.32 format in several days. The only two questions are: do we really want it (won't there be any reasons to regret about it) and how do we want it to look? As much as it pains me to raise this point, we would regret it if we did not use 32.32, because Linux already went that way. As much as there is to be said for doing things right, we should also try to avoid pointless incompatibilities which will make it needlessly hard for people to move code, particular device drivers forth and back. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
I finally remembered to remove the .it phk :-). On Wed, 19 Dec 2012, Luigi Rizzo wrote: On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote: ... As I said in my previous email: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 100) (Bikeshed the names at your convenience) Then you can say callout_foo(34 * DURSEC) callout_foo(2400 * DURMSEC) or callout_foo(500 * DURNSEC) only thing, we must be careful with the parentheses For instance, in your macro, DURNSEC evaluates to 0 and so does any multiple of it. We should define them as #define DURNSEC DURSEC / 100 ... so DURNSEC is still 0 and 500*DURNSEC gives 214 I am curious that Bruce did not mention this :) Er, he was careful. DURNSEC gives 4, not 0. This is not very accurate, but probably good enough. Your version without parentheses is not so careful and depends on a magic order of operations and no overflow from this. E.g.: 500*DURNSEC = 500*DURSEC / 10 = 500*((dur_t)1 32) / 10 This is very accurate and happens not to overflow. But 5 seconds represented a little strangely in nanoseconds would overflow: 50*DURNSEC = 50*((dur_t)1 32) / 10 So would 5 billion times DURSEC, but 5 billion seconds is more unreasobable than 5 billion nanoseconds and the format just can't represent that. (btw the typedef is swapped, should be typedef int64_t dur_t) Didn't notice this. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Experiments with dummynet shown ineffective support for very short tick-based callouts. New version fixes that, allowing to get as many tick-based callout events as hz value permits, while still be able to aggregate events and generating minimum of interrupts. Also this version modifies system load average calculation to fix some cases existing in HEAD and 9 branches, that could be fixed with new direct callout functionality. http://people.freebsd.org/~mav/calloutng_12_17.patch With several important changes made last time I am going to delay commit to HEAD for another week to do more testing. Comments and new test cases are welcome. Thanks for staying tuned and commenting. On 17.12.2012 01:37, Alexander Motin wrote: Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Changes in this version: -- Removed couple of redundant variables in callout implementation, that reduced sizeof(struct callout) by two pointers and simplified some internal code. -- syscons driver was made to schedule only 1-2 callouts per second instead of 20-30 before when console is in graphical mode and there are few other things to do. Now my laptop has only about 30 interrupts per second total during idle periods with X running. -- i8254 eventtimer driver was optimized to work faster in disabled by default one-shot mode. -- Few kernel functions were added to make KPIs more complete. -- Man pages were updated. -- Some style fixes were made. On 15.12.2012 18:55, Alexander Motin wrote: I'm sorry to interrupt review, but as usual good ideas came during the final testing, causing another round. :) Here is updated patch for HEAD, that includes several new changes: http://people.freebsd.org/~mav/calloutng_12_15.patch The new changes are: -- Precision and event aggregation code was reworked. Instead of previous -prec/+prec representation, precision is now single-sided -- -0/+prec. It allowed to significantly improve precision on long time intervals for APIs which imply that event should not happen before the specified time. Depending on CPU activity, mistake for long time intervals now will never be more then 1-500ms, even if specified precision allows more. -- Some minor optimizations were made to reduce callout overhead and latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer and TSC timecounter usleep(1) call from user-level executes in just 5-6us, instead of 7-8us before. Now it can do 180K cycles per second on single CPU with only partial CPU load. -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd, setrlimit) were modified to reduce number of interrupts, also with event aggregation by explicit specification of the acceptable events precision. Now my Core2Duo test system has only 30 interrupts per second in idle. If not remaining syscons events, it could easily be 15. My IvyBridge ultrabook first time in its history shown 5.5 hours of battery time with full screen brightness and 10 hours with lid closed. -- Some kernel functions were added to make KPIs more complete. I've successfully tested this patch on amd64 and arm. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
On Mon, Dec 17, 2012 at 01:22:59PM -0800, Adrian Chadd wrote: Personally, I'd rather see some consistently used units here.. bintime (or something similar) is the correct choice here. If we are concerned about the size (128 bit) then we can map it to a shorter, fixed point format, such as sign+31+32 as phk was suggesting. cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: Hi. I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. I am afraid the above is not convincing. Most/all of the APIs i mentioned still have the conversion from ticks to bintime, and the code in your patch is just building multiple parallel paths (one for each of the three versions of the same function) to some final piece of code where the conversion takes place. The problem is that all of this goes through a set of obfuscating macros and the end result is horrible. To be clear, i believe the work you have been doing on cleaning up callout is great, i am just saying that this is the time to look at the code from a few steps away and clean up all those design decisions that perhaps were made in a haste to make things work. I will give you another example to show how convoluted is the code now: cv_timedwait() and cv_timedwait_sig() now have three versions each (plain, bt, flags). These six are remapped through macros to two functions, _cv_timedwait() and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() maps to _cv_timedwait_sig() ) These two _cv_timedwait*() take both ticks and bintimes, and contain this sequence: + if (bt == NULL) + sleepq_set_timeout_flags(cvp, timo, flags); + else + sleepq_set_timeout_bt(cvp, bt, precision); Guess what, both sleepq_* are macros that remap to the same _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless. But then if you dig into _sleepq_set_timeout() you'll see + if (bt == NULL) + callout_reset_flags_on(td-td_slpcallout, timo, + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); + else + callout_reset_bt_on(td-td_slpcallout, bt, precision, + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); and again both callout_reset*() are again remapped through macros to _callout_reset_on(), so another useless if (bt == NULL) And in the end you have the conversion from ticks to bintime. So basically the code path for cv_timedwait() has those two useless switches and one useless extra argument, and the conversion from ticks to bintime is down deep down in _callout_reset_on() where it can only be resolved at runtime, whereas by doing the conversion at the beginning the decision could have been made at compile time. So I believe my proposal would give large simplifications in the code and lead to a much cleaner implementation of what you have designed: 1. acknowledge the fact that the only representation of time that callouts use internally is a bintime+precision, define one single function (instead of two or three or six) that implements the blessed API, and implement the others with macros or inline functions doing the appropriate conversions; 2. specifically, the *_flags() variant has no reason to exist. It can be implemented through the *_bt() variant, and being a new function the only places where you introduce it require manual modifications so you can directly invoke the new function. Again, please take this as constructive criticism, as i really like the work you have been doing and appreciate the time and effort you are putting on it cheers luigi Creating sets of three functions I had three different goals: - callout_reset() -- it is legacy variant required to keep API compatibility; - callout_reset_flags() -- it is for cases where custom precision specification needs to be added to the existing code, or where direct callout execution is needed. Conversion to bintime would additionally complicate consumer code, that I would try to avoid. - callout_reset_bt() -- API for the new code, which needs high precision and doesn't mind to operate bintime. Now there is only three such places in kernel now, and I don't think there will be much more. Respectively, these three options are replicated to other APIs where time intervals are used. PS: Please keep me in CC. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Tue, Dec 18, 2012 at 06:36:43PM +0100, Luigi Rizzo wrote: On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: ... So I believe my proposal would give large simplifications in the code and lead to a much cleaner implementation of what you have designed: 1. acknowledge the fact that the only representation of time that callouts use internally is a bintime+precision, define one single function (instead of two or three or six) that implements the blessed API, and implement the others with macros or inline functions doing the appropriate conversions; 2. specifically, the *_flags() variant has no reason to exist. It can be implemented through the *_bt() variant, and being a new function the only places where you introduce it require manual modifications so you can directly invoke the new function. to clarify: i am not sure if now the *_bt() variant takes flags too, but my point is that the main API function should take all supported arguments (including flags) and others should simply be regarded as simplified versions. More or less what we have for sockets, with send() and sendmsg() and friend cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On 18.12.2012 19:36, Luigi Rizzo wrote: On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. I am afraid the above is not convincing. Most/all of the APIs i mentioned still have the conversion from ticks to bintime, and the code in your patch is just building multiple parallel paths (one for each of the three versions of the same function) to some final piece of code where the conversion takes place. The problem is that all of this goes through a set of obfuscating macros and the end result is horrible. To be clear, i believe the work you have been doing on cleaning up callout is great, i am just saying that this is the time to look at the code from a few steps away and clean up all those design decisions that perhaps were made in a haste to make things work. I will give you another example to show how convoluted is the code now: cv_timedwait() and cv_timedwait_sig() now have three versions each (plain, bt, flags). These six are remapped through macros to two functions, _cv_timedwait() and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() maps to _cv_timedwait_sig() ) These two _cv_timedwait*() take both ticks and bintimes, and contain this sequence: + if (bt == NULL) + sleepq_set_timeout_flags(cvp, timo, flags); + else + sleepq_set_timeout_bt(cvp, bt, precision); Guess what, both sleepq_* are macros that remap to the same _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless. But then if you dig into _sleepq_set_timeout() you'll see + if (bt == NULL) + callout_reset_flags_on(td-td_slpcallout, timo, + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); + else + callout_reset_bt_on(td-td_slpcallout, bt, precision, + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); and again both callout_reset*() are again remapped through macros to _callout_reset_on(), so another useless if (bt == NULL) And in the end you have the conversion from ticks to bintime. So basically the code path for cv_timedwait() has those two useless switches and one useless extra argument, and the conversion from ticks to bintime is down deep down in _callout_reset_on() where it can only be resolved at runtime, whereas by doing the conversion at the beginning the decision could have been made at compile time. So I believe my proposal would give large simplifications in the code and lead to a much cleaner implementation of what you have designed: 1. acknowledge the fact that the only representation of time that callouts use internally is a bintime+precision, define one single function (instead of two or three or six) that implements the blessed API, and implement the others with macros or inline functions doing the appropriate conversions; 2. specifically, the *_flags() variant has no reason to exist. It can be implemented through the *_bt() variant, and being a new function the only places where you introduce it require manual modifications so you can directly invoke the new function. Again, please take this as constructive criticism, as i really like the work you have been doing and appreciate the time and effort you are putting on it Your words about useless cascaded ifs touched me. Actually I've looked on _callout_reset_bt_on() yesterday, thinking about moving tick to bt conversion to separate function or wrapper. The only thing we would save in such case is single integer argument (ticks), as all others (bt, prec, flags) are used in the new world order. From the other side, to make the conversion process really effective and correct, I've used quite specific way of obtaining time, that may change in the future. I would not really like it to be inlined in every consumer function and become an ABI. So I see two possible ways: make that conversion a separate non-inline function (that will require two temporary variables to return results and will consume some time on call/return), or make callout_reset_bt_on() to have extra ticks argument, allowing to use it in one or another way without external ifs and macros. In last case all _bt functions in other APIs will also obtain ticks, bt, pr and flags arguments. Actually flags there could be used to specify time scale (monotonic or wall) and time base (relative or absolute), if we
Re: API explosion (Re: [RFC/RFT] calloutng)
On 18.12.2012 20:03, Alexander Motin wrote: On 18.12.2012 19:36, Luigi Rizzo wrote: On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. I am afraid the above is not convincing. Most/all of the APIs i mentioned still have the conversion from ticks to bintime, and the code in your patch is just building multiple parallel paths (one for each of the three versions of the same function) to some final piece of code where the conversion takes place. The problem is that all of this goes through a set of obfuscating macros and the end result is horrible. To be clear, i believe the work you have been doing on cleaning up callout is great, i am just saying that this is the time to look at the code from a few steps away and clean up all those design decisions that perhaps were made in a haste to make things work. I will give you another example to show how convoluted is the code now: cv_timedwait() and cv_timedwait_sig() now have three versions each (plain, bt, flags). These six are remapped through macros to two functions, _cv_timedwait() and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() maps to _cv_timedwait_sig() ) These two _cv_timedwait*() take both ticks and bintimes, and contain this sequence: +if (bt == NULL) +sleepq_set_timeout_flags(cvp, timo, flags); +else +sleepq_set_timeout_bt(cvp, bt, precision); Guess what, both sleepq_* are macros that remap to the same _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless. But then if you dig into _sleepq_set_timeout() you'll see +if (bt == NULL) +callout_reset_flags_on(td-td_slpcallout, timo, +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); +else +callout_reset_bt_on(td-td_slpcallout, bt, precision, +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); and again both callout_reset*() are again remapped through macros to _callout_reset_on(), so another useless if (bt == NULL) And in the end you have the conversion from ticks to bintime. So basically the code path for cv_timedwait() has those two useless switches and one useless extra argument, and the conversion from ticks to bintime is down deep down in _callout_reset_on() where it can only be resolved at runtime, whereas by doing the conversion at the beginning the decision could have been made at compile time. So I believe my proposal would give large simplifications in the code and lead to a much cleaner implementation of what you have designed: 1. acknowledge the fact that the only representation of time that callouts use internally is a bintime+precision, define one single function (instead of two or three or six) that implements the blessed API, and implement the others with macros or inline functions doing the appropriate conversions; 2. specifically, the *_flags() variant has no reason to exist. It can be implemented through the *_bt() variant, and being a new function the only places where you introduce it require manual modifications so you can directly invoke the new function. Again, please take this as constructive criticism, as i really like the work you have been doing and appreciate the time and effort you are putting on it Your words about useless cascaded ifs touched me. Actually I've looked on _callout_reset_bt_on() yesterday, thinking about moving tick to bt conversion to separate function or wrapper. The only thing we would save in such case is single integer argument (ticks), as all others (bt, prec, flags) are used in the new world order. From the other side, to make the conversion process really effective and correct, I've used quite specific way of obtaining time, that may change in the future. I would not really like it to be inlined in every consumer function and become an ABI. So I see two possible ways: make that conversion a separate non-inline function (that will require two temporary variables to return results and will consume some time on call/return), or make callout_reset_bt_on() to have extra ticks argument, allowing to use it in one or another way without external ifs and macros. In last case all _bt functions in other APIs will also obtain ticks, bt, pr and flags arguments. Actually flags there could be used to specify time scale (monotonic or wall) and time base (relative or absolute), if we decide to implement all
Re: API explosion (Re: [RFC/RFT] calloutng)
[top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. The key difference between 'ticks' and bintimes (and the main difficulty in the conversion) is that ticks are relative and bintimes are interpreted as absolute. This could be easily solved by using a flag to specify if the 'bt' argument is absolute or relative, and passing the argument by value. So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the precision, and another 64 or 128 bit field contains the bintime. How does this look ? cheers luigi On 18.12.2012 20:03, Alexander Motin wrote: On 18.12.2012 19:36, Luigi Rizzo wrote: On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. I am afraid the above is not convincing. Most/all of the APIs i mentioned still have the conversion from ticks to bintime, and the code in your patch is just building multiple parallel paths (one for each of the three versions of the same function) to some final piece of code where the conversion takes place. The problem is that all of this goes through a set of obfuscating macros and the end result is horrible. To be clear, i believe the work you have been doing on cleaning up callout is great, i am just saying that this is the time to look at the code from a few steps away and clean up all those design decisions that perhaps were made in a haste to make things work. I will give you another example to show how convoluted is the code now: cv_timedwait() and cv_timedwait_sig() now have three versions each (plain, bt, flags). These six are remapped through macros to two functions, _cv_timedwait() and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() maps to _cv_timedwait_sig() ) These two _cv_timedwait*() take both ticks and bintimes, and contain this sequence: +if (bt == NULL) +sleepq_set_timeout_flags(cvp, timo, flags); +else +sleepq_set_timeout_bt(cvp, bt, precision); Guess what, both sleepq_* are macros that remap to the same _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless. But then if you dig into _sleepq_set_timeout() you'll see +if (bt == NULL) +callout_reset_flags_on(td-td_slpcallout, timo, +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); +else +callout_reset_bt_on(td-td_slpcallout, bt, precision, +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); and again both callout_reset*() are again remapped through macros to _callout_reset_on(), so another useless if (bt == NULL) And in the end you have the conversion from ticks to bintime. So basically the code path for cv_timedwait() has those two useless switches and one useless extra argument, and the conversion from ticks to bintime is down deep down in _callout_reset_on() where it can only be resolved at runtime, whereas by doing the conversion at the beginning the decision could have been made at compile time. So I believe my proposal would give large simplifications in the code and lead to a much cleaner implementation of what you have designed: 1. acknowledge the fact that the only representation of time that callouts use internally is a bintime+precision, define one single function (instead of two or three or six) that implements the blessed API, and implement the others with macros or inline functions doing the appropriate conversions; 2. specifically, the *_flags() variant has no reason to exist. It can be implemented through the *_bt() variant, and being a new function the only places where you introduce it require manual modifications so you can
Re: API explosion (Re: [RFC/RFT] calloutng)
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. The key difference between 'ticks' and bintimes (and the main difficulty in the conversion) is that ticks are relative and bintimes are interpreted as absolute. This could be easily solved by using a flag to specify if the 'bt' argument is absolute or relative, and passing the argument by value. So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the precision, and another 64 or 128 bit field contains the bintime. How does this look ? cheers luigi I tend to agree that the bintime should be passed by value instead of reference. That would allow an inline tickstobintime() that converts relative ticks to an absolute bintime returned by value and passed right along in one tidy line/clump of code without any temporary variables cluttering things up. While the 1980s C programmer in me still wants to avoid returning complex objects by value, the reality is that modern compilers tend to generate really nice code for such constructs, usually without any copying of the return value at all. I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It would typical to want a 500uS timeout but be willing to late by up to 250uS if that helped scheduling and performance. Also, my idea of precision would virtually always be I'm willing to be late by this much, but never early by any amount. To reinforce the point of that last paragraph... the way I'm looking at these changes has nothing to do with power saving (I've never owned a battery-operated computer, probably never will) and everything to do with performance and being able to sleep accurately for less than a tick. -- Ian On 18.12.2012 20:03, Alexander Motin wrote: On 18.12.2012 19:36, Luigi Rizzo wrote: On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. I am afraid the above is not convincing. Most/all of the APIs i mentioned still have the conversion from ticks to bintime, and the code in your patch is just building multiple parallel paths (one for each of the three versions of the same function) to some final piece of code where the conversion takes place. The problem is that all of this goes through a set of obfuscating macros and the end result is horrible. To be clear, i believe the work you have been doing on cleaning up callout is great, i am just saying that this is the time to look at the code from a few steps away and clean up all those design decisions that perhaps were made in a haste to make things work. I will give you another example to show how convoluted is the code now: cv_timedwait() and cv_timedwait_sig() now have three versions each (plain, bt, flags). These six are remapped through macros to two functions, _cv_timedwait() and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() maps to _cv_timedwait_sig() ) These two _cv_timedwait*() take both ticks and bintimes, and contain this sequence: +if (bt == NULL) +sleepq_set_timeout_flags(cvp, timo, flags); +else +sleepq_set_timeout_bt(cvp, bt, precision); Guess what, both sleepq_* are macros that remap to the same _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless. But then if you dig into _sleepq_set_timeout() you'll see +if (bt == NULL) +callout_reset_flags_on(td-td_slpcallout, timo, +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); +else +
Re: API explosion (Re: [RFC/RFT] calloutng)
On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. ... I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It would typical to want a 500uS timeout but be willing to late by up to 250uS if i said k is signed so negative values represent fractions of a second. 2^-128 is pretty short :) cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. ... I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It would typical to want a 500uS timeout but be willing to late by up to 250uS if i said k is signed so negative values represent fractions of a second. 2^-128 is pretty short :) cheers luigi Ahh, I missed that. Good enough then! Hmmm, if that ideas survives further review, then could precision be encoded in 8 bits of the flags, eliminating another parm? -- Ian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
On Tue, Dec 18, 2012 at 04:37:10PM -0700, Ian Lepore wrote: On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. ... I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. It would typical to want a 500uS timeout but be willing to late by up to 250uS if i said k is signed so negative values represent fractions of a second. 2^-128 is pretty short :) cheers luigi Ahh, I missed that. Good enough then! Hmmm, if that ideas survives further review, then could precision be encoded in 8 bits of the flags, eliminating another parm? that was also what i wrote later in the message :) now we should figure out some use for the remaining 22 bits of the flags cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 17.12.2012 03:29, Adrian Chadd wrote: On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote: Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Don't you think one week is a little on the low side for reviewing something this critical? It was in public development for the last half a year. IIRC, previous announce by Davide few months ago caused no any feedback. If you say you will review it in two weeks, I will wait for two weeks. But I don't want to wait without a purpose. Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? Are the ref* system have any load to see anything? -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 17.12.2012 05:38, Adrian Chadd wrote: On 16 December 2012 18:31, Garrett Cooper yaneg...@gmail.com wrote: Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? And maybe tinderbox..? Tinderbox is a great idea. Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too? It would be good to test, though I know that at least dummynet is written awful from the point of this project with its callout_reset(dn_timeout, 1, dummynet, NULL); It should work, but kill most of power benefits. I was promised it will be fixed after this project end. What else gets heavily callout /timer driven? Try some computational workloads that stress the fairness of ULE/4BSD, maybe? Schedulers are driven directly by hardclock()/statclock(), so fairness is not affected here. If CPU is not idle, it will receive full set of required events with maximum available precision. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 16 December 2012 23:57, Andriy Gapon a...@freebsd.org wrote: Thank god that this feature was developed in a branch, it was developed for a long period of time and there were people who routinely reviewed and tested (and really used) it. And yeah, its design was announced and discussed well in advance too. I can see that; I was even there at David's presentation at BSDCan 2012 about it. Now that it's finished though, it would be nice to get some more stress testing before committing it. Just because it's been developed in a branch doesn't at all imply that it's had wide testing. It's now imminently going into the tree, so that may scare (heh!) a few people into testing it. I'm sure it'll mostly just work, that it'll not really break anything. It just to me feels that a week of warning before committing something like this is a little soon. It's _just_ been finished and the authors have been doing some last minute changes as they get better ideas on implementing things. I think it's great work, I'd just like to see some wider testing. So to put my money where my mouth is, I bought a new hard disk for my T60 last week. I'll install -HEAD on it tomorrow and give it a whirl. Thanks, Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
calloutng and dummynet (Re: [RFC/RFT] calloutng)
On Mon, Dec 17, 2012 at 10:14:29AM +0200, Alexander Motin wrote: On 17.12.2012 05:38, Adrian Chadd wrote: ... Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too? It would be good to test, though I know that at least dummynet is written awful from the point of this project with its callout_reset(dn_timeout, 1, dummynet, NULL); It should work, but kill most of power benefits. I was promised it will be fixed after this project end. never trust italians :) but it is good that you are reminding me this, hopefully i will be able to give it a shot after the holidays cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
[addressing the various items separately] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... - for several functions the only change is the name of an argument from busy to us. Can you elaborate the reason for the change, and whether us means microseconds or the pronoun ?) Please see r242905 by mav@. i see the goal of this patch is to pass along the amount of time till the next timer. I wonder why the choice is to use (actually, call) the value microseconds rather use a bintime or something scaled and with a well defined resolution. In fact looking at the relevant diff http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905 cpu_idleclock() actually returns a value that is not even microseconds but 1/(2^20) seconds. The value seems to be ignored right now so it would be a good time to discuss the resolution. I am concerned that at some point (5 years from now perhaps ?) microseconds might start to become too coarse and we would like something with a more fine-grained resolution. On the other hand, for the purposes of this change, we can probably live with an upper limit of some seconds (waking up the machine once per second is not going to kill performance). So i would suggest to make the argument to these functions uint_32 or uint_64 (preferably the same for 32- and 64-bit machines), rename it to something different from 'us' and have at least 28-30 fractional bits to represent a bintime. Right now you are using a bintime with 20 fractional and 11 or 43 bits for the integer part. cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
Hi. I wonder why the choice is to use (actually, call) the value microseconds rather use a bintime or something scaled and with a well defined resolution. It was kind of engineering choice. I've chosen microseconds, following values used by ACPI to represent CPU sleep states exit latencies. Now that is the only usage for that value. If CPUs so much reduce wakeup latencies to make this scale too coarse, this type will be the smallest of our optimization tasks. On the other side, I have some doubts that we will be able to reach supported 2048 seconds limit on the integer side. Now even completely empty idle system has about 30 interrupts per second, that is far from 0.0005. From the other side, I don't know any system where CPUs have 2048 seconds wakeup latency. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
API explosion (Re: [RFC/RFT] calloutng)
[again, response to another issue i raised] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... Finally, a more substantial comment: - a lot of functions which formerly had only a timo argument now have timo, bt, precision, flags. Take seltdwait() as an example. seltdwait() is not part of the public KPI. It has been modified to avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e. two separate functions, even though we could share most of the code is not a clever approach, IMHO. As I told before, seltdwait() is not exposed so we can modify its argument without breaking anything. It seems that you have been undecided between two approaches: for some of these functions you have preserved the original function that deals with ticks and introduced a new one that deals with the bintime, whereas in other cases you have modified the original function to add bt, precision, flags. I'm not. All the functions which are part of the public KPI (e.g. condvar(9), sleepq(9), *) are still available. *_flags variants have been introduced so that consumers can take advantage of the new 'precision tolerance mechanism' implemented. Also, *_bt variants have been introduced. I don't see any undecision between the two approaches. Please note that now the callout backend deals with bintime, so every time callout_reset_on() is called, the 'tick' argument passed is silently converted to bintime. I will try to give more specific example, using the latest patch from mav http://people.freebsd.org/~mav/calloutng_12_16.patch In the manpage, for instance, the existing functions now are extended with two more variants (sometimes; msleep_spin() for instance is missing msleep_spin_bt() but perhaps that is just an oversight). .Nm sleepq_set_timeout , +.Nm sleepq_set_timeout_flags , +.Nm sleepq_set_timeout_bt , .Nm msleep , +.Nm msleep_flags , +.Nm msleep_bt , .Nm msleep_spin , +.Nm msleep_spin_flags , .Nm pause , +.Nm pause_flags , +.Nm pause_bt , .Nm tsleep , +.Nm tsleep_flags , +.Nm tsleep_bt , .Nm cv_timedwait , +.Nm cv_timedwait_bt , +.Nm cv_timedwait_flags , .Nm cv_timedwait_sig , +.Nm cv_timedwait_sig_bt , +.Nm cv_timedwait_sig_flags , .Nm callout_reset , +.Nm callout_reset_flags , .Nm callout_reset_on , +.Nm callout_reset_flags_on , +.Nm callout_reset_bt_on , If you look at the backends, they take both a timo and a bintime. -int_cv_timedwait(struct cv *cvp, struct lock_object *lock, int timo); -int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, int timo); +int_cv_timedwait(struct cv *cvp, struct lock_object *lock, + struct bintime *bt, struct bintime *precision, int timo, + int flags); +int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, + struct bintime *bt, struct bintime *precision, int timo, + int flags); and then internally they call the 'timo' or the 'bt' version depending on the case + if (bt == NULL) + sleepq_set_timeout_flags(cvp, timo, flags); + else + sleepq_set_timeout_bt(cvp, bt, precision); So basically you are doing the following: + create two new variant for each existing function foo(, ... timo, ... ) old method foo_flags(, ... timo, ... ) new method foo_bt(... , bt, precision, ...)new method (the 'API explosion' i am mentioning in the Subject) + the variants are mapped to the same internal function _foo_internal(..., timo, bt, precision, flags, ...) + and then the internal function has a (runtime) conditional if (bt == NULL) { // convert timo to bt } else { // have a bt + precision } ... I would instead do the following: + create a new API function that takes only bintime+precision+flags, no ticks. I am not sure if there is even a need to have an internal name _cv_timedwait_sig( ) or you can just have cv_timedwait_sig_bt(...) + use a macro or an inline function to remap the old API to the (single) new one, making the argument conversion immediately. #define cv_timedwait_sig(...) cv_timedwait_sig_bt( ...) This has the advantage that conversions might be done at compile time perhaps with some advantage in terms of code and performance. + do not bother creating yet another cv_timedwait_sig_flags() function. Since it did not exist before, you have to do the conversion manually anyways, at which point you just change the argument to use bintime instead of ticks.
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo ri...@iet.unipi.it wrote: [addressing the various items separately] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... - for several functions the only change is the name of an argument from busy to us. Can you elaborate the reason for the change, and whether us means microseconds or the pronoun ?) Please see r242905 by mav@. i see the goal of this patch is to pass along the amount of time till the next timer. I wonder why the choice is to use (actually, call) the value microseconds rather use a bintime or something scaled and with a well defined resolution. In fact looking at the relevant diff http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905 cpu_idleclock() actually returns a value that is not even microseconds but 1/(2^20) seconds. The value seems to be ignored right now so it would be a good time to discuss the resolution. I am concerned that at some point (5 years from now perhaps ?) microseconds might start to become too coarse and we would like something with a more fine-grained resolution. On the other hand, for the purposes of this change, we can probably live with an upper limit of some seconds (waking up the machine once per second is not going to kill performance). I would talk more about power consumption problem rather than performances. Yes, you're right because now, even with calloutng changes, the CPU is woken up at least twice per second. The wheel scan, in case it doesn't find a new callout to schedule in the next half-second, schedules an interrupt half a second from now (where now is the time obtained using getbinuptime()/binuptime()). This is a threshold we set up empirically, and it resulted is good for now. But in the future someone may raise the threshold to 1 second, 10 seconds, or something like. So, I don't agree with your statement. So i would suggest to make the argument to these functions uint_32 or uint_64 (preferably the same for 32- and 64-bit machines), rename it to something different from 'us' and have at least 28-30 fractional bits to represent a bintime. Right now you are using a bintime with 20 fractional and 11 or 43 bits for the integer part. cheers luigi Thanks Davide ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: API explosion (Re: [RFC/RFT] calloutng)
Hi. I would instead do the following: I also don't very like the wide API and want to hear fresh ideas, but approaches to time measurement there are too different to do what you are proposing. Main problem is that while ticks value is relative, bintime is absolute. It is not easy to make conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions are just passing values down. I am not sure I want to duplicate that code in each place, though doing it at least for for callout may be a good idea. Creating sets of three functions I had three different goals: - callout_reset() -- it is legacy variant required to keep API compatibility; - callout_reset_flags() -- it is for cases where custom precision specification needs to be added to the existing code, or where direct callout execution is needed. Conversion to bintime would additionally complicate consumer code, that I would try to avoid. - callout_reset_bt() -- API for the new code, which needs high precision and doesn't mind to operate bintime. Now there is only three such places in kernel now, and I don't think there will be much more. Respectively, these three options are replicated to other APIs where time intervals are used. PS: Please keep me in CC. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
On Mon, Dec 17, 2012 at 12:17:54PM -0800, Davide Italiano wrote: On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo ri...@iet.unipi.it wrote: [addressing the various items separately] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote: ... - for several functions the only change is the name of an argument from busy to us. Can you elaborate the reason for the change, and whether us means microseconds or the pronoun ?) Please see r242905 by mav@. i see the goal of this patch is to pass along the amount of time till the next timer. I wonder why the choice is to use (actually, call) the value microseconds rather use a bintime or something scaled and with a well defined resolution. In fact looking at the relevant diff http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905 cpu_idleclock() actually returns a value that is not even microseconds but 1/(2^20) seconds. The value seems to be ignored right now so it would be a good time to discuss the resolution. I am concerned that at some point (5 years from now perhaps ?) microseconds might start to become too coarse and we would like something with a more fine-grained resolution. On the other hand, for the purposes of this change, we can probably live with an upper limit of some seconds (waking up the machine once per second is not going to kill performance). I would talk more about power consumption problem rather than performances. Yes, you're right because now, even with calloutng changes, the CPU is woken up at least twice per second. The wheel scan, in case it doesn't find a new callout to schedule in the next half-second, schedules an interrupt half a second from now (where now is the time obtained using getbinuptime()/binuptime()). This is a threshold we set up empirically, and it resulted is good for now. But in the future someone may raise the threshold to 1 second, 10 seconds, or something like. So, I don't agree with your statement. this is only an issue if we want to use 32 bits. If we go to 64 bits, there is enogh space for picoseconds on the fractional part, and a few years in the integer part. but keep in mind, even powerwise, i doubt the exit from deep sleep and a callout takes more than 500us so even doing that once per second gives less than 0.5 per mille increase in power compared to a machine that is always idle This is really noise that we should not worry about. cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
Personally, I'd rather see some consistently used units here.. Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng
In message 50cf79ad.9040...@freebsd.org, Alexander Motin writes: Hi. I wonder why the choice is to use (actually, call) the value microseconds rather use a bintime or something scaled and with a well defined resolution. It was kind of engineering choice. I've chosen microseconds [...] And that was the wrong choice, the format should be a binary number so arithmetic and comparisons does not become a nightmare. If people need milliseconds or microseconds, they can get that using suitable #defined multiplication factors. A 64 bit type, with 32bit before and after the binary point would be sufficient for now, and easily extensible to something larger should one or more laws of computing nature be changed. So do the following: typedef dur_t int64_t;/* signed for bug catching */ #define DURSEC ((dur_t)1 32) #define DURMIN (DURSEC * 60) #define DURMSEC (DURSEC / 1000) #define DURUSEC (DURSEC / 1000) #define DURNSEC (DURSEC / 1000) And stop crapping around with mixed-radix numbers, even the british changed to decimal coinage to avoid that crap... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote: On 15.12.2012 23:03, Alexander Motin wrote: Sorry, it's my fault. I've tried to save some time on patch generation and forgot about that change in lib/. We haven't touched user-level in our work except that file. Here is patch with that chunk added: http://people.freebsd.org/~mav/calloutng_12_15_1.patch And one more part I've missed is manual pages update, that probably needs more improvements: http://people.freebsd.org/~mav/calloutng_12_15.man.patch Perhaps use the SCM at what its good at? Sync your branch with HEAD and then do an 'svn diff ^/head' and your branch. -- -- David (obr...@freebsd.org) ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Mon, Dec 17, 2012 at 2:39 PM, David O'Brien obr...@freebsd.org wrote: On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote: On 15.12.2012 23:03, Alexander Motin wrote: Sorry, it's my fault. I've tried to save some time on patch generation and forgot about that change in lib/. We haven't touched user-level in our work except that file. Here is patch with that chunk added: http://people.freebsd.org/~mav/calloutng_12_15_1.patch And one more part I've missed is manual pages update, that probably needs more improvements: http://people.freebsd.org/~mav/calloutng_12_15.man.patch Perhaps use the SCM at what its good at? Sync your branch with HEAD and then do an 'svn diff ^/head' and your branch. -- -- David (obr...@freebsd.org) Last time I tried doing that the way you describe, I got an output with tons svn:mergeinfo and I didn't find a way to suppress them if not manually. e.g. Property changes on: usr.bin/procstat ___ Modified: svn:mergeinfo Merged /head/usr.bin/procstat:r236314-239017 Index: usr.bin/calendar === --- usr.bin/calendar(.../head) (revision 239166) +++ usr.bin/calendar(.../projects/calloutng)(revision 239166) Can you help me in understanding what I did wrong? Thanks ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Sat, 15 Dec 2012, Garrett Cooper wrote: On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote: On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote: Hi. I'm sorry to interrupt review, but as usual good ideas came during the final testing, causing another round. :) Here is updated patch for HEAD, that includes several new changes: http://people.freebsd.org/~mav/calloutng_12_15.patch This patch breaks the libprocstat build. Specifically, the OpenSolaris sys/time.h defines the preprocessor symbols gethrestime and gethrestime_sec. These symbols are also defined in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h. libprocstat:zfs.c is compiled using include paths that pick up the OpenSolaris time.h, and with this patch _callout.h includes sys/time.h. zfs.c includes taskqueue.h (with _KERNEL defined), which includes _callout.h, so both time.h and zfs_context.h are included in zfs.c, and the symbols are thus defined twice. Gross namespace pollution. sys/_callout.h exists so that the full namespace pollution of sys/callout.h doesn't get included nested. But sys/time.h is much more polluted than sys/callout.h. However, sys/time.h is old standard pollution in sys/param.h, and sys/callout.h is not so old standard pollution in sys/systm.h. It is a bug to not include sys/param.h and sys/systm.h in most kernel source code, so these nested includes are just style bugs -- they have no effect for correct kernel source code. The patch below fixes the build for me. Another approach might be to include sys/_task.h instead of taskqueue.h at the beginning of zfs.c. Good if it works. I had a patch open once upon a time to cleanup inclusion of sys/time.h all over the tree and deal with the sys/time.h - time.h pollution issue, but it got dropped due to lack of interest (20~30 apps/libs were affected IIRC and I only really got assistance in fixing the UFS and bsnmpd pieces, and gave up due to lack of response from maintainers). dtrace/zfs is a definite instigator in this pollution (I remember nasty cddl/... pollution with the compat sys/time.h header). Please use the unix newline character in mail. The above is difficult to quote. The standard sys/time.h pollution in sys/param.h is only in the kernel, and there aren't many direct includes of sys/time.h in the kernel. Userland is different and many of the direct includes were correct. But not POSIX specifies that struct timespec and struct timeval be defined in most places where they are needed, so the includes of sys/time.h are not necessary for POSIX or FreeBSD, although FreeBSD man pages still say that they are necessary. The sys/time.h - time.h pollution issue is also only for userland. Many places depend on one including the other, and include the wrong one themself. Bottom line: make sure anything new you're defining isn't already defined via POSIX or other OSes, and if so please try to make the implementations match (so that eventual POSIX inclusion might be possible) and when in doubt I suggest consulting standards@ / brde@. Bruce ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
Hi. Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Changes in this version: -- Removed couple of redundant variables in callout implementation, that reduced sizeof(struct callout) by two pointers and simplified some internal code. -- syscons driver was made to schedule only 1-2 callouts per second instead of 20-30 before when console is in graphical mode and there are few other things to do. Now my laptop has only about 30 interrupts per second total during idle periods with X running. -- i8254 eventtimer driver was optimized to work faster in disabled by default one-shot mode. -- Few kernel functions were added to make KPIs more complete. -- Man pages were updated. -- Some style fixes were made. On 15.12.2012 18:55, Alexander Motin wrote: I'm sorry to interrupt review, but as usual good ideas came during the final testing, causing another round. :) Here is updated patch for HEAD, that includes several new changes: http://people.freebsd.org/~mav/calloutng_12_15.patch The new changes are: -- Precision and event aggregation code was reworked. Instead of previous -prec/+prec representation, precision is now single-sided -- -0/+prec. It allowed to significantly improve precision on long time intervals for APIs which imply that event should not happen before the specified time. Depending on CPU activity, mistake for long time intervals now will never be more then 1-500ms, even if specified precision allows more. -- Some minor optimizations were made to reduce callout overhead and latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer and TSC timecounter usleep(1) call from user-level executes in just 5-6us, instead of 7-8us before. Now it can do 180K cycles per second on single CPU with only partial CPU load. -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd, setrlimit) were modified to reduce number of interrupts, also with event aggregation by explicit specification of the acceptable events precision. Now my Core2Duo test system has only 30 interrupts per second in idle. If not remaining syscons events, it could easily be 15. My IvyBridge ultrabook first time in its history shown 5.5 hours of battery time with full screen brightness and 10 hours with lid closed. -- Some kernel functions were added to make KPIs more complete. I've successfully tested this patch on amd64 and arm. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote: Hi. Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Hi, Don't you think one week is a little on the low side for reviewing something this critical? Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? Thanks, Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On Dec 16, 2012, at 5:29 PM, Adrian Chadd adr...@freebsd.org wrote: On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote: Hi. Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Hi, Don't you think one week is a little on the low side for reviewing something this critical? Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? And maybe tinderbox..? Thanks, -Garrett ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [RFC/RFT] calloutng
On 16 December 2012 18:31, Garrett Cooper yaneg...@gmail.com wrote: Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? And maybe tinderbox..? Tinderbox is a great idea. Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too? What else gets heavily callout /timer driven? Try some computational workloads that stress the fairness of ULE/4BSD, maybe? Adrian ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org