Re: [RFC/RFT] calloutng

2013-02-21 Thread Davide Italiano
The patch has been splitted in smaller logical chunks in order to
improve readability and facilitate review.
The whole code can be found here:
http://people.freebsd.org/~davide/calloutng_split/

In particular:
http://people.freebsd.org/~davide/calloutng_split/sbintime.diff brings
the new type 32.32 fixed point used in lieu of 'int ticks' for the new
callout mechanism and relative functions (conversion from to
{timeval,bintime,timespec}, sbinuptime(), getsbinuptime())
This change is preliminary to
http://people.freebsd.org/~davide/calloutng_split/callout_internals.diff
which contains all the callout changes internally (introduction of
direct dispatching, conversion to sbintime, new precision mechanism)

http://people.freebsd.org/~davide/calloutng_split/cpuidle.diff  (see
r242905) introduce some optimization in the  sleep time calculation
and choose of the optimal sleep state, when CPU is idle.

http://people.freebsd.org/~mav/calloutng-20130221/libprocstatfix.diff
is a workaround in libprocstat in order to allow world to build after
changes. About this one, I'm not quite sure is the best solution, so
if there are other opinions, I'd be more than happy to hear.

Other patches available in the directory converts some kernel
consumers of callout(9) to the new interface (nanosleep, poll, select,
kqueue, syscons, etc...).

My plan is to commit the whole patchset March 4th, unless there are
valid reasons to not do so.
Any cooment is (as always) appreciated.

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-02-16 Thread Luigi Rizzo
On Thu, Feb 14, 2013 at 11:56:54PM -0800, Alfred Perlstein wrote:
 [ added Luigi Rizzo to thread ]
 
 
 On 2/11/13 12:26 PM, Davide Italiano wrote:
  [trimmed old mails]
 
  Here's a new version of the patch:
  http://people.freebsd.org/~davide/patches/calloutng-11022012.diff
 
  Significant bits changed (after wider discussion and suggestion by phk@):
  - Introduction of the new sbintime_t type (32.32 fixed point) with the
  respective conversion (sbintime2bintime, sbintime2timeval etc...)
  - switch from 64.64 (struct bintime) format to measure time to sbintime_t
  - Use sbintime_t to represent expected sleep time instead of measuring
  it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...).
 
 
 Luigi and I discussed this at BAFUG tonight and he expressed an interest 
 in shepherding the code in at this point.
 
 Luigi, can you reiterate any points that still remain before we 
 integrate this code?

I am mostly ok with the patch in the current state.
I have few comments/suggestions below but they are mostly
on documenting/style/trivial bugfixes.

So now I would really like to see this code go in HEAD,
to give people a chance to see how it works and possibly
figure out whether the new API needs modifications.

To recall, my major concerns were the following:

- API explosion, with multiple versions of the callout routines.

  This seems to be solved now, with only one version (the *_sbt()
  functions) and macros remapping the old functions to the new ones.

- the use of struct bintime for timeouts and precision.

  This is also solved thanks to the introduction of sbintime_t values
  (fixed-point 32.32 times)

- Some measurements also indicated that the default precision
  could give large (absolute) errors on the timeouts, especially
  with large timeouts.

  I am not sure what is the situation with this new version, but i believe
  this a relatively minor issue because it surely simple to customize, e.g.
  having a couple of sysctl setting the default precision (percentage)
  and maximum error (absolute time). So users could e.g. set
  a 5% precision and a maximum error of 100us on a desktop,
  and 10% and 10ms on a laptop.

- Another thing that we should do (but after the patch is in) is to
  see if any existing code is converting times to ticks just to
  call the timeout routines -- right now the macros convert back
  to times, clearly it would be best to avoid the double conversion.

comments inline below, search for XXX-lr

thanks again for your work on this, and for following up with changes
after the previous discussion

cheers
luigi
  
Index: sys/dev/acpica/acpi_cpu.c
===
--- sys/dev/acpica/acpi_cpu.c   (.../head)  (revision 246685)
+++ sys/dev/acpica/acpi_cpu.c   (.../projects/calloutng)(revision 
246685)
@@ -980,13 +980,16 @@
 }
 
 /* Find the lowest state that has small enough latency. */
+us = sc-cpu_prev_sleep;
+if (sbt = 0  us  sbt / SBT_1US)XXX-lr can we have us2sbt() , 
ms2sbt() and the like ?
+   us = sbt / SBT_1US;
 cx_next_idx = 0;
 if (cpu_disable_deep_sleep)
i = min(sc-cpu_cx_lowest, sc-cpu_non_c3);
 else
i = sc-cpu_cx_lowest;
 for (; i = 0; i--) {
-   if (sc-cpu_cx_states[i].trans_lat * 3 = sc-cpu_prev_sleep) {
+   if (sc-cpu_cx_states[i].trans_lat * 3 = us) {
cx_next_idx = i;
break;
}
Index: sys/kern/kern_synch.c
===
--- sys/kern/kern_synch.c   (.../head)  (revision 246685)
+++ sys/kern/kern_synch.c   (.../projects/calloutng)(revision 
246685)
@@ -146,12 +146,12 @@
  */
 int
 _sleep(void *ident, struct lock_object *lock, int priority,
-const char *wmesg, int timo)
+const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags)
 {
struct thread *td;
struct proc *p;
struct lock_class *class;
-   int catch, flags, lock_state, pri, rval;
+   int catch, sleepq_flags, lock_state, pri, rval; XXX-lr keep 
flags, use _sleep_flags as function argument ?
WITNESS_SAVE_DECL(lock_witness);
 
td = curthread;
@@ -348,28 +349,30 @@
  * to a timo value of one.
  */
 int
-pause(const char *wmesg, int timo)
+pause_sbt(const char *wmesg, sbintime_t sbt, sbintime_t pr, int flags)
 {
-   KASSERT(timo = 0, (pause: timo must be = 0));
+   int sbt_sec;XXX-lr localize 
to the cold block
 
+   sbt_sec = sbintime_getsec(sbt); 
+   KASSERT(sbt_sec = 0, (pause: timo must be = 0));
+
/* silently convert invalid timeouts */
-   if (timo  1)
-   timo = 1;
+   if (sbt == 0)
+   sbt = tick_sbt;
 
if (cold) {
/*
-* We delay one HZ at a time to avoid overflowing the
+* We delay one second at a time to avoid overflowing the
 * 

Re: [RFC/RFT] calloutng

2013-02-14 Thread Alfred Perlstein

[ added Luigi Rizzo to thread ]


On 2/11/13 12:26 PM, Davide Italiano wrote:

[trimmed old mails]

Here's a new version of the patch:
http://people.freebsd.org/~davide/patches/calloutng-11022012.diff

Significant bits changed (after wider discussion and suggestion by phk@):
- Introduction of the new sbintime_t type (32.32 fixed point) with the
respective conversion (sbintime2bintime, sbintime2timeval etc...)
- switch from 64.64 (struct bintime) format to measure time to sbintime_t
- Use sbintime_t to represent expected sleep time instead of measuring
it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...).



Luigi and I discussed this at BAFUG tonight and he expressed an interest 
in shepherding the code in at this point.


Luigi, can you reiterate any points that still remain before we 
integrate this code?


thank you,
-Alfred

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-02-11 Thread Davide Italiano
[trimmed old mails]

Here's a new version of the patch:
http://people.freebsd.org/~davide/patches/calloutng-11022012.diff

Significant bits changed (after wider discussion and suggestion by phk@):
- Introduction of the new sbintime_t type (32.32 fixed point) with the
respective conversion (sbintime2bintime, sbintime2timeval etc...)
- switch from 64.64 (struct bintime) format to measure time to sbintime_t
- Use sbintime_t to represent expected sleep time instead of measuring
it in microseconds (cpu_idle_acpi(), cpu_idle_spin() ...).

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-21 Thread Marius Strobl
On Sun, Jan 13, 2013 at 09:36:11PM +0200, Alexander Motin wrote:
 On 13.01.2013 20:09, Marius Strobl wrote:
  On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
  On 06.01.2013 17:23, Marius Strobl wrote:
  I'm not really sure what to do about that. Earlier you already said
  that sched_bind(9) also isn't an option in case if td_critnest  1.
  To be honest, I don't really unerstand why using a spin lock in the
  timecounter path makes sparc64 the only problematic architecture
  for your changes. The x86 i8254_get_timecount() also uses a spin lock
  so it should be in the same boat.
 
  The problem is not in using spinlock, but in waiting for other CPU while
  spinlock is held. Other CPU may also hold spinlock and wait for
  something, causing deadlock. i8254 code uses spinlock just to atomically
  access hardware registers, so it causes no problems.
  
  Okay, but wouldn't that be a general problem then? Pretty much
  anything triggering an IPI holds smp_ipi_mtx while doing so and
  the lower level IPI stuff waits for other CPU(s), including on
  x86.
 
 The problem is general. But now it works because single smp_ipi_mtx is
 used in all cases where IPI result is waited. As soon as spinning
 happens with interrupts still enabled, there is no deadlocks. But
 problem reappears if any different lock is used, or locks are nested.

I'm having a hard time getting an alternate time counter device to
work. The crystal required for the counters in the south bridge just
doesn't seem to be mounted any where near it (I've not looked at the
bottom of the PCB though). While the time counter part of the on-
board bge(4) driven chips basically work, they don't seem to like
concurrent accesses caused by the rest of bge(4). I.e. although the
counter is just read, sooner or later this causes a fatal bus error.
I haven't tried serializing accesses to the chip, but getting to such
a complexity for just reading a non-indexed register at least doesn't
feel good ...
However, AFAICT the scenario you describe can't happen. On sparc64,
spinlock_enter() only raises the processor interrupt level, which
doesn't block the direct cross traps I've implemented remote reading
of (S)TICK as (which also means that the actions such traps may
perform are very limitted and must occur in interrupt context, but
which are sufficient for this purpose and in turn makes them very
fast). I.e. although the AP holds smp_ipi_mtx or any amount of
nested spin locks, this will not deadlock in case the BSP also holds
any spin lock when reading (S)TICK from it.

Marius

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-18 Thread Bruce Evans

On Thu, 17 Jan 2013, Ian Lepore wrote:


On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote:



Er, timecounters are called with a spin mutex held in existing code:
though it is dangerous to do so, timecounters are called from fast
interrupt handlers for very timekeeping-critical purposes:
- to implement the TIOCTIMESTAMP ioctl (except this is broken in
   -current).  This was a primitive version of pps timestamping.
- for pps timestamping.  The interrupt handler (which should be a fast
   interrupt handler to minimize latency) calls pps_capture() which
   calls tc_get_timecount() and does other lock-free accesses to the
   timecounter state.  This still works in -current (at least there is
   still code for it).


Unfortunately, calling pps_capture() in the primary interrupt context is
no longer an option with the stock pps driver.  Ever since the ppbus
rewrite all ppbus children must use threaded handlers.  I tried to fix
that a couple different ways, and both ended up with crazy-complex code


Hmm, I didn't notice that ppc supported pps (I try not to look at it since
it is ugly :-), and don't know of any version of it that uses non-threaded
handlers (except in FreeBSD-4 before, where normal interrupt handlers
were non-threaded, so ppc had their high latency but not the even higher
latency and overheads of threaded handlers).

OTOH, my x86 RTC interrupt handler is threaded and supports pps, and
I haven't noticed any latency problems with this.  It just can't
possibly give the  ~1 usec jitter that FreeBSD-[3-4] could give ~15
years ago using a fast interrupt handler (there must be only 1 device
using a fast interrupt handler, with this dedicated to pps, else the
multiple fast interrupt handlers will give latency much larger than
~1 usec to each other.  I don't actually use this for anything except
testing whether the RTC can be used for a poor man's pps.


scattered around the ppbus family just to support the rarely-used pps
capture.  It would have been easier to do if filter and threaded
interrupt handlers had the same function signature.

I ended up writting a separate driver that can be used instead of ppc +
ppbus + pps, since anyone who cares about precise pps capture is
unlikely to be sharing the port with a printer or plip device or some
such.


Probably all pps handlers should be special.  On x86 with reasonable
timecounter hardware, say a TSC, it takes about 10 instructions for
an entire pps interrupt handler:

XintrN:
pushl   %eax
pushl   %edx
rdtsc
# Need some ugliness for EIO here or later.
ss:movl %eax,ppscap # Hopefully lock-free via time-domain locking.
ss:movl %edx,ppscap+4
popl%edx
popl%eax
iret

After capturing the timecounter hardware value here, you convert it
to a pps event at leisure.  But since this only happens once per second,
it wouldn't be very inefficient to turn the interrupt handler into a
slow high-latency one, even a threaded one, to handle the pps event
and/or other devices attached to the interrupt.


   OTOH, all drivers that call pps_capture() from their interrupt handler
   then immediately call pps_event().  This has always been very broken,
   and became even more broken with SMPng.  pps_event() does many more
   timecounter and pps accesses whose locking is unclear at best, and
   in some configurations it calls hardpps(), which is only locked by
   Giant, despite comments in kern_ntptime.c still saying that it (and
   many other functions in kern_ntptime.c) must be called at splclock()
   or higher.  splclock() is of course now null, but the locking
   requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
   always needed to be locked by the equivalent of a spin mutex, which
   is stronger locking than was given by splclock().  pps_event() would
   have to aquire the spin mutex before calling hardpps(), although
   this is bad for fast interrupt handlers.  The correct implementation
   is probably to only do the capture part from fast interrupt handlers.


In my rewritten dedicated pps driver I call pps_capture() from the
filter handler and pps_event() from the threaded handler.  I never found


That seems right.


any good documentation on the low-level details of this stuff, and there
isn't enough good example code to work from.  My hazy memory is that I


THere seem to be no good examples.


ended up studying the pps_capture() and pps_event() code enough to infer
that their design intent seems to be to allow you to capture with no
locking and do the event processing later in some sort of deferred or
threaded context.


That seems to be the design, but there are no examples of separating
the event from the capture.

I think the correct locking is:
- capture in a fast interrupt handler, into a per-device state that
  is locked by whatever locks all of the state accessed by the fast
  interrupt handler
- switch to a less critical context later:
  - lock this step 

Re: [RFC/RFT] calloutng

2013-01-17 Thread Ian Lepore
On Mon, 2013-01-14 at 11:38 +1100, Bruce Evans wrote:
 On Sun, 13 Jan 2013, Alexander Motin wrote:
 
  On 13.01.2013 20:09, Marius Strobl wrote:
  On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
[...]
 
  In existing code in HEAD and 9 timecounters are never called with spin
  mutex held.  I intentionally tried to avoid that in existing eventtimers
  code.
 
 Er, timecounters are called with a spin mutex held in existing code:
 though it is dangerous to do so, timecounters are called from fast
 interrupt handlers for very timekeeping-critical purposes:
 - to implement the TIOCTIMESTAMP ioctl (except this is broken in
-current).  This was a primitive version of pps timestamping.
 - for pps timestamping.  The interrupt handler (which should be a fast
interrupt handler to minimize latency) calls pps_capture() which
calls tc_get_timecount() and does other lock-free accesses to the
timecounter state.  This still works in -current (at least there is
still code for it).
 

Unfortunately, calling pps_capture() in the primary interrupt context is
no longer an option with the stock pps driver.  Ever since the ppbus
rewrite all ppbus children must use threaded handlers.  I tried to fix
that a couple different ways, and both ended up with crazy-complex code
scattered around the ppbus family just to support the rarely-used pps
capture.  It would have been easier to do if filter and threaded
interrupt handlers had the same function signature.

I ended up writting a separate driver that can be used instead of ppc +
ppbus + pps, since anyone who cares about precise pps capture is
unlikely to be sharing the port with a printer or plip device or some
such.

OTOH, all drivers that call pps_capture() from their interrupt handler
then immediately call pps_event().  This has always been very broken,
and became even more broken with SMPng.  pps_event() does many more
timecounter and pps accesses whose locking is unclear at best, and
in some configurations it calls hardpps(), which is only locked by
Giant, despite comments in kern_ntptime.c still saying that it (and
many other functions in kern_ntptime.c) must be called at splclock()
or higher.  splclock() is of course now null, but the locking
requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
always needed to be locked by the equivalent of a spin mutex, which
is stronger locking than was given by splclock().  pps_event() would
have to aquire the spin mutex before calling hardpps(), although
this is bad for fast interrupt handlers.  The correct implementation
is probably to only do the capture part from fast interrupt handlers.
 

In my rewritten dedicated pps driver I call pps_capture() from the
filter handler and pps_event() from the threaded handler.  I never found
any good documentation on the low-level details of this stuff, and there
isn't enough good example code to work from.  My hazy memory is that I
ended up studying the pps_capture() and pps_event() code enough to infer
that their design intent seems to be to allow you to capture with no
locking and do the event processing later in some sort of deferred or
threaded context.

  Callout code same time can be called in any environment with any
  locks held. And new callout code may need to know precise current time
  in any of those conditions. Attempt to use an IPI and wait there can be
  fatal.
 
 Callout code can't be called from such a general any environment as
 timecounter code.  Not from a fast interrupt handler.  Not from an NMI
 or IPI handler.  I hope.  But timecounter code has a good chance of
 working even for the last 2 environments, due to its design requirement
 of working in the first.
 
 The spinlock in the i8254 timecounter certainly breaks some cases.
 For example, suppose the lock is held for a timecounter read from
 normal context.  It masks hardware interrupts on the current CPU (except
 in my version).  It doesn't mask NMIs or other traps.  So if the NMI
 or other trap handler does a timecounter hardware call, there is
 deadlock in at least the !SMP case.  In my version, it blocks normal
 interrupts later if they occur, but doesn't block fast interrupts, so
 the pps_capture() call would deadlock if it occurs, like a timecounter
 call from an NMI.  I avoid this by not using pps in any fast interrupt
 handler, and by only using the i8254 timecounter for testing.  I do
 use pps in a (nonstandard) x86 RTC clock interrupt handler.  My clock
 interrupt handlers are all non-fast to avoid this and other locking
 problems.

Hrm, now you've got me a bit worried about capturing in the primary
context.  Not that I have much option, on a 300mhz Geode and similar
wimpy embedded processors there's enough latency on a theaded handler
that the pps signal can be de-asserted by time the handler runs
(precision timing gear often outputs a very narrow pps pulse, 1 - 10uS
isn't uncommon).

I know I don't have 

Re: [RFC/RFT] calloutng

2013-01-17 Thread Ian Lepore
On Sun, 2012-12-30 at 16:13 -0700, Ian Lepore wrote:
 On Wed, 2012-12-26 at 21:24 +0200, Alexander Motin wrote:
 [...]
  
 
 I grabbed testsleep.c to test an arm event timer implementation, and had
 to fix a couple nits... kqueueto was missing from the names[] array, and
 I had to add a * 1000 to a couple places where usec was stuffed into a
 timespec's tv_nsec.
 
 I also tested the calloutng_12_17 patches and the kqueue stuff behaved
 very strangely.  Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
 It's so sweet to be able to do precision sleeps.
 
 -- Ian
 
 
 plain text document attachment (calloutng_test.txt)
 for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
 done
 
 
 With calloutng_12_26.patch...
 
 HZ=100   HZ=250   HZ=1000
 --   
 select  1 55.79  1 50.96  1 61.32
 poll1   1109.46  1   1107.86  1   1114.38
 usleep  1 56.33  1 72.90  1 62.78
 nanosleep   1 52.66  1 55.23  1 64.23
 kqueue  1   1114.23  1   1113.81  1   1121.21
 kqueueto1 65.44  1 71.00  1 75.01
 syscall 1  4.70  1  4.45  1  4.55
 select300355.79300357.76300362.35
 poll  300   1107.85300   1122.55300   1115.62
 usleep300355.28300357.28300360.79
 nanosleep 300354.49300355.82300360.62
 kqueue300   1112.57300   1118.13300   1117.16
 kqueueto  300375.98300378.62300395.61
 syscall   300  4.41300  4.45300  4.54
 select   3000   3246.75   3000   3246.74   3000   3252.72
 poll 3000   3238.10   3000   3229.12   3000   3250.10
 usleep   3000   3242.47   3000   3237.06   3000   3249.61
 nanosleep3000   3238.79   3000   3231.55   3000   3248.11
 kqueue   3000   3240.01   3000   3236.07   3000   3247.60
 kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
 syscall  3000  4.69   3000  4.44   3000  4.50
 select  3  31714.60  3  31941.17  3  32467.69
 poll3  31522.76  3  31983.00  3  32497.81
 usleep  3  31459.67  3  31980.76  3  32458.71
 nanosleep   3  31431.02  3  31982.22  3  32525.20
 kqueue  3  31466.75  3  31873.90  3  31973.54
 kqueueto3  31564.67  3  32522.35  3  32475.59
 syscall 3  4.70  3  4.73  3  4.89
 select 30 319133.02 30 311562.33 30 309918.62
 poll   30 319604.27 30 311422.94 30 31.76
 usleep 30 319314.60 30 311269.69 30 309996.34
 nanosleep  30 319497.58 30 311425.40 30 309997.13
 kqueue 30 309995.55 30 303980.27 30 309908.82
 kqueueto   30 319505.88 30 311424.97 30 309996.16
 syscall30  4.41 30  4.45 30  4.89

 
 With no patches...
 
 HZ=100   HZ=250   HZ=1000
 --   
 select  1  19941.70  1   7989.10  1   1999.16
 poll1  19904.61  1   7987.32  1   1999.78
 usleep  1  19904.95  1   7993.30  1   1999.96
 nanosleep   1  19905.64  1   7993.71  1   1999.72
 kqueue  1  10001.61  1   4004.00  1   1000.27
 kqueueto1  19904.00  1   7993.03  1   1999.54
 syscall 1  4.04  1  4.05  1  4.75
 select300  19904.66300   7998.39300   2000.27
 poll  300  19904.35300   7993.47300   1999.86
 usleep300  19903.96300   7994.11300   1999.81
 nanosleep 300  19904.48300   7993.77300   1999.80
 kqueue300  10001.68300   4004.18300   1000.31
 kqueueto  300  19997.86300   7993.37300   1999.59
 syscall   300  4.01300  4.00300  4.32
 select   3000  19904.80   3000   7998.85   3000   3998.43
 poll 3000  19904.92   3000   8005.93   3000   3999.39
 usleep   3000  19904.50   3000   7992.88   3000   3999.44
 nanosleep3000  19904.84   3000 

Re: [RFC/RFT] calloutng

2013-01-13 Thread Marius Strobl
On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
 On 06.01.2013 17:23, Marius Strobl wrote:
  On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
  On 26.12.2012 01:21, Marius Strobl wrote:
  On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With several important changes made last time I am going to delay commit
  to HEAD for another week to do more testing. Comments and new test cases
  are welcome. Thanks for staying tuned and commenting.
 
  FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
  try on sparc64 and it at least survives a buildworld there. However,
  with the patched kernels, buildworld times seem to increase slightly but
  reproducible by 1-2% (I only did four runs but typically buildworld
  times are rather stable and don't vary more than a minute for the
  same kernel and source here). Is this an expected trade-off (system
  time as such doesn't seem to increase)?
 
  I don't think build process uses significant number of callouts to 
  affect results directly. I think this additional time could be result of 
  the deeper next event look up, done by the new code, that is practically 
  useless for sparc64, which effectively has no cpu_idle() routine. It 
  wouldn't affect system time and wouldn't show up in any statistics 
  (except PMC or something alike) because it is executed inside timer 
  hardware interrupt handler. If my guess is right, that is a part that 
  probably still could be optimized. I'll look on it. Thanks.
 
  Is there anything specific to test?
 
  Since the most of code is MI, for sparc64 I would mostly look on related 
  MD parts (eventtimers and timecounters) to make sure they are working 
  reliably in more stressful conditions.  I still have some worries about 
  possible deadlock on hardware where IPIs are used to fetch present time 
  from other CPU.
  
  Well, I've just learnt two things the hard way:
  a) We really need the mutex in that path.
  b) Assuming that the initial synchronization of the counters is good
 enough and they won't drift considerably accross the CPUs so we can
 always use the local one makes things go south pretty soon after
 boot. At least with your calloutng_12_26.patch applied.
 
 Do you think it means they are not really synchronized for some reason?

There's definitely no hardware in place which would synchronize them.
I've no idea how to properly measure the difference between two tick
counters, but I think it's rarther their drift and not the software
synchronization we do when starting APs that is causing problems.
Mainly, because I can't really think of a better algorithm for doing
the latter when startiing the APs. The symptoms are that bout 30 to
60 seconds after that I start to see weird timeouts from device
drivers. I'd need to check how long these timeouts actually are to
see whether it could be a problem right from the start though. In
any case, it seems foolish to think that synchronizing them once
would be sufficient and they won't drift until shutdown. Linux
probably also doesn't keep re-synchronize them without a reason.
Just using a single timecounter source simply appears to be the
better choice.

 
  I'm not really sure what to do about that. Earlier you already said
  that sched_bind(9) also isn't an option in case if td_critnest  1.
  To be honest, I don't really unerstand why using a spin lock in the
  timecounter path makes sparc64 the only problematic architecture
  for your changes. The x86 i8254_get_timecount() also uses a spin lock
  so it should be in the same boat.
 
 The problem is not in using spinlock, but in waiting for other CPU while
 spinlock is held. Other CPU may also hold spinlock and wait for
 something, causing deadlock. i8254 code uses spinlock just to atomically
 access hardware registers, so it causes no problems.

Okay, but wouldn't that be a general problem then? Pretty much
anything triggering an IPI holds smp_ipi_mtx while doing so and
the lower level IPI stuff waits for other CPU(s), including on
x86.

 
  The affected machines are equipped with a x86-style south bridge
  which exposes a powermanagment unit (intended to be used as a SMBus
  bridge only in these machines) on the PCI bus. Actually, this device
  also includes an ACPI power management timer. However, I've just
  spent a day trying to get that one working without success - it
  just doesn't increment. Probably its clock input isn't connected as
  it's 

Re: [RFC/RFT] calloutng

2013-01-13 Thread Adrian Chadd
Ok,

So everyone - what _could_ be brought into -HEAD right now, without
any actual change in code behaviour?

eg, what kind of refactoring could be done to reduce the amount of
diffs between the branch and -HEAD?



Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-13 Thread Alexander Motin
On 13.01.2013 20:09, Marius Strobl wrote:
 On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
 On 06.01.2013 17:23, Marius Strobl wrote:
 I'm not really sure what to do about that. Earlier you already said
 that sched_bind(9) also isn't an option in case if td_critnest  1.
 To be honest, I don't really unerstand why using a spin lock in the
 timecounter path makes sparc64 the only problematic architecture
 for your changes. The x86 i8254_get_timecount() also uses a spin lock
 so it should be in the same boat.

 The problem is not in using spinlock, but in waiting for other CPU while
 spinlock is held. Other CPU may also hold spinlock and wait for
 something, causing deadlock. i8254 code uses spinlock just to atomically
 access hardware registers, so it causes no problems.
 
 Okay, but wouldn't that be a general problem then? Pretty much
 anything triggering an IPI holds smp_ipi_mtx while doing so and
 the lower level IPI stuff waits for other CPU(s), including on
 x86.

The problem is general. But now it works because single smp_ipi_mtx is
used in all cases where IPI result is waited. As soon as spinning
happens with interrupts still enabled, there is no deadlocks. But
problem reappears if any different lock is used, or locks are nested.

In existing code in HEAD and 9 timecounters are never called with spin
mutex held.  I intentionally tried to avoid that in existing eventtimers
code. Callout code same time can be called in any environment with any
locks held. And new callout code may need to know precise current time
in any of those conditions. Attempt to use an IPI and wait there can be
fatal.

 The affected machines are equipped with a x86-style south bridge
 which exposes a powermanagment unit (intended to be used as a SMBus
 bridge only in these machines) on the PCI bus. Actually, this device
 also includes an ACPI power management timer. However, I've just
 spent a day trying to get that one working without success - it
 just doesn't increment. Probably its clock input isn't connected as
 it's not intended to be used in these machines.
 That south bridge also includes 8254 compatible timers on the ISA/
 LPC side, but are hidden from the OFW device tree. I can hack these
 devices into existence and give it a try, but even if that works this
 likely would use the same code as the x86 i8254_get_timecount() so I
 don't see what would be gained with that.

 The last thing in order to avoid using the tick counter as timecounter
 in the MP case I can think of is that the Broadcom MACs in the affected
 machines also provide a counter driven by a 1 MHz clock. If that's good
 enough for a timecounter I can hook these up (in case these work ...)
 and hack bge(4) to not detach in that case (given that we can't detach
 timecounters ...).

 i8254 on x86 is also just a bit above 1MHz.

 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c


 I've run Ian's set of tests on a v215 with and without your
 calloutng_12_26.patch and on a v210 (these uses the IPI approach)
 with the latter also applied.
 I'm not really sure what to make out of the numbers.

v215 w/o v215 w/  v210 w/ 
 --   
 select  1   1999.61  1 23.87  1 29.97
 poll1   1999.70  1   1069.61  1   1075.24
 usleep  1   1999.86  1 23.43  1 28.99
 nanosleep   1999.92  1 23.28  1 28.66
 kqueue  1   1000.12  1   1071.13  1   1076.35
 kqueueto1999.56  1 26.33  1 31.34
 syscall 1  1.89  1  1.92  1  2.88
 
 FYI, these are the results of the v215 (btw., these (ab)use a bus
 cycle counter of the host-PCI-bridge as timecounter) with your
 calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
 select 1 23.82
 poll   1   1008.23
 usleep 1 23.31
 nanosleep  1 23.17
 kqueue 1   1010.35
 kqueueto   1 26.26
 syscall1  1.91
 select   300307.72
 poll 300   1008.23
 usleep   300307.64
 nanosleep300 23.21
 kqueue   300   1010.49
 kqueueto 300 26.27
 syscall  300  1.92
 select  3000   3009.95
 poll3000   3013.33
 usleep  3000   3013.56
 nanosleep   3000 23.17
 kqueue  3000   3011.09
 kqueueto3000 26.24
 syscall 3000  1.91
 select 3  30013.51
 poll   3  30010.63
 usleep 3  30010.64
 nanosleep  3 36.91
 kqueue 3  30012.38
 kqueueto   3 39.90
 syscall3  1.90
 select30 300017.52
 poll  30 300013.00
 usleep30 300012.64
 nanosleep 30307.59
 kqueue30 300017.07
 kqueueto  30310.24
 

Re: [RFC/RFT] calloutng

2013-01-13 Thread Ian Lepore
On Sun, 2013-01-13 at 21:36 +0200, Alexander Motin wrote:
 On 13.01.2013 20:09, Marius Strobl wrote:
[...]
  
  Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
  adjustment on sparc64?
 
 If it is not possible or not good to to stop timer during programming,
 there will always be some race window between code execution and timer
 ticking. So some minimal safety range should be reserved. Though it
 probably can be significantly reduced. In case of x86/HPET there is
 additional factor of NMI, that extends race to unpredictable level and
 so makes additional post-read almost mandatory.
 
  May be with rereading counter
  after programming comparator (same as done for HPET, reading which is
  probably much more expensive) this value could be reduced.
  
  I see. There are some bigger fish to fry at the moment though :)
 

Speaking of the HPET code, it seems to me that its restart logic can
fire the same event twice.  Is that harmless?

-- Ian


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-13 Thread Alexander Motin
On 14.01.2013 01:10, Ian Lepore wrote:
 On Sun, 2013-01-13 at 21:36 +0200, Alexander Motin wrote:
 On 13.01.2013 20:09, Marius Strobl wrote:
 [...]

 Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
 adjustment on sparc64?

 If it is not possible or not good to to stop timer during programming,
 there will always be some race window between code execution and timer
 ticking. So some minimal safety range should be reserved. Though it
 probably can be significantly reduced. In case of x86/HPET there is
 additional factor of NMI, that extends race to unpredictable level and
 so makes additional post-read almost mandatory.

 May be with rereading counter
 after programming comparator (same as done for HPET, reading which is
 probably much more expensive) this value could be reduced.

 I see. There are some bigger fish to fry at the moment though :)

 
 Speaking of the HPET code, it seems to me that its restart logic can
 fire the same event twice.  Is that harmless?

It is much less harmful then not fire and loose interrupt completely,
and as result stuck for indefinite time. I have no better ideas, while
this seems works well.

-- 
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-13 Thread Bruce Evans

On Sun, 13 Jan 2013, Alexander Motin wrote:


On 13.01.2013 20:09, Marius Strobl wrote:

On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:

On 06.01.2013 17:23, Marius Strobl wrote:

I'm not really sure what to do about that. Earlier you already said
that sched_bind(9) also isn't an option in case if td_critnest  1.
To be honest, I don't really unerstand why using a spin lock in the
timecounter path makes sparc64 the only problematic architecture
for your changes. The x86 i8254_get_timecount() also uses a spin lock
so it should be in the same boat.


The problem is not in using spinlock, but in waiting for other CPU while
spinlock is held. Other CPU may also hold spinlock and wait for
something, causing deadlock. i8254 code uses spinlock just to atomically
access hardware registers, so it causes no problems.


Okay, but wouldn't that be a general problem then? Pretty much
anything triggering an IPI holds smp_ipi_mtx while doing so and
the lower level IPI stuff waits for other CPU(s), including on
x86.


The problem is general. But now it works because single smp_ipi_mtx is
used in all cases where IPI result is waited. As soon as spinning
happens with interrupts still enabled, there is no deadlocks. But
problem reappears if any different lock is used, or locks are nested.

In existing code in HEAD and 9 timecounters are never called with spin
mutex held.  I intentionally tried to avoid that in existing eventtimers
code.


Er, timecounters are called with a spin mutex held in existing code:
though it is dangerous to do so, timecounters are called from fast
interrupt handlers for very timekeeping-critical purposes:
- to implement the TIOCTIMESTAMP ioctl (except this is broken in
  -current).  This was a primitive version of pps timestamping.
- for pps timestamping.  The interrupt handler (which should be a fast
  interrupt handler to minimize latency) calls pps_capture() which
  calls tc_get_timecount() and does other lock-free accesses to the
  timecounter state.  This still works in -current (at least there is
  still code for it).

  OTOH, all drivers that call pps_capture() from their interrupt handler
  then immediately call pps_event().  This has always been very broken,
  and became even more broken with SMPng.  pps_event() does many more
  timecounter and pps accesses whose locking is unclear at best, and
  in some configurations it calls hardpps(), which is only locked by
  Giant, despite comments in kern_ntptime.c still saying that it (and
  many other functions in kern_ntptime.c) must be called at splclock()
  or higher.  splclock() is of course now null, but the locking
  requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
  always needed to be locked by the equivalent of a spin mutex, which
  is stronger locking than was given by splclock().  pps_event() would
  have to aquire the spin mutex before calling hardpps(), although
  this is bad for fast interrupt handlers.  The correct implementation
  is probably to only do the capture part from fast interrupt handlers.


Callout code same time can be called in any environment with any
locks held. And new callout code may need to know precise current time
in any of those conditions. Attempt to use an IPI and wait there can be
fatal.


Callout code can't be called from such a general any environment as
timecounter code.  Not from a fast interrupt handler.  Not from an NMI
or IPI handler.  I hope.  But timecounter code has a good chance of
working even for the last 2 environments, due to its design requirement
of working in the first.

The spinlock in the i8254 timecounter certainly breaks some cases.
For example, suppose the lock is held for a timecounter read from
normal context.  It masks hardware interrupts on the current CPU (except
in my version).  It doesn't mask NMIs or other traps.  So if the NMI
or other trap handler does a timecounter hardware call, there is
deadlock in at least the !SMP case.  In my version, it blocks normal
interrupts later if they occur, but doesn't block fast interrupts, so
the pps_capture() call would deadlock if it occurs, like a timecounter
call from an NMI.  I avoid this by not using pps in any fast interrupt
handler, and by only using the i8254 timecounter for testing.  I do
use pps in a (nonstandard) x86 RTC clock interrupt handler.  My clock
interrupt handlers are all non-fast to avoid this and other locking
problems.


FYI, these are the results of the v215 (btw., these (ab)use a bus
cycle counter of the host-PCI-bridge as timecounter) with your
calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
select 1 23.82
poll   1   1008.23
usleep 1 23.31
nanosleep  1 23.17
kqueue 1   1010.35
kqueueto   1 26.26
syscall1  1.91
select   300307.72
poll 300   1008.23
usleep   300307.64
nanosleep300 23.21


Please fix the tv_nsec initialization so that we can see if nanosleep()
and 

Re: [RFC/RFT] calloutng

2013-01-08 Thread Alexander Motin
On 06.01.2013 17:23, Marius Strobl wrote:
 On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
 On 26.12.2012 01:21, Marius Strobl wrote:
 On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
 Experiments with dummynet shown ineffective support for very short
 tick-based callouts. New version fixes that, allowing to get as many
 tick-based callout events as hz value permits, while still be able to
 aggregate events and generating minimum of interrupts.

 Also this version modifies system load average calculation to fix some
 cases existing in HEAD and 9 branches, that could be fixed with new
 direct callout functionality.

 http://people.freebsd.org/~mav/calloutng_12_17.patch

 With several important changes made last time I am going to delay commit
 to HEAD for another week to do more testing. Comments and new test cases
 are welcome. Thanks for staying tuned and commenting.

 FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
 try on sparc64 and it at least survives a buildworld there. However,
 with the patched kernels, buildworld times seem to increase slightly but
 reproducible by 1-2% (I only did four runs but typically buildworld
 times are rather stable and don't vary more than a minute for the
 same kernel and source here). Is this an expected trade-off (system
 time as such doesn't seem to increase)?

 I don't think build process uses significant number of callouts to 
 affect results directly. I think this additional time could be result of 
 the deeper next event look up, done by the new code, that is practically 
 useless for sparc64, which effectively has no cpu_idle() routine. It 
 wouldn't affect system time and wouldn't show up in any statistics 
 (except PMC or something alike) because it is executed inside timer 
 hardware interrupt handler. If my guess is right, that is a part that 
 probably still could be optimized. I'll look on it. Thanks.

 Is there anything specific to test?

 Since the most of code is MI, for sparc64 I would mostly look on related 
 MD parts (eventtimers and timecounters) to make sure they are working 
 reliably in more stressful conditions.  I still have some worries about 
 possible deadlock on hardware where IPIs are used to fetch present time 
 from other CPU.
 
 Well, I've just learnt two things the hard way:
 a) We really need the mutex in that path.
 b) Assuming that the initial synchronization of the counters is good
enough and they won't drift considerably accross the CPUs so we can
always use the local one makes things go south pretty soon after
boot. At least with your calloutng_12_26.patch applied.

Do you think it means they are not really synchronized for some reason?

 I'm not really sure what to do about that. Earlier you already said
 that sched_bind(9) also isn't an option in case if td_critnest  1.
 To be honest, I don't really unerstand why using a spin lock in the
 timecounter path makes sparc64 the only problematic architecture
 for your changes. The x86 i8254_get_timecount() also uses a spin lock
 so it should be in the same boat.

The problem is not in using spinlock, but in waiting for other CPU while
spinlock is held. Other CPU may also hold spinlock and wait for
something, causing deadlock. i8254 code uses spinlock just to atomically
access hardware registers, so it causes no problems.

 The affected machines are equipped with a x86-style south bridge
 which exposes a powermanagment unit (intended to be used as a SMBus
 bridge only in these machines) on the PCI bus. Actually, this device
 also includes an ACPI power management timer. However, I've just
 spent a day trying to get that one working without success - it
 just doesn't increment. Probably its clock input isn't connected as
 it's not intended to be used in these machines.
 That south bridge also includes 8254 compatible timers on the ISA/
 LPC side, but are hidden from the OFW device tree. I can hack these
 devices into existence and give it a try, but even if that works this
 likely would use the same code as the x86 i8254_get_timecount() so I
 don't see what would be gained with that.
 
 The last thing in order to avoid using the tick counter as timecounter
 in the MP case I can think of is that the Broadcom MACs in the affected
 machines also provide a counter driven by a 1 MHz clock. If that's good
 enough for a timecounter I can hook these up (in case these work ...)
 and hack bge(4) to not detach in that case (given that we can't detach
 timecounters ...).

i8254 on x86 is also just a bit above 1MHz.

 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c

 
 I've run Ian's set of tests on a v215 with and without your
 calloutng_12_26.patch and on a v210 (these uses the IPI approach)
 with the latter also applied.
 I'm not really sure what to make out of the numbers.
 
v215 w/o v215 w/  

Re: [RFC/RFT] calloutng

2013-01-08 Thread Alexander Motin
On 06.01.2013 18:20, Luigi Rizzo wrote:
 On Sun, Jan 06, 2013 at 04:23:13PM +0100, Marius Strobl wrote:
 On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c


 I've run Ian's set of tests on a v215 with and without your
 calloutng_12_26.patch and on a v210 (these uses the IPI approach)
 with the latter also applied.
 I'm not really sure what to make out of the numbers.

v215 w/o v215 w/  v210 w/ 
 --   
 select  1   1999.61  1 23.87  1 29.97
 poll1   1999.70  1   1069.61  1   1075.24
 usleep  1   1999.86  1 23.43  1 28.99
 nanosleep   1999.92  1 23.28  1 28.66
 kqueue  1   1000.12  1   1071.13  1   1076.35
 kqueueto1999.56  1 26.33  1 31.34
 syscall 1  1.89  1  1.92  1  2.88
 select300   1999.72300326.08300332.24
 poll  300   1999.12300   1069.78300   1075.82
 usleep300   1999.91300325.63300330.94
 nanosleep 300999.82300 23.25300 26.76
 kqueue300   1000.14300   1071.06300   1075.96
 kqueueto  300999.53300 26.32300 31.42
 syscall   300  1.90300  1.93300  2.89
 select   3000   3998.18   3000   3176.51   3000   3193.86
 poll 3000   3999.29   3000   3182.21   3000   3193.12
 usleep   3000   3998.46   3000   3191.60   3000   3192.50
 nanosleep3000   1999.79   3000 23.21   3000 27.02
 kqueue   3000   3000.12   3000   3189.13   3000   3191.96
 kqueueto 3000   1999.99   3000 26.28   3000 31.91
 syscall  3000  1.91   3000  1.91   3000  2.90
 select  3  30990.85  3  31489.18  3  31548.77
 poll3  30995.25  3  31518.80  3  31487.92
 usleep  3  30992.00  3  31510.42  3  31475.50
 nanosleep   3   1999.46  3 38.67  3 41.95
 kqueue  3  30006.49  3  30991.86  3  30996.77
 kqueueto3   1999.09  3 41.67  3 46.36
 syscall 3  1.91  3  1.91  3  2.88
 select 30 300990.65 30 301864.98 30 301787.01
 poll   30 300998.09 30 301831.36 30 301741.62
 usleep 30 300990.80 30 301824.67 30 301770.10
 nanosleep  30   1999.15 30325.74 30331.01
 kqueue 30 30.87 30 301031.11 30 300992.28
 kqueueto   30   1999.39 30328.77 30333.45
 syscall30  1.91 30  1.91 30  2.88
 
 the nanosleep and kqueueto tests are probably passing the wrong
 argument to the syscall (meant to be microseconds, but nanosleep
 takes nanosecond so it should probably be multiplied by 1000).

Yes, you are right. I've used it in such way to see what will happen if
I request sub-microsecond interval. In this test these rows are
definitely incorrect.

 I think that for the time being it would be useful to run at least
 one set of tests with kern.timecounter.alloweddeviation=0 so we can
 tell how close we get to the required timeouts

May be just to be sure, because it should not significantly affect
results of the 1us tests, as 5% of 1us is much less then numbers we see
there.

-- 
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-08 Thread Luigi Rizzo
On Tue, Jan 08, 2013 at 12:55:27PM +0200, Alexander Motin wrote:
 On 06.01.2013 18:20, Luigi Rizzo wrote:
...
  I think that for the time being it would be useful to run at least
  one set of tests with kern.timecounter.alloweddeviation=0 so we can
  tell how close we get to the required timeouts
 
 May be just to be sure, because it should not significantly affect
 results of the 1us tests, as 5% of 1us is much less then numbers we see
 there.

to clarify - i don't mind if we are 50-100us (absolute error) off the
requested timeout for short intervals, but i want to be sure
that this error can be achieved also for large requests.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-06 Thread Marius Strobl
On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
 On 26.12.2012 01:21, Marius Strobl wrote:
  On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With several important changes made last time I am going to delay commit
  to HEAD for another week to do more testing. Comments and new test cases
  are welcome. Thanks for staying tuned and commenting.
 
  FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
  try on sparc64 and it at least survives a buildworld there. However,
  with the patched kernels, buildworld times seem to increase slightly but
  reproducible by 1-2% (I only did four runs but typically buildworld
  times are rather stable and don't vary more than a minute for the
  same kernel and source here). Is this an expected trade-off (system
  time as such doesn't seem to increase)?
 
 I don't think build process uses significant number of callouts to 
 affect results directly. I think this additional time could be result of 
 the deeper next event look up, done by the new code, that is practically 
 useless for sparc64, which effectively has no cpu_idle() routine. It 
 wouldn't affect system time and wouldn't show up in any statistics 
 (except PMC or something alike) because it is executed inside timer 
 hardware interrupt handler. If my guess is right, that is a part that 
 probably still could be optimized. I'll look on it. Thanks.
 
  Is there anything specific to test?
 
 Since the most of code is MI, for sparc64 I would mostly look on related 
 MD parts (eventtimers and timecounters) to make sure they are working 
 reliably in more stressful conditions.  I still have some worries about 
 possible deadlock on hardware where IPIs are used to fetch present time 
 from other CPU.

Well, I've just learnt two things the hard way:
a) We really need the mutex in that path.
b) Assuming that the initial synchronization of the counters is good
   enough and they won't drift considerably accross the CPUs so we can
   always use the local one makes things go south pretty soon after
   boot. At least with your calloutng_12_26.patch applied.

I'm not really sure what to do about that. Earlier you already said
that sched_bind(9) also isn't an option in case if td_critnest  1.
To be honest, I don't really unerstand why using a spin lock in the
timecounter path makes sparc64 the only problematic architecture
for your changes. The x86 i8254_get_timecount() also uses a spin lock
so it should be in the same boat.

The affected machines are equipped with a x86-style south bridge
which exposes a powermanagment unit (intended to be used as a SMBus
bridge only in these machines) on the PCI bus. Actually, this device
also includes an ACPI power management timer. However, I've just
spent a day trying to get that one working without success - it
just doesn't increment. Probably its clock input isn't connected as
it's not intended to be used in these machines.
That south bridge also includes 8254 compatible timers on the ISA/
LPC side, but are hidden from the OFW device tree. I can hack these
devices into existence and give it a try, but even if that works this
likely would use the same code as the x86 i8254_get_timecount() so I
don't see what would be gained with that.

The last thing in order to avoid using the tick counter as timecounter
in the MP case I can think of is that the Broadcom MACs in the affected
machines also provide a counter driven by a 1 MHz clock. If that's good
enough for a timecounter I can hook these up (in case these work ...)
and hack bge(4) to not detach in that case (given that we can't detach
timecounters ...).

 
 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
 

I've run Ian's set of tests on a v215 with and without your
calloutng_12_26.patch and on a v210 (these uses the IPI approach)
with the latter also applied.
I'm not really sure what to make out of the numbers.

   v215 w/o v215 w/  v210 w/ 
--   
select  1   1999.61  1 23.87  1 29.97
poll1   1999.70  1   1069.61  1   1075.24
usleep  1   1999.86  1 23.43  1 28.99
nanosleep   1999.92  1 23.28  1 28.66
kqueue  1   1000.12  1   1071.13  1   1076.35

Re: [RFC/RFT] calloutng

2013-01-06 Thread Luigi Rizzo
On Sun, Jan 06, 2013 at 04:23:13PM +0100, Marius Strobl wrote:
 On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
  On 26.12.2012 01:21, Marius Strobl wrote:
   On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
   Experiments with dummynet shown ineffective support for very short
   tick-based callouts. New version fixes that, allowing to get as many
   tick-based callout events as hz value permits, while still be able to
   aggregate events and generating minimum of interrupts.
  
   Also this version modifies system load average calculation to fix some
   cases existing in HEAD and 9 branches, that could be fixed with new
   direct callout functionality.
  
   http://people.freebsd.org/~mav/calloutng_12_17.patch
  
   With several important changes made last time I am going to delay commit
   to HEAD for another week to do more testing. Comments and new test cases
   are welcome. Thanks for staying tuned and commenting.
  
   FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
   try on sparc64 and it at least survives a buildworld there. However,
   with the patched kernels, buildworld times seem to increase slightly but
   reproducible by 1-2% (I only did four runs but typically buildworld
   times are rather stable and don't vary more than a minute for the
   same kernel and source here). Is this an expected trade-off (system
   time as such doesn't seem to increase)?
  
  I don't think build process uses significant number of callouts to 
  affect results directly. I think this additional time could be result of 
  the deeper next event look up, done by the new code, that is practically 
  useless for sparc64, which effectively has no cpu_idle() routine. It 
  wouldn't affect system time and wouldn't show up in any statistics 
  (except PMC or something alike) because it is executed inside timer 
  hardware interrupt handler. If my guess is right, that is a part that 
  probably still could be optimized. I'll look on it. Thanks.
  
   Is there anything specific to test?
  
  Since the most of code is MI, for sparc64 I would mostly look on related 
  MD parts (eventtimers and timecounters) to make sure they are working 
  reliably in more stressful conditions.  I still have some worries about 
  possible deadlock on hardware where IPIs are used to fetch present time 
  from other CPU.
 
 Well, I've just learnt two things the hard way:
 a) We really need the mutex in that path.
 b) Assuming that the initial synchronization of the counters is good
enough and they won't drift considerably accross the CPUs so we can
always use the local one makes things go south pretty soon after
boot. At least with your calloutng_12_26.patch applied.
 
 I'm not really sure what to do about that. Earlier you already said
 that sched_bind(9) also isn't an option in case if td_critnest  1.
 To be honest, I don't really unerstand why using a spin lock in the
 timecounter path makes sparc64 the only problematic architecture
 for your changes. The x86 i8254_get_timecount() also uses a spin lock
 so it should be in the same boat.
 
 The affected machines are equipped with a x86-style south bridge
 which exposes a powermanagment unit (intended to be used as a SMBus
 bridge only in these machines) on the PCI bus. Actually, this device
 also includes an ACPI power management timer. However, I've just
 spent a day trying to get that one working without success - it
 just doesn't increment. Probably its clock input isn't connected as
 it's not intended to be used in these machines.
 That south bridge also includes 8254 compatible timers on the ISA/
 LPC side, but are hidden from the OFW device tree. I can hack these
 devices into existence and give it a try, but even if that works this
 likely would use the same code as the x86 i8254_get_timecount() so I
 don't see what would be gained with that.
 
 The last thing in order to avoid using the tick counter as timecounter
 in the MP case I can think of is that the Broadcom MACs in the affected
 machines also provide a counter driven by a 1 MHz clock. If that's good
 enough for a timecounter I can hook these up (in case these work ...)
 and hack bge(4) to not detach in that case (given that we can't detach
 timecounters ...).
 
  
  Here is small tool we are using for test correctness and performance of 
  different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
  
 
 I've run Ian's set of tests on a v215 with and without your
 calloutng_12_26.patch and on a v210 (these uses the IPI approach)
 with the latter also applied.
 I'm not really sure what to make out of the numbers.
 
v215 w/o v215 w/  v210 w/ 
 --   
 select  1   1999.61  1 23.87  1 29.97
 poll1   1999.70  1   1069.61  1   1075.24
 usleep  1   1999.86  1 23.43  

Re: [RFC/RFT] calloutng

2013-01-03 Thread Luigi Rizzo
On Wed, Jan 02, 2013 at 09:52:37PM -0800, Kevin Oberman wrote:
 On Wed, Jan 2, 2013 at 2:06 PM, Alexander Motin m...@freebsd.org wrote:
  On 02.01.2013 18:08, Adrian Chadd wrote:
 
  .. I'm pretty damned sure we're going to need to enforce a never
  earlier than X latency.
 
 
  Do you mean here that we should never wake up before specified time (just as
  specified by the most of existing APIs), or that we should not allow sleep
  shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't
  allow to block the system. Also there is already present mechanism for
  specifying minimum timer programming interval in eventtimers(9) KPI.
 
 I can see serious performance issues with some hardware (wireless
 comes to mind) if things happen too quickly. Intuition is that it
 could also play hob with VMs.
 
 I believe that the proper way is to wake between  T_X and T_X + D.
 This assumes that D is max_wake_delay, not deviation, which leaves us
 at the original of (T_X) = event_time = (T_X + D).

i think max delay was the intended meaning of the D parameter.
We picked bad names (tolerance, deviation,...) for it.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-03 Thread Bruce Evans

On Wed, 2 Jan 2013, Alexander Motin wrote:


On 02.01.2013 19:09, Konstantin Belousov wrote:

On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


Fine-grained timeouts complete fully opening this security hole.
Synchronization without fine-grained timeouts might allow the same,
but is harder to exploit since you can't control the yielding points
directly.  With fine-grained timeouts, you just have to predict the
statclock firing points.  Use one timeout to arrange to yield just
before statclock fires and another to regain control just after it has
fired.  If the timeout resolution is say 50 usec, then this can hope
to run for all except 100 usec out of every 1/stathz seconds.  With
stathz = 128, 1/stathz is 7812 usec, so this gives 7712/7812 of the
CPU with 0 statclock ticks.  Since the scheduler never sees you running,
your priority remains minimal, so the scheduler should prefer to run
you whenever a timeout expires, with only round-robin with other
minimal-priority threads preventing you getting 7712/7812 of the (user
non-rtprio) CPU.

The previous stage of fully opening this security hole was changing
(the default) HZ from 100 to 1000.  HZ must not be much smaller than
stathz, else the security hole is almost fully open.  With HZ = 100
being less than stathz and timeout granularity limiting the fine control
to 2/HZ = 20 msec (except you can use a periodic itimer to get a 1/HZ
granularity at a minor cost of getting more SIGALRMs), it is impossible
to get near 100% of the CPU with 0 statclock ticks.  After yielding,
you can't get control for another 100 or 200 msec.  Since this exceeds
1/stathz = 78.12 usec, you can only hide from statclock ticks by not
running very often or for very long.  Limited hiding is possible by
wasting even more CPU to determine when to hide: since the timeout
granularity is large, it is also ineffective for determining when to
yield.  So when running, you must poll the current time a lot to
determine when to yield.  Yield just before statclock fires, as above.
(Do it 50 usec early, as above, to avoid most races involving polling
the time.)  This actually has good chances of not limiting the hiding
too much, depending on the details of the scheduling.  It yields just
before a statclock tick.  After this tick fires, if the scheduler
reschedules for any reason, then the hiding process would most likely
be run again, since its priority is minimal.  But at least the old
4BSD scheduler doesn't reschedule after _every_ statclock tick.  This
depends on the bugfeature that the priority is not checked on _every_
return to user mode (sched_clock() does change the priority, but this
is not acted on until much later).  Without this bugfeature, there
would be excessive context switches.  OTOH, with timeouts, at least
old non-fine-grained ones, you can force a rescheduling that is acted
on soon enough simply by using timeouts (since timeouts give a context
switch to the softclock thread, the scheduler has no option to skip
checking the priority on return to user mode).

After the previous stage of changing HZ to 1000, the granuarity is fine
enough for using timeouts to hide from the scheduler.  Using a periodic
itimer to get a granularity of 1000 usec, start hiding 50-1000 usec
before each statclock tick and regain control 1000 usec later.  With
stathz = 128, 6812/7812 of the CPU with 0 statclock ticks.  Not much
worse (for the hider) than 7712/7812.

Statclock was supposed to be aperiodic to avoid hiding (see
statclk-usenix93.ps), but this was never implemented in FreeBSD.  With
fine-grained timeouts, it would have to be very aperiodic, to the point
of giving large inaccuracies, to limit the hiding very much.  For
example, suppose that it has an average period of 7812 usec with +-50%
jitter.  You would try to hide from it most of the time by running for
a bit less than 7812/2 usec before yielding in most cases.  If too
much scheduling is done on each statclock tick, then you are likely
to regain control after each one (as above) and then know that there
is almost a full minimal period until the next one.  Otherwise, it
seems to be necessary to determine when the previous statclock tick
occurred, so as to determine the minimum time until the next one.

There are many different kinds of accounting with different characteristics. 
Run time for each thread 

Re: [RFC/RFT] calloutng

2013-01-03 Thread Alexander Motin

On 03.01.2013 16:45, Bruce Evans wrote:

On Wed, 2 Jan 2013, Alexander Motin wrote:

More important for scheduling fairness thread's CPU percentage is also
based on hardclock() and hiding from it was trivial before, since all
sleep primitives were strictly aligned to hardclock(). Now it is
slightly less trivial, since this alignment was removed and user-level
APIs provide no easy way to enforce it.


%cpu is actually based on statclock(), and not even used for scheduling.


May be for SCHED_4BSD, but not for SCHED_ULE.  In SCHED_ULE both %cpu 
and thread priority based on the same ts_ticks counter, that is based on 
hardclock() as time source. Interactivity calculation uses alike logic 
and uses the same time source.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-03 Thread Bruce Evans

On Thu, 3 Jan 2013, Alexander Motin wrote:


On 03.01.2013 16:45, Bruce Evans wrote:

On Wed, 2 Jan 2013, Alexander Motin wrote:

More important for scheduling fairness thread's CPU percentage is also
based on hardclock() and hiding from it was trivial before, since all
sleep primitives were strictly aligned to hardclock(). Now it is
slightly less trivial, since this alignment was removed and user-level
APIs provide no easy way to enforce it.


%cpu is actually based on statclock(), and not even used for scheduling.


May be for SCHED_4BSD, but not for SCHED_ULE.  In SCHED_ULE both %cpu and 
thread priority based on the same ts_ticks counter, that is based on 
hardclock() as time source. Interactivity calculation uses alike logic and 
uses the same time source.


Hmm.  I missed this because it hacks on the 'ticks' global.  It is clearer
in intermediate versions which use the scheduler API sched_tick(), which
is the hardclock analogue of sched_clock() for statclock.  sched_tick() is
now bogus since it is null for all schedulers.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Luigi Rizzo
On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote:
 On 31.12.2012 08:17, Luigi Rizzo wrote:
 On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...
 Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
 It's so sweet to be able to do precision sleeps.
 
 Thank you for testing, Ian.
 
 interesting numbers, but there seems to be some problem in computing
 the exact interval; delays are much larger than expected.
 
 In this test, the original timer code used to round to the next multiple
 of 1 tick and then add another tick (except for the kqueue case),
 which is exactly what you see in the second set of measurements.
 
 The calloutng code however seems to do something odd:
 in addition to fixed overhead (some 50us, which you can see in
 the tests for 1us and 300us), all delay seem to be ~10% larger
 than what is requested, upper bounded to 10ms (note, the
 numbers are averages so i cannot tell whether all samples are
 the same or there is some distribution of values).
 
 I am not sure if this error is peculiar of the ARM version or also
 appears on x86/amd64 but I believe it should be fixed.
 
 If you look at the results below:
 
 1us  possily ok:
  for very short intervals i would expect some kind
  of 'reschedule' without actually firing a timer; maybe
  50us are what it takes to do a round through the scheduler ?
 
 300usprobably ok
  i guess the extra 50-90us are what it takes to do a round
  through the scheduler
 
 1000us   borderline (this is the case for poll and kqueue, which are
  rounded to 1ms)
  here intervals seem to be increased by 10%, and i cannot see
  a good reason for this (more below).
 
 3000us and above: wrong
  here again, the intervals seem to be 10% larger than what is
  requested, perhaps limiting the error to 10-20ms.
 
 
 Maybe the 10% extension results from creating a default 'precision'
 for legacy calls, but i do not think this is done correctly.
 
 First of all, if users do not specify a precision themselves, the
 automatically generated value should never exceed one tick.
 
 Second, the only point of a 'precision' parameter is to merge
 requests that may be close in time, so if there is already a
 timer scheduled within [Treq, Treq+precision] i will get it;
 but if there no pending timer, then one should schedule it
 for the requested interval.
 
 Davide/Alexander, any ideas ?
 
 All mentioned effects could be explained with implemented logic. 50us at 
 1us is probably sum of minimal latency of the hardware eventtimer on the 
 specific platform and some software processing overhead (syscall, 
 callout, timecouters, scheduler, etc). At later points system starts to 
 noticeably use precision specified by kern.timecounter.alloweddeviation 
 sysctl. It affects results from two sides: 1) extending intervals for 
 specified percent of time to allow event aggregation, and 2) choosing 
 time base between fast getbinuptime() and precise binuptime(). Extending 
 interval is needed to aggregate not only callouts with each other, but 
 also callouts with other system events, which are impossible to schedule 
 in advance. It gives specified relative error, but no more then one CPU 
 wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) 
 it is 1/hz, for completely idle one it can be up to 0.5s. Second point 
 allows to reduce processing overhead by the cost of error up to 1/hz for 
 long periods ((100/allowed)*(1/hz)), when it is used.

i am not sure what you mean by extending interval, but i believe the
logic should be the following:

- say user requests a timeout after X seconds and with a tolerance of D second
  (both X and D are fractional, so they can be short).  Interpret this as

   the system should do its best to generate an event between X and X+D 
seconds

- convert X to an absolute time, T_X

- if there are any pending events already scheduled between T_X and T_X+D,
  then by definition they are acceptable. Attach the requested timeout
  to the earliest of these events.

- otherwise, schedule an event at time T_X (because there is no valid
  reason to generate a late event, and it makes no sense from an
  energy saving standpoint, either -- see below).

It seems to me that you are instead extending the requested interval
upfront, which causes some gratuitous pessimizations in scheduling
the callout.

Re. energy savings: the gain in extending the timeout cannot exceed
the value D/X. So while it may make sense to extend a 1us request
to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
it is completely pointless from an energy saving standpoint to
introduce a 10ms error on a 300ms request.

(even though i hate the idea that a 1us request defaults to
a 50us delay; but that is hopefully something that can be tuned
in a 

Re: [RFC/RFT] calloutng

2013-01-02 Thread Alexander Motin

On 02.01.2013 12:57, Luigi Rizzo wrote:

On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote:

On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:

...

Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at
1us is probably sum of minimal latency of the hardware eventtimer on the
specific platform and some software processing overhead (syscall,
callout, timecouters, scheduler, etc). At later points system starts to
noticeably use precision specified by kern.timecounter.alloweddeviation
sysctl. It affects results from two sides: 1) extending intervals for
specified percent of time to allow event aggregation, and 2) choosing
time base between fast getbinuptime() and precise binuptime(). Extending
interval is needed to aggregate not only callouts with each other, but
also callouts with other system events, which are impossible to schedule
in advance. It gives specified relative error, but no more then one CPU
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks)
it is 1/hz, for completely idle one it can be up to 0.5s. Second point
allows to reduce processing overhead by the cost of error up to 1/hz for
long periods ((100/allowed)*(1/hz)), when it is used.


i am not sure what you mean by extending interval, but i believe the
logic should be the following:

- say user requests a timeout after X seconds and with a tolerance of D second
   (both X and D are fractional, so they can be short).  Interpret this as

the system should do its best to generate an event between X and X+D 
seconds

- convert X to an absolute time, T_X

- if there are any pending events already scheduled between T_X and T_X+D,
   then by definition they are acceptable. Attach the requested timeout
   to the earliest of these events.


All above is true, but not following.


- otherwise, schedule an event at time T_X (because there is no valid
   reason to generate a late event, and it makes no sense from an
   energy saving standpoint, either -- see below).


System may have many interrupts except timer: network, disk, ... WiFi 
cards generate interrupts with AP beacon rate -- dozens times per 
second. It is not very efficient to wake up CPU precisely at T_X time, 
that may be just 100us earlier then next hardware interrupt. That's why 
timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next 
statclock, ...). As result, event will be handled within allowed range, 
but real delay will depends on current environment conditions.



It seems to me that you are instead extending the requested interval

Re: [RFC/RFT] calloutng

2013-01-02 Thread Luigi Rizzo
On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
 On 02.01.2013 12:57, Luigi Rizzo wrote:
...
 i am not sure what you mean by extending interval, but i believe the
 logic should be the following:
 
 - say user requests a timeout after X seconds and with a tolerance of D 
 second
(both X and D are fractional, so they can be short).  Interpret this as
 
 the system should do its best to generate an event between X and X+D 
 seconds
 
 - convert X to an absolute time, T_X
 
 - if there are any pending events already scheduled between T_X and T_X+D,
then by definition they are acceptable. Attach the requested timeout
to the earliest of these events.
 
 All above is true, but not following.
 
 - otherwise, schedule an event at time T_X (because there is no valid
reason to generate a late event, and it makes no sense from an
energy saving standpoint, either -- see below).
 
 System may have many interrupts except timer: network, disk, ... WiFi 
 cards generate interrupts with AP beacon rate -- dozens times per 
 second. It is not very efficient to wake up CPU precisely at T_X time, 
 that may be just 100us earlier then next hardware interrupt. That's why 
 timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next 
 statclock, ...). As result, event will be handled within allowed range, 
 but real delay will depends on current environment conditions.

I don't see why system events (hardclock, statclock, 0.5s,...)
need to be treated specially -- and i am saying this also in
the interest of simplifying the logic of the code.

First of all, if you know that there is already a hardclock/statclock/*
scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
was no event scheduled in [T_X, T_X+D] so you need to generate
a new one.

Surely scheduling the event at T_X+D instead of T_X increases the
chance of merging events. But the saving are smaller and smaller
as the value X increases. This particular client will only
change its request rate from 1/X to 1/(X+D) so in relative terms
the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X

Example: if X = 300ms, and D = 10ms (as in the test case)
you just save one interrupt every 30seconds by scheduling at
T_X+D instead of T_X. Are we actually able to measure the
difference ?

Even at high interrupt rates (e.g. X = 1ms) you are not
going to save a lot unless the tolerance D is very large,
which is generally undesirable for other reasons
(presumably, applications are not going to be happy
if you artificially double their timeouts).
Now, say your application requests timeouts every X = 300ms.

 It seems to me that you are instead extending the requested interval
 upfront, which causes some gratuitous pessimizations in scheduling
 the callout.
 
 Re. energy savings: the gain in extending the timeout cannot exceed
 the value D/X. So while it may make sense to extend a 1us request
 to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
 it is completely pointless from an energy saving standpoint to
 introduce a 10ms error on a 300ms request.
 
 I am not so sure in this. When CPU package is in C7 sleep state with all 
 buses and caches shut down and memory set to self refresh, it consumes 
 very few (some milli-Watts) of power. Wake up from that state takes 
 100us or even more with power consumption much higher then normal 
 operational one. Sure, if we compare it with power consumption of 100% 
 CPU load, difference between 10 and 100 wakeups per second may be small, 
 but when comparing to each other in some low-power environment for 
 mostly idle system it may be much more significant.

see above -- at low rates the difference is not measurable,
at high rates thCe only obvious answer is do not use C7 unless
if the next interrupt is due in less than 2..5 milliseconds

 (even though i hate the idea that a 1us request defaults to
 a 50us delay; but that is hopefully something that can be tuned
 in a platform-specific way and perhaps at runtime).
 
 It is 50us on this ARM. On SandyBridge Core i7 it is only about 2us.

very good, i suspected something similar, just wanted to be sure :)

cheers
luigi

 -- 
 Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Alexander Motin
02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал:

 On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
  On 02.01.2013 12:57, Luigi Rizzo wrote:
 ...
  i am not sure what you mean by extending interval, but i believe the
  logic should be the following:
  
  - say user requests a timeout after X seconds and with a tolerance of D
  second
 (both X and D are fractional, so they can be short).  Interpret
this as
  
  the system should do its best to generate an event between X and
X+D
  seconds
  
  - convert X to an absolute time, T_X
  
  - if there are any pending events already scheduled between T_X and
T_X+D,
 then by definition they are acceptable. Attach the requested timeout
 to the earliest of these events.
 
  All above is true, but not following.
 
  - otherwise, schedule an event at time T_X (because there is no valid
 reason to generate a late event, and it makes no sense from an
 energy saving standpoint, either -- see below).
 
  System may have many interrupts except timer: network, disk, ... WiFi
  cards generate interrupts with AP beacon rate -- dozens times per
  second. It is not very efficient to wake up CPU precisely at T_X time,
  that may be just 100us earlier then next hardware interrupt. That's why
  timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next
  statclock, ...). As result, event will be handled within allowed range,
  but real delay will depends on current environment conditions.

 I don't see why system events (hardclock, statclock, 0.5s,...)
 need to be treated specially -- and i am saying this also in
 the interest of simplifying the logic of the code.

Sure. That is mostly for historical reasons. At some point they should
disappear, just not now, as patch is already quite big.

 First of all, if you know that there is already a hardclock/statclock/*
 scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
 was no event scheduled in [T_X, T_X+D] so you need to generate
 a new one.

That is true, but my main point was about merging with external events,
which I can't predict and the only way to merge is increase sleep period,
hoping for better.

 Surely scheduling the event at T_X+D instead of T_X increases the
 chance of merging events. But the saving are smaller and smaller
 as the value X increases. This particular client will only
 change its request rate from 1/X to 1/(X+D) so in relative terms
 the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X

 Example: if X = 300ms, and D = 10ms (as in the test case)
 you just save one interrupt every 30seconds by scheduling at
 T_X+D instead of T_X. Are we actually able to measure the
 difference ?

 Even at high interrupt rates (e.g. X = 1ms) you are not
 going to save a lot unless the tolerance D is very large,
 which is generally undesirable for other reasons
 (presumably, applications are not going to be happy
 if you artificially double their timeouts).
 Now, say your application requests timeouts every X = 300ms.

With default precision set to 5% it will be only 5% save from periods
increase. But that is absolutely not my goal!

Imagine different case: you have NIC interrupts at 1000Hz. Also you have
100 callouts with 100ms period each. If we program timer with absolute
precision, you will get about 2000Hz of total interrupt rate. But if we
allow just 2% deviation, most of callouts will be grouped with NIC
interrupts and total rate will be 1000Hz. Loosing _less_ then 2% of
precision we are reducing interrupt rate in _half_!

  It seems to me that you are instead extending the requested interval
  upfront, which causes some gratuitous pessimizations in scheduling
  the callout.
  
  Re. energy savings: the gain in extending the timeout cannot exceed
  the value D/X. So while it may make sense to extend a 1us request
  to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
  it is completely pointless from an energy saving standpoint to
  introduce a 10ms error on a 300ms request.
 
  I am not so sure in this. When CPU package is in C7 sleep state with all
  buses and caches shut down and memory set to self refresh, it consumes
  very few (some milli-Watts) of power. Wake up from that state takes
  100us or even more with power consumption much higher then normal
  operational one. Sure, if we compare it with power consumption of 100%
  CPU load, difference between 10 and 100 wakeups per second may be small,
  but when comparing to each other in some low-power environment for
  mostly idle system it may be much more significant.

 see above -- at low rates the difference is not measurable,
 at high rates thCe only obvious answer is do not use C7 unless
 if the next interrupt is due in less than 2..5 milliseconds

  (even though i hate the idea that a 1us request defaults to
  a 50us delay; but that is hopefully something that can be tuned
  in a platform-specific way and perhaps at runtime).
 
  It is 50us on this ARM. 

Re: [RFC/RFT] calloutng

2013-01-02 Thread Ian Lepore
On Wed, 2013-01-02 at 15:11 +0200, Alexander Motin wrote:
 02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал:
 
  On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
   On 02.01.2013 12:57, Luigi Rizzo wrote:

  First of all, if you know that there is already a hardclock/statclock/*
  scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
  was no event scheduled in [T_X, T_X+D] so you need to generate
  a new one.
 
 That is true, but my main point was about merging with external events,
 which I can't predict and the only way to merge is increase sleep period,
 hoping for better.
 

This really is the crux of the problem, because you can't *by default*
dispatch an event earlier than requested because that's just a violation
of the usual rules of precision timing (where you expect to be late but
never early).

Sometimes there is no need for such precision, and an early wakeup is no
more or less detrimental than a late wakeup.  In fact, that may even be
the majority case.  I wonder if it might make sense to allow the
precision specification to indicate whether it needs traditional never
early behavior or whether it can be interpretted as plus or minus this
amount is fine.  Like maybe negative precision is interpretted as plus
or minus abs(precision) or something like that.

Or maybe even the other way around... you get plus or minus precision
by default and the few things that really care about precision timing
have a way of indicating that.  (But in that case the userland sleeps
would have to assume the traditional behavior because that's how they've
always been documented.)

-- Ian


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2013-01-02 Thread Luigi Rizzo
On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote:
...
  First of all, if you know that there is already a hardclock/statclock/*
  scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
  was no event scheduled in [T_X, T_X+D] so you need to generate
  a new one.
 
 That is true, but my main point was about merging with external events,
 which I can't predict and the only way to merge is increase sleep period,
 hoping for better.

ok, now i understand why you want to schedule for T_X+D.

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

cheers
luigi

-+---
  Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/. Universita` di Pisa
  TEL  +39-050-2211611   . via Diotisalvi 2
  Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Konstantin Belousov
On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:
 On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote:
 ...
   First of all, if you know that there is already a hardclock/statclock/*
   scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
   was no event scheduled in [T_X, T_X+D] so you need to generate
   a new one.
  
  That is true, but my main point was about merging with external events,
  which I can't predict and the only way to merge is increase sleep period,
  hoping for better.
 
 ok, now i understand why you want to schedule for T_X+D.
 
 Probably one way to close this discussion would be to provide
 a sysctl so the sysadmin can decide which point in the interval
 to pick when there is no suitable callout already scheduled.
Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


pgpVV7IP9DZbJ.pgp
Description: PGP signature


Re: [RFC/RFT] calloutng

2013-01-02 Thread Adrian Chadd
.. I'm pretty damned sure we're going to need to enforce a never
earlier than X latency.

Is there a more detailed writeup of calloutng somewhere, besides
David's slides? The wiki page is rather empty.

Eg - I think this work does coalesce wakeups, right? Or it can? So
when in low-power scenarios you can end up with lower-resolution
callout periods, but many less CPU wakeups a second?

(Do we actually _expose_ wakeups-per-second somewhere?)



Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Alexander Motin

On 02.01.2013 19:09, Konstantin Belousov wrote:

On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


There are many different kinds of accounting with different 
characteristics. Run time for each thread calculated using high 
resolution per-CPU clocks on each context switch. It is impossible to 
hide from it. System load average updated using callout and aligned with 
hardclock(). Hiding from it was easy before, but it can be made more 
complicated (asynchronous) now. Per-CPU SYSTEM/INTERRUPT/USER/NICE/IDLE 
counters are updated by statcklock(), that is asynchronous to 
hardclock() and hiding from it supposed to be more complicated. But even 
before it was possible, since hardclock() frequency is 8 times higher 
then statclock() one. More important for scheduling fairness thread's 
CPU percentage is also based on hardclock() and hiding from it was 
trivial before, since all sleep primitives were strictly aligned to 
hardclock(). Now it is slightly less trivial, since this alignment was 
removed and user-level APIs provide no easy way to enforce it.


The only way to get really safe accounting is forget about sampling and 
use potentially more expensive other ways. It was always stopped by lack 
of cheap and reliable clocks. But since TSC is P-state invariant on most 
of CPUs present now it could probably be reconsidered.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Alexander Motin

On 02.01.2013 18:08, Adrian Chadd wrote:

.. I'm pretty damned sure we're going to need to enforce a never
earlier than X latency.


Do you mean here that we should never wake up before specified time 
(just as specified by the most of existing APIs), or that we should not 
allow sleep shorter then some value to avoid DoS? At least on x86 
nanosleep(0) doesn't allow to block the system. Also there is already 
present mechanism for specifying minimum timer programming interval in 
eventtimers(9) KPI.



Is there a more detailed writeup of calloutng somewhere, besides
David's slides? The wiki page is rather empty.


There are updated manual pages in the patch. Also Davide written some 
blog during GSoC. Now we are working on papers for the AsiaBSDCon.



Eg - I think this work does coalesce wakeups, right? Or it can? So
when in low-power scenarios you can end up with lower-resolution
callout periods, but many less CPU wakeups a second?


This work does coalesce wakeups out of the box, but also provide ways to 
improve it further, where possible. With additional tuning of some 
kernel subsystems and drivers I was able to drop total idle interrupt 
rate down to 10-15Hz on arm and 20-30Hz on x86.



(Do we actually _expose_ wakeups-per-second somewhere?)


On systems with ACPI there are average per-CPU sleep times exposed via 
sysctls dev.cpu.X.cx_usage. Also cpu_idle() call rate calculated by both 
schedulers for purposes of idle loop optimizations, but it is not 
exposed outside now. Also for idle SMP system enabling COUNT_IPIS should 
give number of interrupts in systat comparable to number of wakeups. I 
am mostly using the last way.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2013-01-02 Thread Kevin Oberman
On Wed, Jan 2, 2013 at 2:06 PM, Alexander Motin m...@freebsd.org wrote:
 On 02.01.2013 18:08, Adrian Chadd wrote:

 .. I'm pretty damned sure we're going to need to enforce a never
 earlier than X latency.


 Do you mean here that we should never wake up before specified time (just as
 specified by the most of existing APIs), or that we should not allow sleep
 shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't
 allow to block the system. Also there is already present mechanism for
 specifying minimum timer programming interval in eventtimers(9) KPI.

I can see serious performance issues with some hardware (wireless
comes to mind) if things happen too quickly. Intuition is that it
could also play hob with VMs.

I believe that the proper way is to wake between  T_X and T_X + D.
This assumes that D is max_wake_delay, not deviation, which leaves us
at the original of (T_X) = event_time = (T_X + D).
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6...@gmail.com
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-31 Thread Davide Italiano
On Sun, Dec 30, 2012 at 10:17 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
 On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
 ...
 I grabbed testsleep.c to test an arm event timer implementation, and had
 to fix a couple nits... kqueueto was missing from the names[] array, and
 I had to add a * 1000 to a couple places where usec was stuffed into a
 timespec's tv_nsec.

 I also tested the calloutng_12_17 patches and the kqueue stuff behaved
 very strangely.  Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.

 It's so sweet to be able to do precision sleeps.

 interesting numbers, but there seems to be some problem in computing
 the exact interval; delays are much larger than expected.

 In this test, the original timer code used to round to the next multiple
 of 1 tick and then add another tick (except for the kqueue case),
 which is exactly what you see in the second set of measurements.

 The calloutng code however seems to do something odd:
 in addition to fixed overhead (some 50us, which you can see in
 the tests for 1us and 300us), all delay seem to be ~10% larger
 than what is requested, upper bounded to 10ms (note, the
 numbers are averages so i cannot tell whether all samples are
 the same or there is some distribution of values).

 I am not sure if this error is peculiar of the ARM version or also
 appears on x86/amd64 but I believe it should be fixed.

 If you look at the results below:

 1us possily ok:
 for very short intervals i would expect some kind
 of 'reschedule' without actually firing a timer; maybe
 50us are what it takes to do a round through the scheduler ?

 300us   probably ok
 i guess the extra 50-90us are what it takes to do a round
 through the scheduler

 1000us  borderline (this is the case for poll and kqueue, which are
 rounded to 1ms)
 here intervals seem to be increased by 10%, and i cannot see
 a good reason for this (more below).

 3000us and above: wrong
 here again, the intervals seem to be 10% larger than what is
 requested, perhaps limiting the error to 10-20ms.


 Maybe the 10% extension results from creating a default 'precision'
 for legacy calls, but i do not think this is done correctly.

 First of all, if users do not specify a precision themselves, the
 automatically generated value should never exceed one tick.

 Second, the only point of a 'precision' parameter is to merge
 requests that may be close in time, so if there is already a
 timer scheduled within [Treq, Treq+precision] i will get it;
 but if there no pending timer, then one should schedule it
 for the requested interval.

 Davide/Alexander, any ideas ?


The first question is what couple timecounter/eventtimer is used.
Some hardware is slower than other, and this may affect numbers.

Thanks for testing, BTW.

Davide

 cheers
 luigi

 for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
 done


 With calloutng_12_26.patch...

 HZ=100   HZ=250   HZ=1000
 --   
 select  1 55.79  1 50.96  1 61.32
 poll1   1109.46  1   1107.86  1   1114.38
 usleep  1 56.33  1 72.90  1 62.78
 nanosleep   1 52.66  1 55.23  1 64.23
 kqueue  1   1114.23  1   1113.81  1   1121.21
 kqueueto1 65.44  1 71.00  1 75.01
 syscall 1  4.70  1  4.45  1  4.55
 select300355.79300357.76300362.35
 poll  300   1107.85300   1122.55300   1115.62
 usleep300355.28300357.28300360.79
 nanosleep 300354.49300355.82300360.62
 kqueue300   1112.57300   1118.13300   1117.16
 kqueueto  300375.98300378.62300395.61
 syscall   300  4.41300  4.45300  4.54
 select   3000   3246.75   3000   3246.74   3000   3252.72
 poll 3000   3238.10   3000   3229.12   3000   3250.10
 usleep   3000   3242.47   3000   3237.06   3000   3249.61
 nanosleep3000   3238.79   3000   3231.55   3000   3248.11
 kqueue   3000   3240.01   3000   3236.07   3000   3247.60
 kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
 syscall  3000  4.69   3000  4.44   3000  4.50
 select  3  31714.60  3  31941.17  3  32467.69
 poll3  31522.76  3  31983.00  3  32497.81
 usleep  3  

Re: [RFC/RFT] calloutng

2012-12-31 Thread Alexander Motin

On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.


I've rewritten kqueue timeouts at the calloutng_12_26.patch.


Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at 
1us is probably sum of minimal latency of the hardware eventtimer on the 
specific platform and some software processing overhead (syscall, 
callout, timecouters, scheduler, etc). At later points system starts to 
noticeably use precision specified by kern.timecounter.alloweddeviation 
sysctl. It affects results from two sides: 1) extending intervals for 
specified percent of time to allow event aggregation, and 2) choosing 
time base between fast getbinuptime() and precise binuptime(). Extending 
interval is needed to aggregate not only callouts with each other, but 
also callouts with other system events, which are impossible to schedule 
in advance. It gives specified relative error, but no more then one CPU 
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) 
it is 1/hz, for completely idle one it can be up to 0.5s. Second point 
allows to reduce processing overhead by the cost of error up to 1/hz for 
long periods ((100/allowed)*(1/hz)), when it is used.


To get best possible precision kern.timecounter.alloweddeviation sysctl 
can be set to smaller value. Setting it to 0 will effectively disable 
all optimizations, but should give 50us precision in all cases.



for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
done


With calloutng_12_26.patch...

 HZ=100   HZ=250   HZ=1000
--   
select  1 55.79  1 50.96  1 61.32
poll1   1109.46  1   1107.86  1   1114.38
usleep  1 56.33  1 72.90  1 62.78
nanosleep   1 52.66  1 55.23  1 64.23
kqueue  1   1114.23  1   1113.81  1   1121.21
kqueueto1 65.44  1 71.00  1 75.01
syscall 1  4.70  1  4.45  1  4.55
select300355.79300357.76 

Re: [RFC/RFT] calloutng

2012-12-31 Thread Ian Lepore
On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote:
 On 31.12.2012 08:17, Luigi Rizzo wrote:
  On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
  ...
  I grabbed testsleep.c to test an arm event timer implementation, and had
  to fix a couple nits... kqueueto was missing from the names[] array, and
  I had to add a * 1000 to a couple places where usec was stuffed into a
  timespec's tv_nsec.
 
  I also tested the calloutng_12_17 patches and the kqueue stuff behaved
  very strangely.
 
 I've rewritten kqueue timeouts at the calloutng_12_26.patch.
 
  Then I noticed you had a 12_26 patchset so I tested
  that (after crudely fixing a couple uninitialized var warnings), and it
  all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
  It's so sweet to be able to do precision sleeps.
 
 Thank you for testing, Ian.
 
  interesting numbers, but there seems to be some problem in computing
  the exact interval; delays are much larger than expected.
 
  In this test, the original timer code used to round to the next multiple
  of 1 tick and then add another tick (except for the kqueue case),
  which is exactly what you see in the second set of measurements.
 
  The calloutng code however seems to do something odd:
  in addition to fixed overhead (some 50us, which you can see in
  the tests for 1us and 300us), all delay seem to be ~10% larger
  than what is requested, upper bounded to 10ms (note, the
  numbers are averages so i cannot tell whether all samples are
  the same or there is some distribution of values).
 
  I am not sure if this error is peculiar of the ARM version or also
  appears on x86/amd64 but I believe it should be fixed.
 
  If you look at the results below:
 
  1us possily ok:
  for very short intervals i would expect some kind
  of 'reschedule' without actually firing a timer; maybe
  50us are what it takes to do a round through the scheduler ?
 
  300us   probably ok
  i guess the extra 50-90us are what it takes to do a round
  through the scheduler
 
  1000us  borderline (this is the case for poll and kqueue, which are
  rounded to 1ms)
  here intervals seem to be increased by 10%, and i cannot see
  a good reason for this (more below).
 
  3000us and above: wrong
  here again, the intervals seem to be 10% larger than what is
  requested, perhaps limiting the error to 10-20ms.
 
 
  Maybe the 10% extension results from creating a default 'precision'
  for legacy calls, but i do not think this is done correctly.
 
  First of all, if users do not specify a precision themselves, the
  automatically generated value should never exceed one tick.
 
  Second, the only point of a 'precision' parameter is to merge
  requests that may be close in time, so if there is already a
  timer scheduled within [Treq, Treq+precision] i will get it;
  but if there no pending timer, then one should schedule it
  for the requested interval.
 
  Davide/Alexander, any ideas ?
 
 All mentioned effects could be explained with implemented logic. 50us at 
 1us is probably sum of minimal latency of the hardware eventtimer on the 
 specific platform and some software processing overhead (syscall, 
 callout, timecouters, scheduler, etc). At later points system starts to 
 noticeably use precision specified by kern.timecounter.alloweddeviation 
 sysctl. It affects results from two sides: 1) extending intervals for 
 specified percent of time to allow event aggregation, and 2) choosing 
 time base between fast getbinuptime() and precise binuptime(). Extending 
 interval is needed to aggregate not only callouts with each other, but 
 also callouts with other system events, which are impossible to schedule 
 in advance. It gives specified relative error, but no more then one CPU 
 wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) 
 it is 1/hz, for completely idle one it can be up to 0.5s. Second point 
 allows to reduce processing overhead by the cost of error up to 1/hz for 
 long periods ((100/allowed)*(1/hz)), when it is used.
 
 To get best possible precision kern.timecounter.alloweddeviation sysctl 
 can be set to smaller value. Setting it to 0 will effectively disable 
 all optimizations, but should give 50us precision in all cases.
 
  for t in 1 300 3000 3 30 ; do
 for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
   ./testsleep $t $m
 done
  done
 
  [test results snipped]
 

I should have posted some information about the test platform...  It's a
single-processor 700mhz arm chip.  There was essentially nothing else
running during the tests other than mostly-idle daemons (sshd, ntpd, the
usual suspects).  Kernel debugging options off (INVARIANTS[_SUPPORT],
DIAGNOSTIC, and WITNESS).  

Some sysctl values of interest...

rpi# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100)

Re: [RFC/RFT] calloutng

2012-12-31 Thread Alexander Motin

On 31.12.2012 17:02, Ian Lepore wrote:

On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote:

On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.


I've rewritten kqueue timeouts at the calloutng_12_26.patch.


Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at
1us is probably sum of minimal latency of the hardware eventtimer on the
specific platform and some software processing overhead (syscall,
callout, timecouters, scheduler, etc). At later points system starts to
noticeably use precision specified by kern.timecounter.alloweddeviation
sysctl. It affects results from two sides: 1) extending intervals for
specified percent of time to allow event aggregation, and 2) choosing
time base between fast getbinuptime() and precise binuptime(). Extending
interval is needed to aggregate not only callouts with each other, but
also callouts with other system events, which are impossible to schedule
in advance. It gives specified relative error, but no more then one CPU
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks)
it is 1/hz, for completely idle one it can be up to 0.5s. Second point
allows to reduce processing overhead by the cost of error up to 1/hz for
long periods ((100/allowed)*(1/hz)), when it is used.

To get best possible precision kern.timecounter.alloweddeviation sysctl
can be set to smaller value. Setting it to 0 will effectively disable
all optimizations, but should give 50us precision in all cases.


for t in 1 300 3000 3 30 ; do
for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
  ./testsleep $t $m
done
done

[test results snipped]




I should have posted some information about the test platform...  It's a
single-processor 700mhz arm chip.  There was essentially nothing else
running during the tests other than mostly-idle daemons (sshd, ntpd, the
usual suspects).  Kernel debugging options off (INVARIANTS[_SUPPORT],
DIAGNOSTIC, and WITNESS).

Some sysctl values of interest...

rpi# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100)
kern.timecounter.hardware: BCM2835 Timecounter
kern.timecounter.alloweddeviation: 5

Re: [RFC/RFT] calloutng

2012-12-30 Thread Ian Lepore
On Wed, 2012-12-26 at 21:24 +0200, Alexander Motin wrote:
 On 26.12.2012 01:21, Marius Strobl wrote:
  On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With several important changes made last time I am going to delay commit
  to HEAD for another week to do more testing. Comments and new test cases
  are welcome. Thanks for staying tuned and commenting.
 
  FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
  try on sparc64 and it at least survives a buildworld there. However,
  with the patched kernels, buildworld times seem to increase slightly but
  reproducible by 1-2% (I only did four runs but typically buildworld
  times are rather stable and don't vary more than a minute for the
  same kernel and source here). Is this an expected trade-off (system
  time as such doesn't seem to increase)?
 
 I don't think build process uses significant number of callouts to 
 affect results directly. I think this additional time could be result of 
 the deeper next event look up, done by the new code, that is practically 
 useless for sparc64, which effectively has no cpu_idle() routine. It 
 wouldn't affect system time and wouldn't show up in any statistics 
 (except PMC or something alike) because it is executed inside timer 
 hardware interrupt handler. If my guess is right, that is a part that 
 probably still could be optimized. I'll look on it. Thanks.
 
  Is there anything specific to test?
 
 Since the most of code is MI, for sparc64 I would mostly look on related 
 MD parts (eventtimers and timecounters) to make sure they are working 
 reliably in more stressful conditions.  I still have some worries about 
 possible deadlock on hardware where IPIs are used to fetch present time 
 from other CPU.
 
 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
 

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.  Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.

-- Ian


for t in 1 300 3000 3 30 ; do
  for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
./testsleep $t $m
  done
done


With calloutng_12_26.patch...

HZ=100   HZ=250   HZ=1000
--   
select  1 55.79  1 50.96  1 61.32
poll1   1109.46  1   1107.86  1   1114.38
usleep  1 56.33  1 72.90  1 62.78
nanosleep   1 52.66  1 55.23  1 64.23
kqueue  1   1114.23  1   1113.81  1   1121.21
kqueueto1 65.44  1 71.00  1 75.01
syscall 1  4.70  1  4.45  1  4.55
select300355.79300357.76300362.35
poll  300   1107.85300   1122.55300   1115.62
usleep300355.28300357.28300360.79
nanosleep 300354.49300355.82300360.62
kqueue300   1112.57300   1118.13300   1117.16
kqueueto  300375.98300378.62300395.61
syscall   300  4.41300  4.45300  4.54
select   3000   3246.75   3000   3246.74   3000   3252.72
poll 3000   3238.10   3000   3229.12   3000   3250.10
usleep   3000   3242.47   3000   3237.06   3000   3249.61
nanosleep3000   3238.79   3000   3231.55   3000   3248.11
kqueue   3000   3240.01   3000   3236.07   3000   3247.60
kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
syscall  3000  4.69   3000  4.44   3000  4.50
select  3  31714.60  3  31941.17  3  32467.69
poll3  31522.76  3  31983.00  3  32497.81
usleep  3  31459.67  3  31980.76  3  32458.71
nanosleep   

Re: [RFC/RFT] calloutng

2012-12-30 Thread Luigi Rizzo
On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...
 I grabbed testsleep.c to test an arm event timer implementation, and had
 to fix a couple nits... kqueueto was missing from the names[] array, and
 I had to add a * 1000 to a couple places where usec was stuffed into a
 timespec's tv_nsec.
 
 I also tested the calloutng_12_17 patches and the kqueue stuff behaved
 very strangely.  Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
 It's so sweet to be able to do precision sleeps.

interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?

cheers
luigi

 for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
 done
 
 
 With calloutng_12_26.patch...
 
 HZ=100   HZ=250   HZ=1000
 --   
 select  1 55.79  1 50.96  1 61.32
 poll1   1109.46  1   1107.86  1   1114.38
 usleep  1 56.33  1 72.90  1 62.78
 nanosleep   1 52.66  1 55.23  1 64.23
 kqueue  1   1114.23  1   1113.81  1   1121.21
 kqueueto1 65.44  1 71.00  1 75.01
 syscall 1  4.70  1  4.45  1  4.55
 select300355.79300357.76300362.35
 poll  300   1107.85300   1122.55300   1115.62
 usleep300355.28300357.28300360.79
 nanosleep 300354.49300355.82300360.62
 kqueue300   1112.57300   1118.13300   1117.16
 kqueueto  300375.98300378.62300395.61
 syscall   300  4.41300  4.45300  4.54
 select   3000   3246.75   3000   3246.74   3000   3252.72
 poll 3000   3238.10   3000   3229.12   3000   3250.10
 usleep   3000   3242.47   3000   3237.06   3000   3249.61
 nanosleep3000   3238.79   3000   3231.55   3000   3248.11
 kqueue   3000   3240.01   3000   3236.07   3000   3247.60
 kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
 syscall  3000  4.69   3000  4.44   3000  4.50
 select  3  31714.60  3  31941.17  3  32467.69
 poll3  31522.76  3  31983.00  3  32497.81
 usleep  3  31459.67  3  31980.76  3  32458.71
 nanosleep   3  31431.02  3  31982.22  3  32525.20
 kqueue  3  31466.75  3  31873.90  3  31973.54
 kqueueto3  31564.67  3  32522.35  3  32475.59
 syscall

Re: [RFC/RFT] calloutng

2012-12-26 Thread Alexander Motin

On 26.12.2012 01:21, Marius Strobl wrote:

On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:

Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch

With several important changes made last time I am going to delay commit
to HEAD for another week to do more testing. Comments and new test cases
are welcome. Thanks for staying tuned and commenting.


FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
try on sparc64 and it at least survives a buildworld there. However,
with the patched kernels, buildworld times seem to increase slightly but
reproducible by 1-2% (I only did four runs but typically buildworld
times are rather stable and don't vary more than a minute for the
same kernel and source here). Is this an expected trade-off (system
time as such doesn't seem to increase)?


I don't think build process uses significant number of callouts to 
affect results directly. I think this additional time could be result of 
the deeper next event look up, done by the new code, that is practically 
useless for sparc64, which effectively has no cpu_idle() routine. It 
wouldn't affect system time and wouldn't show up in any statistics 
(except PMC or something alike) because it is executed inside timer 
hardware interrupt handler. If my guess is right, that is a part that 
probably still could be optimized. I'll look on it. Thanks.



Is there anything specific to test?


Since the most of code is MI, for sparc64 I would mostly look on related 
MD parts (eventtimers and timecounters) to make sure they are working 
reliably in more stressful conditions.  I still have some worries about 
possible deadlock on hardware where IPIs are used to fetch present time 
from other CPU.


Here is small tool we are using for test correctness and performance of 
different user-level APIs: http://people.freebsd.org/~mav/testsleep.c


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-25 Thread Marius Strobl
On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
 Experiments with dummynet shown ineffective support for very short 
 tick-based callouts. New version fixes that, allowing to get as many 
 tick-based callout events as hz value permits, while still be able to 
 aggregate events and generating minimum of interrupts.
 
 Also this version modifies system load average calculation to fix some 
 cases existing in HEAD and 9 branches, that could be fixed with new 
 direct callout functionality.
 
 http://people.freebsd.org/~mav/calloutng_12_17.patch
 
 With several important changes made last time I am going to delay commit 
 to HEAD for another week to do more testing. Comments and new test cases 
 are welcome. Thanks for staying tuned and commenting.

FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
try on sparc64 and it at least survives a buildworld there. However,
with the patched kernels, buildworld times seem to increase slightly but
reproducible by 1-2% (I only did four runs but typically buildworld
times are rather stable and don't vary more than a minute for the
same kernel and source here). Is this an expected trade-off (system
time as such doesn't seem to increase)?
Is there anything specific to test?

Marius

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-22 Thread Jilles Tjoelker
On Wed, Dec 19, 2012 at 11:00:06AM +, Poul-Henning Kamp wrote:
 
 In message 50d192e8.3020...@freebsd.org, Alexander Motin writes:

 Linux uses 32.32 format in their eventtimers code.

 (And that is no accident, I know who they got the number from :-)

 But if at some point we want to be able to 
 handle absolute wall time, [...]

 Then you have other problems, including but not limited to clock
 being stepped, leap-seconds, suspend/resume and frequency stability.

 If you want to support callouts of the type (At 14:00 UTC tomorrow)
 (disregarding the time-zone issue), you need to catch all significant
 changes to our UTC estimate and recalibrate your callout based on
 that.

 It is not obvious that we have applications for such an API that
 warrant the complexity.

 Either way, such a facility should be layered on top of the callout
 facility, which should always run in elapsed time[1] with no attention
 paid to what NTPD might do to the UTC estimate.

POSIX specifies functions that assume such a facility exists, although
applications may not care much if we implement them incorrectly.

For example, pthread_mutex_timedlock() and sem_timedwait() shall time
out when the CLOCK_REALTIME clock reaches the given value, and
pthread_cond_timedwait() and clock_nanosleep() (with TIMER_ABSTIME)
shall time out when the specified clock reaches the given value.

 So summary: 32.32 is the right format.

 Poul-Henning

 [1] Notice that elapsed time needs a firm definition with respect
 to suspend/resume, and that this decision has big implications
 for the API use and code duplication.

 I think it prudent to specify a flag to callouts, to tell what
 should happen on suspend/resume, something like:

   SR_CANCEL   /* Cancel the callout on S/R */
   /* no flag* /* Toll this callout only when system is running */
   SR_IGNORE   /* Toll suspended time from callout */

 If you get this right, callouts from device drivers will just DTRT,
 if you get it wrong, all device drivers will need boilerplate code
 to handle S/R

Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC
vs CLOCK_UPTIME.

-- 
Jilles Tjoelker
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-22 Thread Poul-Henning Kamp

In message 2012125025.ga46...@stack.nl, Jilles Tjoelker writes:

 Either way, such a facility should be layered on top of the callout
 facility, which should always run in elapsed time[1] with no attention
 paid to what NTPD might do to the UTC estimate.

POSIX specifies functions that assume such a facility exists, although
applications may not care much if we implement them incorrectly.

It should still be implemented op top of callouts, not as part of:
it is an entirely different thing to try to do right.

 I think it prudent to specify a flag to callouts, to tell what
 should happen on suspend/resume, something like:

  SR_CANCEL   /* Cancel the callout on S/R */
  /* no flag* /* Toll this callout only when system is running */
  SR_IGNORE   /* Toll suspended time from callout */

 If you get this right, callouts from device drivers will just DTRT,
 if you get it wrong, all device drivers will need boilerplate code
 to handle S/R

Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC
vs CLOCK_UPTIME.

I have _no_ idea what you are trying to say here...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-21 Thread Fabian Keil
Fabian Keil freebsd-lis...@fabiankeil.de wrote:

 Alexander Motin m...@freebsd.org wrote:
 
  On 20.12.2012 15:26, Fabian Keil wrote:
   Alexander Motin m...@freebsd.org wrote:
  
   On 20.12.2012 12:56, Fabian Keil wrote:
   Alexander Motin m...@freebsd.org wrote:
  
   Experiments with dummynet shown ineffective support for very short
   tick-based callouts. New version fixes that, allowing to get as many
   tick-based callout events as hz value permits, while still be able to
   aggregate events and generating minimum of interrupts.
  
   Also this version modifies system load average calculation to fix some
   cases existing in HEAD and 9 branches, that could be fixed with new
   direct callout functionality.
  
   http://people.freebsd.org/~mav/calloutng_12_17.patch
  
   With this patch (and the previous one, I didn't test the others)
   my mouse cursor is occasionally not reacting for short amounts of
   time (less than a second, but long enough to be noticeable).
 
   Could you try to revert part of the patch, related to dev/atkbdc? I am
   not strong in details of that hardware, but in comments there mention
   that they are related. May be lost keyboard interrupts (which polling
   rate was increased to 1 second) cause PS/2 mouse delays.
  
   I reverted the changes to sys/dev/atkbdc/* about an hour ago
   and so far it's looking good. I'll report back tomorrow after
   some more testing.

Still looking good.

Fabian


signature.asc
Description: PGP signature


Re: [RFC/RFT] calloutng

2012-12-21 Thread Alexander Motin

On 21.12.2012 14:16, Fabian Keil wrote:

Fabian Keil freebsd-lis...@fabiankeil.de wrote:


Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 15:26, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).



Could you try to revert part of the patch, related to dev/atkbdc? I am
not strong in details of that hardware, but in comments there mention
that they are related. May be lost keyboard interrupts (which polling
rate was increased to 1 second) cause PS/2 mouse delays.


I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.


Still looking good.


Thank you. I think it may be some locking issue in atkbdc code. I'll 
revert that part of the patch until somebody rewrite it properly.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil
Alexander Motin m...@freebsd.org wrote:

 Experiments with dummynet shown ineffective support for very short 
 tick-based callouts. New version fixes that, allowing to get as many 
 tick-based callout events as hz value permits, while still be able to 
 aggregate events and generating minimum of interrupts.
 
 Also this version modifies system load average calculation to fix some 
 cases existing in HEAD and 9 branches, that could be fixed with new 
 direct callout functionality.
 
 http://people.freebsd.org/~mav/calloutng_12_17.patch

With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.

I'm not sure what additional information is necessary to debug this,
so here a bunch of sysctl values that may or may not be relevant:

fk@r500 ~ $sysctl kern.eventtimer kern.timecounter
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 440
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 440
kern.eventtimer.et.HPET3.flags: 3
kern.eventtimer.et.HPET3.frequency: 14318180
kern.eventtimer.et.HPET3.quality: 440
kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) HPET3(440) i8254(100)
kern.eventtimer.singlemul: 2
kern.eventtimer.idletick: 0
kern.eventtimer.activetick: 1
kern.eventtimer.timer: HPET
kern.eventtimer.periodic: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 25970
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.HPET.counter: 3963519587
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 7323739
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 454465294
kern.timecounter.tc.TSC.frequency: 1995040520
kern.timecounter.tc.TSC.quality: -1000
kern.timecounter.stepwarnings: 0
kern.timecounter.hardware: HPET
kern.timecounter.choice: TSC(-1000) ACPI-fast(900) HPET(950) i8254(0) 
dummy(-100)
kern.timecounter.tick: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.smp_tsc: 0

The system is a Lenovo R500 with a
Intel(R) Core(TM)2 Duo CPU T5870  @ 2.00GHz (1995.04-MHz K8-class CPU)

Fabian


signature.asc
Description: PGP signature


Re: [RFC/RFT] calloutng

2012-12-20 Thread Alexander Motin

On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.


Could you try to revert part of the patch, related to dev/atkbdc? I am 
not strong in details of that hardware, but in comments there mention 
that they are related. May be lost keyboard interrupts (which polling 
rate was increased to 1 second) cause PS/2 mouse delays.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil
Alexander Motin m...@freebsd.org wrote:

 On 20.12.2012 12:56, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With this patch (and the previous one, I didn't test the others)
  my mouse cursor is occasionally not reacting for short amounts of
  time (less than a second, but long enough to be noticeable).
 
  Every now and then the window manager (i3-wm) changes window focus
  which could be explained by either phantom keyboard or mouse input,
  or terminal lines are marked as if the cursor was moved with the
  left button pressed.
 
  The problems happen a couple of times per hour but I haven't
  been able to intentionally reproduce them. They only seem to
  occur while I'm moving the cursor, but of course I wouldn't
  otherwise notice a unresponsive cursor anyway.
 
  While the cursor is unresponsive, keyboard input and the rest
  of the system works as expected as far as I can tell.
 
  If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
  message once per second when not moving the mouse, however that
  also happens without the patch and thus might be unrelated.
 
  I'm using moused.
 
 Could you try to revert part of the patch, related to dev/atkbdc? I am 
 not strong in details of that hardware, but in comments there mention 
 that they are related. May be lost keyboard interrupts (which polling 
 rate was increased to 1 second) cause PS/2 mouse delays.
 
I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.

Fabian


signature.asc
Description: PGP signature


Re: [RFC/RFT] calloutng

2012-12-20 Thread Alexander Motin

On 20.12.2012 15:26, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.


Could you try to revert part of the patch, related to dev/atkbdc? I am
not strong in details of that hardware, but in comments there mention
that they are related. May be lost keyboard interrupts (which polling
rate was increased to 1 second) cause PS/2 mouse delays.


I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.


Thank you for the report. If it will be fine. you can try to reapply 
that part of the patch, just changing line:

callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0));
to the:
callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0));

It should about to restore original polling interval, but still make it 
more flexible then original.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil
Alexander Motin m...@freebsd.org wrote:

 On 20.12.2012 15:26, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  On 20.12.2012 12:56, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With this patch (and the previous one, I didn't test the others)
  my mouse cursor is occasionally not reacting for short amounts of
  time (less than a second, but long enough to be noticeable).

  Could you try to revert part of the patch, related to dev/atkbdc? I am
  not strong in details of that hardware, but in comments there mention
  that they are related. May be lost keyboard interrupts (which polling
  rate was increased to 1 second) cause PS/2 mouse delays.
 
  I reverted the changes to sys/dev/atkbdc/* about an hour ago
  and so far it's looking good. I'll report back tomorrow after
  some more testing.

 Thank you for the report. If it will be fine. you can try to reapply 
 that part of the patch, just changing line:
 callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0));
 to the:
 callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0));
 
 It should about to restore original polling interval, but still make it 
 more flexible then original.

With this change the mouse stops working completely after moving
the cursor a couple of pixels, at which point dmesg shows:
 
kbdc: TEST_AUX_PORT status:
kbdc: RESET_AUX return code:00fa
kbdc: RESET_AUX status:00aa
kbdc: RESET_AUX ID:

Using the keyboard frequently causes duplicated characters.

I'm aware of a problem in vanilla CURRENT where the mouse stops working
with similar looking messages when the cursor is being moved at the wrong
moment, but this happens less than once a week and I haven't been able to
track it down yet. It's possible that it's the same bug and your patch just
triggers it more reliable (two times in a row). I haven't seen the
duplicated-characters issue before, though.

Just for kicks I also tried:

callout_reset_flags(sc-callout, hz/100, atkbdtimeout, dev, C_PRELSET(0));

but with this modification I can't even enter my login password
and thus can't tell if the mouse would work.

Fabian


signature.asc
Description: PGP signature


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin

On 19.12.2012 01:37, Ian Lepore wrote:

On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:

On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:

On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.


...

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.  It would
typical to want a 500uS timeout but be willing to late by up to 250uS if


i said k is signed so negative values represent fractions of a
second. 2^-128 is pretty short :)


Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
further review, then could precision be encoded in 8 bits of the flags,
eliminating another parm?


Those who tracked the branch could see that actually was our first 
approach to handle precision. Unfortunately, it appeared not very 
convenient when you need to convert relative precision in percents or 
fraction of interval to the absolute precision in power of 2. We were 
worried that using some ffsll() for it can be inconvenient or expensive. 
But since we are now talking about passing relative bintime as an 
argument, that may be more viable option. I'll make another try.


Thanks for the input. Pity it didn't happen couple of months ago.

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes
:
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.

It is a bad idea, and it is physically pointless, given the stabilities
of the timebases available for computers in general.

Please just take my word as a time-nut, and use a 32.32 binary format
in seconds (see previous email) and be done with it.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano
On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote:
 
 In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore 
 writes
 :
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.

 It is a bad idea, and it is physically pointless, given the stabilities
 of the timebases available for computers in general.

 Please just take my word as a time-nut, and use a 32.32 binary format
 in seconds (see previous email) and be done with it.

 --
 Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
 p...@freebsd.org | TCP/IP since RFC 956
 FreeBSD committer   | BSD since 4.3-tahoe
 Never attribute to malice what can adequately be explained by incompetence.

Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.
I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code
duplication for all the basic conversion/math operation. On the other
hand, 32.32 may not be enough in the long future.
What do you think about that?

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin

On 19.12.2012 12:03, Davide Italiano wrote:

On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote:


In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes
:

On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:



I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.


It is a bad idea, and it is physically pointless, given the stabilities
of the timebases available for computers in general.

Please just take my word as a time-nut, and use a 32.32 binary format
in seconds (see previous email) and be done with it.


Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.
I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code
duplication for all the basic conversion/math operation. On the other
hand, 32.32 may not be enough in the long future.
What do you think about that?


Linux uses 32.32 format in their eventtimers code. Respecting that now 
we have no any timer hardware with frequency about 4GHz, that precision 
is probably sufficient. But if at some point we want to be able to 
handle absolute wall time, then 32bit integer part may be not enough. 
Then we return to the question: how many different data types do we 
want to see in one subsystem? Sure, using single 64bit value would be 
much easier then struct bintime from many perspectives, but what's about 
the edge cases?


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
, Davide Italiano writes:

Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.

And that is way overkill for specifying a callout, at best your clock
has short term stabilities approaching 1e-8, but likely as bad as 1e-6.

(The reason why bintime is important for timekeeping is that we
accumulate timeintervals approx 1e3 times a second, so the rounding
error has to be much smaller than the short term stability in order
to not dominate)

I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code

No, that was exactly my point:  It should be an integer so that
comparisons and arithmetic is trivial.   A 32.32 format fits
nicely into a int64_t which is readily available in the language.

As I said in my previous email:


typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or 
callout_foo(500 * DURNSEC)

With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 50d192e8.3020...@freebsd.org, Alexander Motin writes:

Linux uses 32.32 format in their eventtimers code.

(And that is no accident, I know who they got the number from :-)

But if at some point we want to be able to 
handle absolute wall time, [...]

Then you have other problems, including but not limited to clock
being stepped, leap-seconds, suspend/resume and frequency stability.

If you want to support callouts of the type (At 14:00 UTC tomorrow)
(disregarding the time-zone issue), you need to catch all significant
changes to our UTC estimate and recalibrate your callout based on
that.

It is not obvious that we have applications for such an API that
warrant the complexity.

Either way, such a facility should be layered on top of the callout
facility, which should always run in elapsed time[1] with no attention
paid to what NTPD might do to the UTC estimate.

So summary: 32.32 is the right format.

Poul-Henning

[1] Notice that elapsed time needs a firm definition with respect
to suspend/resume, and that this decision has big implications
for the API use and code duplication.

I think it prudent to specify a flag to callouts, to tell what
should happen on suspend/resume, something like:

SR_CANCEL   /* Cancel the callout on S/R */
/* no flag* /* Toll this callout only when system is running */
SR_IGNORE   /* Toll suspended time from callout */

If you get this right, callouts from device drivers will just DTRT,
if you get it wrong, all device drivers will need boilerplate code
to handle S/R

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
, Davide Italiano writes:


Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.


And that is way overkill for specifying a callout, at best your clock
has short term stabilities approaching 1e-8, but likely as bad as 1e-6.


So you always agreed with me that bintimes are unsuitable for almost
everything, and especially unsuitable for timeouts? :-)


(The reason why bintime is important for timekeeping is that we
accumulate timeintervals approx 1e3 times a second, so the rounding
error has to be much smaller than the short term stability in order
to not dominate)


bintimes are not unsuitable for timekeeping, but they a painful to use
for other APIs.  You have to either put bintimes in layers in the other
APIs, or convert them to a more suitable format, and there is a problem
placing the conversion at points where it is efficient.  This thread
seems to be mostly about putting the conversion in wrong places.  My
original objection was about using bintimes for almost everything at
the implementation level.


I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code


No, that was exactly my point:  It should be an integer so that
comparisons and arithmetic is trivial.   A 32.32 format fits
nicely into a int64_t which is readily available in the language.


I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


As I said in my previous email:

   typedef dur_t   int64_t;/* signed for bug catching */
   #define DURSEC  ((dur_t)1  32)
   #define DURMIN  (DURSEC * 60)
   #define DURMSEC (DURSEC / 1000)
   #define DURUSEC (DURSEC / 1000)
   #define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


Constructing the cookie for my special case would not be so easy.


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes:

 With this format you can specify callouts 68 years into the future
 with quarter nanosecond resolution, and you can trivially and
 efficiently compare dur_t's with
  if (d1  d2)

This would make a better general format than timevals, timespecs and
of course bintimes :-).

Except that for absolute timescales, we're running out of the 32 bits
integer part.

Bintimes is a necessary superset of the 32.32 which tries to work
around the necessary but missing int96_t or int128_t[1].

Poul-Henning


[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano
On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:
 On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:

 
 In message
 CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
 , Davide Italiano writes:

 Right now -- the precision is specified in 'bintime', which is a binary
 number.
 It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
 the specific platform.


 And that is way overkill for specifying a callout, at best your clock
 has short term stabilities approaching 1e-8, but likely as bad as 1e-6.


 So you always agreed with me that bintimes are unsuitable for almost
 everything, and especially unsuitable for timeouts? :-)


 (The reason why bintime is important for timekeeping is that we
 accumulate timeintervals approx 1e3 times a second, so the rounding
 error has to be much smaller than the short term stability in order
 to not dominate)


 bintimes are not unsuitable for timekeeping, but they a painful to use
 for other APIs.  You have to either put bintimes in layers in the other
 APIs, or convert them to a more suitable format, and there is a problem
 placing the conversion at points where it is efficient.  This thread
 seems to be mostly about putting the conversion in wrong places.  My
 original objection was about using bintimes for almost everything at
 the implementation level.


 I do not really think it worth to create another structure for
 handling time (e.g. struct bintime32), as it will lead to code


 No, that was exactly my point:  It should be an integer so that
 comparisons and arithmetic is trivial.   A 32.32 format fits
 nicely into a int64_t which is readily available in the language.


 I would have tried a 32 bit format with a variable named 'ticks'.
 Something like:
 - ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
   this.  The tick period would be constant but for virtual ticks and
   not too small.  hz = 1000 now makes the period too small, and not a
   power of 2.  So make the period 1/128 second.  This gives a 1.24.7
   binary format.  2**24 seconds is 194 days.
 - ticks  0.  The 31 value bits are now a cookie (descriptor) referring
   to a bintime or whatever.  This case should rarely be used.  I don't
   like it that a tickless kernel, which is needed mainly for power
   saving, has expanded into complications to support short timeouts
   which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


 As I said in my previous email:

typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

 (Bikeshed the names at your convenience)

 Then you can say

 callout_foo(34 * DURSEC)
 callout_foo(2400 * DURMSEC)
 or
 callout_foo(500 * DURNSEC)


 Constructing the cookie for my special case would not be so easy.


 With this format you can specify callouts 68 years into the future
 with quarter nanosecond resolution, and you can trivially and
 efficiently compare dur_t's with
 if (d1  d2)


 This would make a better general format than timevals, timespecs and
 of course bintimes :-).  It is a bit wasteful for timeouts since
 its extremes are rarely used.  Malicious and broken callers can
 still cause overflow at 68 years, so you have to check for it and
 handle it.  The limit of 194 days is just as good for timeouts.

 Bruce

I think the phk's proposal  is better. About your overflow objection,
I think is really unlikely to happen, but better safe than sorry.

Thanks

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes:
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:

 Except that for absolute timescales, we're running out of the 32 bits
 integer part.

Except 32 bit time_t works until 2106 if it is unsigned.

That's sort of not an option.

The real problem was that time_t was not defined as a floating
point number.

 [1] A good addition to C would be a general multi-word integer type
 where you could ask for any int%d_t or uint%d_t you cared for, and
 have the compiler DTRT.  In difference from using a multiword-library,
 this would still give these types their natural integer behaviour.

That would be convenient, but bad for efficiency if it were actually
used much.

You can say that about anything but CPU-native operations, and I doubt
it would be as inefficient as struct bintime, which does not have access
to the carry bit.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes:


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).


Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


Bintimes is a necessary superset of the 32.32 which tries to work
around the necessary but missing int96_t or int128_t[1].

[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Luigi Rizzo
On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote:
 
 In message 
 CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
 , Davide Italiano writes:
 
 Right now -- the precision is specified in 'bintime', which is a binary 
 number.
 It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
 the specific platform.
 
 And that is way overkill for specifying a callout, at best your clock
 has short term stabilities approaching 1e-8, but likely as bad as 1e-6.
 
 (The reason why bintime is important for timekeeping is that we
 accumulate timeintervals approx 1e3 times a second, so the rounding
 error has to be much smaller than the short term stability in order
 to not dominate)
 
 I do not really think it worth to create another structure for
 handling time (e.g. struct bintime32), as it will lead to code
 
 No, that was exactly my point:  It should be an integer so that
 comparisons and arithmetic is trivial.   A 32.32 format fits
 nicely into a int64_t which is readily available in the language.
 
 As I said in my previous email:
 
 
 typedef dur_t   int64_t;/* signed for bug catching */
 #define DURSEC  ((dur_t)1  32)
 #define DURMIN  (DURSEC * 60)
 #define DURMSEC (DURSEC / 1000)
 #define DURUSEC (DURSEC / 1000)
 #define DURNSEC (DURSEC / 100)
 
 (Bikeshed the names at your convenience)
 
 Then you can say
 
   callout_foo(34 * DURSEC)
   callout_foo(2400 * DURMSEC)
 or 
   callout_foo(500 * DURNSEC)

only thing, we must be careful with the parentheses

For instance, in your macro, DURNSEC evaluates to 0 and so
does any multiple of it.
We should define them as

#define DURNSEC DURSEC / 100
...

so DURNSEC is still 0 and 500*DURNSEC gives 214

I am curious that Bruce did not mention this :)

(btw the typedef is swapped, should be typedef int64_t dur_t)

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 20121219150809.ga98...@onelab2.iet.unipi.it, Luigi Rizzo writes:

 typedef dur_t   int64_t;/* signed for bug catching */
 #define DURSEC  ((dur_t)1  32)
 #define DURMIN  (DURSEC * 60)
 #define DURMSEC (DURSEC / 1000)
 #define DURUSEC (DURSEC / 1000)
 #define DURNSEC (DURSEC / 100)


only thing, we must be careful with the parentheses

Actually, it's more impportant to be careful with zeros, if you
adjust the above to the correct number of zeros, DURNSEC is 4,
which is within seven percent of the correct value.

(btw the typedef is swapped, should be typedef int64_t dur_t)

Yes, I'm trying to find out of people even listen to me :-)


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin

On 19.12.2012 16:20, Bruce Evans wrote:

On Wed, 19 Dec 2012, Davide Italiano wrote:


On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au
wrote:



I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


In the old API, it is an error to pass ticks  0, so only broken old
callers are affected.  Of course, if there are any then it would be
hard to detect their garbage cookies.

Anywy, it's too later to change to this, and maybe also to a 32.32
format.


It would be late to change this after committing. I would definitely 
like it to be done earlier to not redo all the tests, but I think we 
could convert callout and eventtimers code to 32.32 format in several 
days. The only two questions are: do we really want it (won't there be 
any reasons to regret about it) and how do we want it to look?


For the first question my personal showstopper since eventtimers 
creation always was the wish to keep consistency. But benefits of 32.32 
format are clear, and if there are more votes for it, let's consider. If 
now it will be decided that full range will never be useful for callout 
subsystem, then it is obviously not needed for eventtimers also.


About the second question, what do you think about such prototypes:

typedef int64_t sbintime_t

static __inline sbintime_t
bintime_shrink(struct bintime *bt)
{}

static __inline struct bintime
bintime_expand(sbintime_t sbt)
{}

...

int
callout_reset_bt(struct callout *, sbintime_t sbt, sbintime_t pr,
void (*fn)(void *), void *arg, int flags);

, where pr used only for absolute precision, and flags includes: direct 
execution, absolute/relative time in argument, relative precision in 
case of relative sbt, flag for aligning to hardclock() to emulate legacy 
behavior, and potentially flags for reaction on suspend/resume.


Another option is to move absolute precision also to flags, using log2() 
representation, as we tried and as was proposed before. With possibility 
to use precise relative time there will be few cases requiring absolute 
value of precision, that should depend on period. Then there will be no 
extra arguments in the most usual cases.


Wrapper for existing API compatibility will look just like this:

#define callout_reset(c, ticks, fn, arg)\
callout_reset_bt(c, ticks2sbintime(ticks), -1,  \
(fn), (arg), C_HARDCLOCK)

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano
dropping phk _AT_ onelab2 _DOT_ something from CC as long as it
doesn't seem a valid mail address and I'm annoyed mails bounce back.

On Wed, Dec 19, 2012 at 6:20 AM, Bruce Evans b...@optusnet.com.au wrote:
 On Wed, 19 Dec 2012, Davide Italiano wrote:

 On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:


 I would have tried a 32 bit format with a variable named 'ticks'.
 Something like:
 - ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
   this.  The tick period would be constant but for virtual ticks and
   not too small.  hz = 1000 now makes the period too small, and not a
   power of 2.  So make the period 1/128 second.  This gives a 1.24.7
   binary format.  2**24 seconds is 194 days.
 - ticks  0.  The 31 value bits are now a cookie (descriptor) referring
   to a bintime or whatever.  This case should rarely be used.  I don't
   like it that a tickless kernel, which is needed mainly for power
   saving, has expanded into complications to support short timeouts
   which should rarely be used.


 Bruce, I don't really agree with this.
 The data addressed by cookie should be still stored somewhere, and KBI
 will result broken. This, indeed, is not real problem as long as
 current calloutng code heavily breaks KBI, but if that was your point,
 I don't see how your proposed change could help.


 In the old API, it is an error to pass ticks  0, so only broken old
 callers are affected.  Of course, if there are any then it would be
 hard to detect their garbage cookies.

 Anywy, it's too later to change to this, and maybe also to a 32.32
 format.

 [32.32 format]

It's not too late. What I'd like to do, right now people got
interested in the problem is agreeing on the interface used.
Following this thread, as I've already discussed to mav@, we would
like to decide what of the two is better:
- specify precision as additional argument (as we're doing right now)
- use 'flags' argument
If we allow time argument to be relative and not absolute, as
suggested by luigi@, we can easily use relative precision where we had
to use ffl() before.



 This would make a better general format than timevals, timespecs and
 of course bintimes :-).  It is a bit wasteful for timeouts since
 its extremes are rarely used.  Malicious and broken callers can
 still cause overflow at 68 years, so you have to check for it and
 handle it.  The limit of 194 days is just as good for timeouts.


 I think the phk's proposal  is better. About your overflow objection,
 I think is really unlikely to happen, but better safe than sorry.


 It's very easy for applications to cause kernel overflow using valid
 syscall args like tv_sec = TIME_T_MAX for a relative time in
 nanosleep().  Adding TIME_T_MAX to the current time in seconds overflow
 for all current times except for the first second after the Epoch.
 There is no difference between the overflow for 32-bit and 64-bit
 time_t's for this.  This is now mostly handled so that the behaviour is
 harmless although wrong.  E.g., the timeout might become negative,
 and then since it is not a cookie it is silently replaced by a timeout
 of 1 tick.  In nanosleep(), IIRC there are further overflows that result
 in returning early instead of retrying the 1-tick timeouts endlessly.

 Bruce

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Davide Italiano wrote:


On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:



I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


In the old API, it is an error to pass ticks  0, so only broken old
callers are affected.  Of course, if there are any then it would be
hard to detect their garbage cookies.

Anywy, it's too later to change to this, and maybe also to a 32.32
format.

[32.32 format]

This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.


I think the phk's proposal  is better. About your overflow objection,
I think is really unlikely to happen, but better safe than sorry.


It's very easy for applications to cause kernel overflow using valid
syscall args like tv_sec = TIME_T_MAX for a relative time in
nanosleep().  Adding TIME_T_MAX to the current time in seconds overflow
for all current times except for the first second after the Epoch.
There is no difference between the overflow for 32-bit and 64-bit
time_t's for this.  This is now mostly handled so that the behaviour is
harmless although wrong.  E.g., the timeout might become negative,
and then since it is not a cookie it is silently replaced by a timeout
of 1 tick.  In nanosleep(), IIRC there are further overflows that result
in returning early instead of retrying the 1-tick timeouts endlessly.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes:

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


That's sort of not an option.


I think it is.  It is just probably not necessary since 32-bit systems
will go away before 2038.


The real problem was that time_t was not defined as a floating
point number.


That would be convenient too, but bad for efficiency on some systems.
Kernels might not be able to use it, and then would have to use an
alternative representation, which they should have done all along.


[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.


You can say that about anything but CPU-native operations, and I doubt
it would be as inefficient as struct bintime, which does not have access
to the carry bit.


Yes, I would say that about non-native.  It goes against the spirit of C.

OTOH, compilers are getting closer to giving full access to the carry
bit.  I just checked what clang does in a home-made 128-bit add function:

% static void __noinline
% uadd(struct u *xup, struct u *yup)
% {
%   unsigned long long t;
% 
% 	t = xup-w[0] + yup-w[0];

%   if (t  xup-w[0])
%   xup-w[1]++;
%   xup-w[0] = t;
%   xup-w[1] += yup-w[1];
% }
% 
% 	.align	16, 0x90

%   .type   uadd,@function
% uadd:   # @uadd
%   .cfi_startproc
% # BB#0: # %entry
%   movq(%rdi), %rcx
%   movq8(%rdi), %rax
%   addq(%rsi), %rcx

gcc generates an additional cmpq instruction here.

%   jae .LBB2_2

clang uses the carry bit set by the first addition to avoid the comparison,
but still branches.

% # BB#1: # %if.then
%   incq%rax
%   movq%rax, 8(%rdi)

This adds 1 explicitly instead of using adcq, but this is the slow path.

% .LBB2_2:# %if.end
%   movq%rcx, (%rdi)
%   addq8(%rsi), %rax

This is as efficient as possible except for the extra branch, and the
branch is almost perfectly predictable.

%   movq%rax, 8(%rdi)
%   ret
% .Ltmp22:
%   .size   uadd, .Ltmp22-uadd
%   .cfi_endproc

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Poul-Henning Kamp

In message 50d1e0d8.9070...@freebsd.org, Alexander Motin writes:

It would be late to change this after committing. I would definitely 
like it to be done earlier to not redo all the tests, but I think we 
could convert callout and eventtimers code to 32.32 format in several 
days. The only two questions are: do we really want it (won't there be 
any reasons to regret about it) and how do we want it to look?

As much as it pains me to raise this point, we would regret it
if we did not use 32.32, because Linux already went that way.

As much as there is to be said for doing things right, we should
also try to avoid pointless incompatibilities which will make it
needlessly hard for people to move code, particular device drivers
forth and back.


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Bruce Evans

I finally remembered to remove the .it phk :-).

On Wed, 19 Dec 2012, Luigi Rizzo wrote:


On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote:

...
As I said in my previous email:


typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


only thing, we must be careful with the parentheses

For instance, in your macro, DURNSEC evaluates to 0 and so
does any multiple of it.
We should define them as

#define DURNSEC DURSEC / 100
...

so DURNSEC is still 0 and 500*DURNSEC gives 214

I am curious that Bruce did not mention this :)


Er, he was careful.  DURNSEC gives 4, not 0.  This is not very accurate,
but probably good enough.

Your version without parentheses is not so careful and depends on
a magic order of operations and no overflow from this.  E.g.:

500*DURNSEC = 500*DURSEC / 10 = 500*((dur_t)1  32) / 10

This is very accurate and happens not to overflow.  But 5 seconds represented
a little strangely in nanoseconds would overflow:

50*DURNSEC = 50*((dur_t)1  32) / 10

So would 5 billion times DURSEC, but 5 billion seconds is more unreasobable
than 5 billion nanoseconds and the format just can't represent that.



(btw the typedef is swapped, should be typedef int64_t dur_t)


Didn't notice this.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-18 Thread Alexander Motin
Experiments with dummynet shown ineffective support for very short 
tick-based callouts. New version fixes that, allowing to get as many 
tick-based callout events as hz value permits, while still be able to 
aggregate events and generating minimum of interrupts.


Also this version modifies system load average calculation to fix some 
cases existing in HEAD and 9 branches, that could be fixed with new 
direct callout functionality.


http://people.freebsd.org/~mav/calloutng_12_17.patch

With several important changes made last time I am going to delay commit 
to HEAD for another week to do more testing. Comments and new test cases 
are welcome. Thanks for staying tuned and commenting.


On 17.12.2012 01:37, Alexander Motin wrote:

Here is one more version. Unless something new will be found/reported
this may be the last one, because me and Davide are quite satisfied with
the results. If everything will be fine, I think we could commit it to
HEAD closer to the end of the week:
http://people.freebsd.org/~mav/calloutng_12_16.patch

Changes in this version:
  -- Removed couple of redundant variables in callout implementation,
that reduced sizeof(struct callout) by two pointers and simplified some
internal code.
  -- syscons driver was made to schedule only 1-2 callouts per second
instead of 20-30 before when console is in graphical mode and there are
few other things to do. Now my laptop has only about 30 interrupts per
second total during idle periods with X running.
  -- i8254 eventtimer driver was optimized to work faster in disabled by
default one-shot mode.
  -- Few kernel functions were added to make KPIs more complete.
  -- Man pages were updated.
  -- Some style fixes were made.

On 15.12.2012 18:55, Alexander Motin wrote:

I'm sorry to interrupt review, but as usual good ideas came during the
final testing, causing another round. :)  Here is updated patch for
HEAD, that includes several new changes:
http://people.freebsd.org/~mav/calloutng_12_15.patch

The new changes are:
  -- Precision and event aggregation code was reworked. Instead of
previous -prec/+prec representation, precision is now single-sided --
-0/+prec. It allowed to significantly improve precision on long time
intervals for APIs which imply that event should not happen before the
specified time. Depending on CPU activity, mistake for long time
intervals now will never be more then 1-500ms, even if specified
precision allows more.
  -- Some minor optimizations were made to reduce callout overhead and
latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer
and TSC timecounter usleep(1) call from user-level executes in just
5-6us, instead of 7-8us before. Now it can do 180K cycles per second on
single CPU with only partial CPU load.
  -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd,
setrlimit) were modified to reduce number of interrupts, also with event
aggregation by explicit specification of the acceptable events
precision. Now my Core2Duo test system has only 30 interrupts per second
in idle. If not remaining syscons events, it could easily be 15. My
IvyBridge ultrabook first time in its history shown 5.5 hours of battery
time with full screen brightness and 10 hours with lid closed.
  -- Some kernel functions were added to make KPIs more complete.

I've successfully tested this patch on amd64 and arm.



--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-18 Thread Luigi Rizzo
On Mon, Dec 17, 2012 at 01:22:59PM -0800, Adrian Chadd wrote:
 Personally, I'd rather see some consistently used units here..

bintime (or something similar) is the correct choice here.
If we are concerned about the size (128 bit) then we
can map it to a shorter, fixed point format, such
as sign+31+32 as phk was suggesting.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Luigi Rizzo
On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
 Hi.
 
  I would instead do the following:
 
 I also don't very like the wide API and want to hear fresh ideas, but 
 approaches to time measurement there are too different to do what you 
 are proposing.  Main problem is that while ticks value is relative, 
 bintime is absolute. It is not easy to make conversion between them fast 
 and precise. I've managed to do it, but the only function that does it 
 now is _callout_reset_on(). All other functions are just passing values 
 down. I am not sure I want to duplicate that code in each place, though 
 doing it at least for for callout may be a good idea.

I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+   if (bt == NULL)
+   callout_reset_flags_on(td-td_slpcallout, timo,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+   else
+   callout_reset_bt_on(td-td_slpcallout, bt, precision,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
   that callouts use internally is a bintime+precision, define one
   single function (instead of two or three or six) that implements
   the blessed API, and implement the others with macros or
   inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
   It can be implemented through the *_bt() variant, and
   being a new function the only places where you introduce it
   require manual modifications so you can directly invoke
   the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it

cheers
luigi

 
 Creating sets of three functions I had three different goals:
  - callout_reset() -- it is legacy variant required to keep API 
 compatibility;
  - callout_reset_flags() -- it is for cases where custom precision 
 specification needs to be added to the existing code, or where direct 
 callout execution is needed. Conversion to bintime would additionally 
 complicate consumer code, that I would try to avoid.
  - callout_reset_bt() -- API for the new code, which needs high 
 precision and doesn't mind to operate bintime. Now there is only three 
 such places in kernel now, and I don't think there will be much more.
 
 Respectively, these three options are replicated to other APIs where 
 time intervals are used.
 
 PS: Please keep me in CC.
 
 -- 
 Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Luigi Rizzo
On Tue, Dec 18, 2012 at 06:36:43PM +0100, Luigi Rizzo wrote:
 On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
...
 So I believe my proposal would give large simplifications in
 the code and lead to a much cleaner implementation of what
 you have designed:
 
 1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;
 
 2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

to clarify: i am not sure if now the *_bt() variant takes flags too,
but my point is that the main API function should take all
supported arguments (including flags) and others should
simply be regarded as simplified versions. More or less what
we have for sockets, with send() and sendmsg() and friend

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Alexander Motin

On 18.12.2012 19:36, Luigi Rizzo wrote:

On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:

I would instead do the following:


I also don't very like the wide API and want to hear fresh ideas, but
approaches to time measurement there are too different to do what you
are proposing.  Main problem is that while ticks value is relative,
bintime is absolute. It is not easy to make conversion between them fast
and precise. I've managed to do it, but the only function that does it
now is _callout_reset_on(). All other functions are just passing values
down. I am not sure I want to duplicate that code in each place, though
doing it at least for for callout may be a good idea.


I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+   if (bt == NULL)
+   callout_reset_flags_on(td-td_slpcallout, timo,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+   else
+   callout_reset_bt_on(td-td_slpcallout, bt, precision,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it


Your words about useless cascaded ifs touched me. Actually I've looked 
on _callout_reset_bt_on() yesterday, thinking about moving tick to bt 
conversion to separate function or wrapper. The only thing we would save 
in such case is single integer argument (ticks), as all others (bt, 
prec, flags) are used in the new world order. From the other side, to 
make the conversion process really effective and correct, I've used 
quite specific way of obtaining time, that may change in the future. I 
would not really like it to be inlined in every consumer function and 
become an ABI. So I see two possible ways: make that conversion a 
separate non-inline function (that will require two temporary variables 
to return results and will consume some time on call/return), or make 
callout_reset_bt_on() to have extra ticks argument, allowing to use it 
in one or another way without external ifs and macros. In last case all 
_bt functions in other APIs will also obtain ticks, bt, pr and flags 
arguments. Actually flags there could be used to specify time scale 
(monotonic or wall) and time base (relative or absolute), if we 

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Alexander Motin

On 18.12.2012 20:03, Alexander Motin wrote:

On 18.12.2012 19:36, Luigi Rizzo wrote:

On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:

I would instead do the following:


I also don't very like the wide API and want to hear fresh ideas, but
approaches to time measurement there are too different to do what you
are proposing.  Main problem is that while ticks value is relative,
bintime is absolute. It is not easy to make conversion between them fast
and precise. I've managed to do it, but the only function that does it
now is _callout_reset_on(). All other functions are just passing values
down. I am not sure I want to duplicate that code in each place, though
doing it at least for for callout may be a good idea.


I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+if (bt == NULL)
+sleepq_set_timeout_flags(cvp, timo, flags);
+else
+sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+if (bt == NULL)
+callout_reset_flags_on(td-td_slpcallout, timo,
+sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+else
+callout_reset_bt_on(td-td_slpcallout, bt, precision,
+sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it


Your words about useless cascaded ifs touched me. Actually I've looked
on _callout_reset_bt_on() yesterday, thinking about moving tick to bt
conversion to separate function or wrapper. The only thing we would save
in such case is single integer argument (ticks), as all others (bt,
prec, flags) are used in the new world order. From the other side, to
make the conversion process really effective and correct, I've used
quite specific way of obtaining time, that may change in the future. I
would not really like it to be inlined in every consumer function and
become an ABI. So I see two possible ways: make that conversion a
separate non-inline function (that will require two temporary variables
to return results and will consume some time on call/return), or make
callout_reset_bt_on() to have extra ticks argument, allowing to use it
in one or another way without external ifs and macros. In last case all
_bt functions in other APIs will also obtain ticks, bt, pr and flags
arguments. Actually flags there could be used to specify time scale
(monotonic or wall) and time base (relative or absolute), if we decide
to implement all 

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Luigi Rizzo
[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.

The key difference between 'ticks' and bintimes (and the main
difficulty in the conversion) is that ticks are relative and bintimes
are interpreted as absolute. This could be easily solved by using
a flag to specify if the 'bt' argument is absolute or relative, and
passing the argument by value.

So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the
precision, and another 64 or 128 bit field contains the bintime.

How does this look ?

cheers
luigi

 On 18.12.2012 20:03, Alexander Motin wrote:
 On 18.12.2012 19:36, Luigi Rizzo wrote:
 On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
 I would instead do the following:
 
 I also don't very like the wide API and want to hear fresh ideas, but
 approaches to time measurement there are too different to do what you
 are proposing.  Main problem is that while ticks value is relative,
 bintime is absolute. It is not easy to make conversion between them fast
 and precise. I've managed to do it, but the only function that does it
 now is _callout_reset_on(). All other functions are just passing values
 down. I am not sure I want to duplicate that code in each place, though
 doing it at least for for callout may be a good idea.
 
 I am afraid the above is not convincing.
 
 Most/all of the APIs i mentioned still have the conversion from
 ticks to bintime, and the code in your patch is just
 building multiple parallel paths (one for each of the
 three versions of the same function) to some final
 piece of code where the conversion takes place.
 
 The problem is that all of this goes through a set of obfuscating
 macros and the end result is horrible.
 
 To be clear, i believe the work you have been doing on cleaning up
 callout is great, i am just saying that this is the time to look
 at the code from a few steps away and clean up all those design
 decisions that perhaps were made in a haste to make things work.
 
 I will give you another example to show how convoluted
 is the code now:
 
 cv_timedwait() and cv_timedwait_sig() now have three
 versions each (plain, bt, flags).
 
 These six are remapped through macros to two functions, _cv_timedwait()
 and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
 maps to _cv_timedwait_sig() )
 
 These two _cv_timedwait*() take both ticks and bintimes,
 and contain this sequence:
 
 +if (bt == NULL)
 +sleepq_set_timeout_flags(cvp, timo, flags);
 +else
 +sleepq_set_timeout_bt(cvp, bt, precision);
 
 Guess what, both sleepq_* are macros that remap to the same
 _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.
 
 But then if you dig into _sleepq_set_timeout() you'll see
 
 +if (bt == NULL)
 +callout_reset_flags_on(td-td_slpcallout, timo,
 +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
 +else
 +callout_reset_bt_on(td-td_slpcallout, bt, precision,
 +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
 
 and again both callout_reset*() are again remapped through
 macros to _callout_reset_on(), so another useless if (bt == NULL)
 And in the end you have the conversion from ticks to bintime.
 
 So basically the code path for cv_timedwait() has those
 two useless switches and one useless extra argument,
 and the conversion from ticks to bintime is down
 deep down in _callout_reset_on() where it can only
 be resolved at runtime, whereas by doing the conversion
 at the beginning the decision could have been made at compile
 time.
 
 So I believe my proposal would give large simplifications in
 the code and lead to a much cleaner implementation of what
 you have designed:
 
 1. acknowledge the fact that the only representation of time
 that callouts use internally is a bintime+precision, define one
 single function (instead of two or three or six) that implements
 the blessed API, and implement the others with macros or
 inline functions doing the appropriate conversions;
 
 2. specifically, the *_flags() variant has no reason to exist.
 It can be implemented through the *_bt() variant, and
 being a new function the only places where you introduce it
 require manual modifications so you can 

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Ian Lepore
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
 [top posting for readability;
 in summary we were discussing the new callout API trying to avoid
 an explosion of methods and arguments while at the same time
 supporting the old API and the new one]
 (I am also Cc-ing phk as he might have better insight
 on the topic).
 
 I think the patch you propose is a step in the right direction,
 but i still remain concerned by having to pass two bintimes
 (by reference, but they should really go by value)
 and one 'ticks' value to all these functions.
 
 I am also dubious that we need a full 128 bits to specify
 the 'precision': there would be absolutely no loss of functionality
 if we decided to specify the precision in powers of 2, so a precision
 'k' (signed) means 2^k seconds. This way 8 bits are enough to
 represent any precision we want.
 
 The key difference between 'ticks' and bintimes (and the main
 difficulty in the conversion) is that ticks are relative and bintimes
 are interpreted as absolute. This could be easily solved by using
 a flag to specify if the 'bt' argument is absolute or relative, and
 passing the argument by value.
 
 So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the
 precision, and another 64 or 128 bit field contains the bintime.
 
 How does this look ?
 
 cheers
 luigi
 

I tend to agree that the bintime should be passed by value instead of
reference.  That would allow an inline tickstobintime() that converts
relative ticks to an absolute bintime returned by value and passed right
along in one tidy line/clump of code without any temporary variables
cluttering things up.  

While the 1980s C programmer in me still wants to avoid returning
complex objects by value, the reality is that modern compilers tend to
generate really nice code for such constructs, usually without any
copying of the return value at all.

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.  It would
typical to want a 500uS timeout but be willing to late by up to 250uS if
that helped scheduling and performance.  Also, my idea of precision
would virtually always be I'm willing to be late by this much, but
never early by any amount.

To reinforce the point of that last paragraph... the way I'm looking at
these changes has nothing to do with power saving (I've never owned a
battery-operated computer, probably never will) and everything to do
with performance and being able to sleep accurately for less than a
tick.

-- Ian

  On 18.12.2012 20:03, Alexander Motin wrote:
  On 18.12.2012 19:36, Luigi Rizzo wrote:
  On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
  I would instead do the following:
  
  I also don't very like the wide API and want to hear fresh ideas, but
  approaches to time measurement there are too different to do what you
  are proposing.  Main problem is that while ticks value is relative,
  bintime is absolute. It is not easy to make conversion between them fast
  and precise. I've managed to do it, but the only function that does it
  now is _callout_reset_on(). All other functions are just passing values
  down. I am not sure I want to duplicate that code in each place, though
  doing it at least for for callout may be a good idea.
  
  I am afraid the above is not convincing.
  
  Most/all of the APIs i mentioned still have the conversion from
  ticks to bintime, and the code in your patch is just
  building multiple parallel paths (one for each of the
  three versions of the same function) to some final
  piece of code where the conversion takes place.
  
  The problem is that all of this goes through a set of obfuscating
  macros and the end result is horrible.
  
  To be clear, i believe the work you have been doing on cleaning up
  callout is great, i am just saying that this is the time to look
  at the code from a few steps away and clean up all those design
  decisions that perhaps were made in a haste to make things work.
  
  I will give you another example to show how convoluted
  is the code now:
  
  cv_timedwait() and cv_timedwait_sig() now have three
  versions each (plain, bt, flags).
  
  These six are remapped through macros to two functions, _cv_timedwait()
  and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
  maps to _cv_timedwait_sig() )
  
  These two _cv_timedwait*() take both ticks and bintimes,
  and contain this sequence:
  
  +if (bt == NULL)
  +sleepq_set_timeout_flags(cvp, timo, flags);
  +else
  +sleepq_set_timeout_bt(cvp, bt, precision);
  
  Guess what, both sleepq_* are macros that remap to the same
  _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.
  
  But then if you dig into _sleepq_set_timeout() you'll see
  
  +if (bt == NULL)
  +callout_reset_flags_on(td-td_slpcallout, timo,
  +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
  +else
  +

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Luigi Rizzo
On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
 On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
  [top posting for readability;
  in summary we were discussing the new callout API trying to avoid
  an explosion of methods and arguments while at the same time
  supporting the old API and the new one]
  (I am also Cc-ing phk as he might have better insight
  on the topic).
  
  I think the patch you propose is a step in the right direction,
  but i still remain concerned by having to pass two bintimes
  (by reference, but they should really go by value)
  and one 'ticks' value to all these functions.
  
  I am also dubious that we need a full 128 bits to specify
  the 'precision': there would be absolutely no loss of functionality
  if we decided to specify the precision in powers of 2, so a precision
  'k' (signed) means 2^k seconds. This way 8 bits are enough to
  represent any precision we want.

...
 I'm not so sure about the 2^k precision.  You speak of seconds, but I
 would be worrying about sub-second precision in my work.  It would
 typical to want a 500uS timeout but be willing to late by up to 250uS if

i said k is signed so negative values represent fractions of a
second. 2^-128 is pretty short :)

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Ian Lepore
On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:
 On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
  On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
   [top posting for readability;
   in summary we were discussing the new callout API trying to avoid
   an explosion of methods and arguments while at the same time
   supporting the old API and the new one]
   (I am also Cc-ing phk as he might have better insight
   on the topic).
   
   I think the patch you propose is a step in the right direction,
   but i still remain concerned by having to pass two bintimes
   (by reference, but they should really go by value)
   and one 'ticks' value to all these functions.
   
   I am also dubious that we need a full 128 bits to specify
   the 'precision': there would be absolutely no loss of functionality
   if we decided to specify the precision in powers of 2, so a precision
   'k' (signed) means 2^k seconds. This way 8 bits are enough to
   represent any precision we want.
 
 ...
  I'm not so sure about the 2^k precision.  You speak of seconds, but I
  would be worrying about sub-second precision in my work.  It would
  typical to want a 500uS timeout but be willing to late by up to 250uS if
 
 i said k is signed so negative values represent fractions of a
 second. 2^-128 is pretty short :)
 
 cheers
 luigi

Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
further review, then could precision be encoded in 8 bits of the flags,
eliminating another parm?

-- Ian


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Luigi Rizzo
On Tue, Dec 18, 2012 at 04:37:10PM -0700, Ian Lepore wrote:
 On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:
  On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
   On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.
  
  ...
   I'm not so sure about the 2^k precision.  You speak of seconds, but I
   would be worrying about sub-second precision in my work.  It would
   typical to want a 500uS timeout but be willing to late by up to 250uS if
  
  i said k is signed so negative values represent fractions of a
  second. 2^-128 is pretty short :)
  
  cheers
  luigi
 
 Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
 further review, then could precision be encoded in 8 bits of the flags,
 eliminating another parm?

that was also what i wrote later in the message :)

now we should figure out some use for the remaining 22 bits of the flags

cheers
luigi

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-17 Thread Alexander Motin

On 17.12.2012 03:29, Adrian Chadd wrote:

On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote:

Here is one more version. Unless something new will be found/reported this
may be the last one, because me and Davide are quite satisfied with the
results. If everything will be fine, I think we could commit it to HEAD
closer to the end of the week:
http://people.freebsd.org/~mav/calloutng_12_16.patch


Don't you think one week is a little on the low side for reviewing
something this critical?


It was in public development for the last half a year. IIRC, previous 
announce by Davide few months ago caused no any feedback. If you say you 
will review it in two weeks, I will wait for two weeks. But I don't want 
to wait without a purpose.



Would you mind approaching some of the cluster peeps and seeing if
they'll run this up on the ref10* boxes and VMs, just to get some
further exposure?


Are the ref* system have any load to see anything?

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-17 Thread Alexander Motin

On 17.12.2012 05:38, Adrian Chadd wrote:

On 16 December 2012 18:31, Garrett Cooper yaneg...@gmail.com wrote:


Would you mind approaching some of the cluster peeps and seeing if
they'll run this up on the ref10* boxes and VMs, just to get some
further exposure?


And maybe tinderbox..?


Tinderbox is a great idea.

Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too?


It would be good to test, though I know that at least dummynet is 
written awful from the point of this project with its

callout_reset(dn_timeout, 1, dummynet, NULL);
It should work, but kill most of power benefits. I was promised it will 
be fixed after this project end.



What else gets heavily callout /timer driven? Try some computational
workloads that stress the fairness of ULE/4BSD, maybe?


Schedulers are driven directly by hardclock()/statclock(), so fairness 
is not affected here. If CPU is not idle, it will receive full set of 
required events with maximum available precision.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-17 Thread Adrian Chadd
On 16 December 2012 23:57, Andriy Gapon a...@freebsd.org wrote:

 Thank god that this feature was developed in a branch, it was developed for a
 long period of time and there were people who routinely reviewed and tested 
 (and
 really used) it.  And yeah, its design was announced and discussed well in
 advance too.

I can see that; I was even there at David's presentation at BSDCan
2012 about it.

Now that it's finished though, it would be nice to get some more
stress testing before committing it. Just because it's been developed
in a branch doesn't at all imply that it's had wide testing. It's now
imminently going into the tree, so that may scare (heh!) a few people
into testing it.

I'm sure it'll mostly just work, that it'll not really break anything.
It just to me feels that a week of warning before committing something
like this is a little soon. It's _just_ been finished and the authors
have been doing some last minute changes as they get better ideas on
implementing things. I think it's great work, I'd just like to see
some wider testing.

So to put my money where my mouth is, I bought a new hard disk for my
T60 last week. I'll install -HEAD on it tomorrow and give it a whirl.

Thanks,


Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


calloutng and dummynet (Re: [RFC/RFT] calloutng)

2012-12-17 Thread Luigi Rizzo
On Mon, Dec 17, 2012 at 10:14:29AM +0200, Alexander Motin wrote:
 On 17.12.2012 05:38, Adrian Chadd wrote:
...
 Maybe hit up the altq/pf using crowd and see if they'll test this stuff 
 out too?
 
 It would be good to test, though I know that at least dummynet is 
 written awful from the point of this project with its
   callout_reset(dn_timeout, 1, dummynet, NULL);
 It should work, but kill most of power benefits. I was promised it will 
 be fixed after this project end.

never trust italians :)

but it is good that you are reminding me this, hopefully
i will be able to give it a shot after the holidays

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Luigi Rizzo
[addressing the various items separately]

On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
 On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
...
  - for several functions the only change is the name of an argument
from busy to us. Can you elaborate the reason for the change,
and whether us means microseconds or the pronoun ?)
 
 
 Please see r242905 by mav@.

i see the goal of this patch is to pass along the amount of
time till the next timer.

I wonder why the choice is to use (actually, call) the value
microseconds rather use a bintime or something scaled and with a
well defined resolution.

In fact looking at the relevant diff

http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905

cpu_idleclock() actually returns a value that is not even microseconds
but 1/(2^20) seconds. The value seems to be ignored right now
so it would be a good time to discuss the resolution.

I am concerned that at some point (5 years from now perhaps ?)
microseconds might start to become too coarse and we would like
something with a more fine-grained resolution. On the
other hand, for the purposes of this change, we can probably
live with an upper limit of some seconds (waking up the machine
once per second is not going to kill performance).

So i would suggest to make the argument to these functions
uint_32 or uint_64 (preferably the same for 32- and 64-bit machines),
rename it to something different from 'us'
and have at least 28-30 fractional bits to represent a bintime.

Right now you are using a bintime with 20 fractional and 11 or 43
bits for the integer part.


cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Alexander Motin

Hi.

 I wonder why the choice is to use (actually, call) the value
 microseconds rather use a bintime or something scaled and with a
 well defined resolution.

It was kind of engineering choice. I've chosen microseconds, following 
values used by ACPI to represent CPU sleep states exit latencies.  Now 
that is the only usage for that value.  If CPUs so much reduce wakeup 
latencies to make this scale too coarse, this type will be the smallest 
of our optimization tasks. On the other side, I have some doubts that we 
will be able to reach supported 2048 seconds limit on the integer side. 
Now even completely empty idle system has about 30 interrupts per 
second, that is far from 0.0005. From the other side, I don't know any 
system where CPUs have 2048 seconds wakeup latency.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


API explosion (Re: [RFC/RFT] calloutng)

2012-12-17 Thread Luigi Rizzo
[again, response to another issue i raised]

On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
 On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
...
  Finally, a more substantial comment:
  - a lot of functions which formerly had only a timo argument
now have timo, bt, precision, flags. Take seltdwait() as an example.
 
 
 seltdwait() is not part of the public KPI. It has been modified to
 avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e.
 two separate functions, even though we could share most of the code is
 not a clever approach, IMHO.
 As I told before, seltdwait() is not exposed so we can modify its
 argument without breaking anything.
 
It seems that you have been undecided between two approaches:
for some of these functions you have preserved the original function
that deals with ticks and introduced a new one that deals with the
  bintime,
whereas in other cases you have modified the original function to add
bt, precision, flags.
 
 
 I'm not. All the functions which are part of the public KPI (e.g.
 condvar(9), sleepq(9), *) are still available.  *_flags variants have
 been introduced so that consumers can take advantage of the new
 'precision tolerance mechanism' implemented. Also, *_bt variants have
 been introduced. I don't see any undecision between the two
 approaches.
 Please note that now the callout backend deals with bintime, so every
 time callout_reset_on() is called, the 'tick' argument passed is
 silently converted to bintime.

I will try to give more specific example, using the latest patch
from mav

http://people.freebsd.org/~mav/calloutng_12_16.patch

In the manpage, for instance, the existing functions
now are extended with two more variants (sometimes;
msleep_spin() for instance is missing msleep_spin_bt()
but perhaps that is just an oversight).

 .Nm sleepq_set_timeout ,
+.Nm sleepq_set_timeout_flags ,
+.Nm sleepq_set_timeout_bt ,

 .Nm msleep ,
+.Nm msleep_flags ,
+.Nm msleep_bt ,
 .Nm msleep_spin ,
+.Nm msleep_spin_flags ,
 .Nm pause ,
+.Nm pause_flags ,
+.Nm pause_bt ,
 .Nm tsleep ,
+.Nm tsleep_flags ,
+.Nm tsleep_bt ,

 .Nm cv_timedwait ,
+.Nm cv_timedwait_bt ,
+.Nm cv_timedwait_flags ,
 .Nm cv_timedwait_sig ,
+.Nm cv_timedwait_sig_bt ,
+.Nm cv_timedwait_sig_flags ,

 .Nm callout_reset ,
+.Nm callout_reset_flags ,
 .Nm callout_reset_on ,
+.Nm callout_reset_flags_on ,
+.Nm callout_reset_bt_on ,

If you look at the backends, they take both a timo and a bintime.

-int_cv_timedwait(struct cv *cvp, struct lock_object *lock, int 
timo);
-int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, int 
timo);
+int_cv_timedwait(struct cv *cvp, struct lock_object *lock,
+   struct bintime *bt, struct bintime *precision, int timo,
+   int flags);
+int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock,
+   struct bintime *bt, struct bintime *precision, int timo,
+   int flags);

and then internally they call the 'timo' or the 'bt' version
depending on the case

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

So basically you are doing the following:
   + create two new variant for each existing function

foo(, ... timo, ... )   old method
foo_flags(, ... timo, ... ) new method
foo_bt(... , bt, precision, ...)new method

 (the 'API explosion' i am mentioning in the Subject)

   + the variants are mapped to the same internal function
_foo_internal(..., timo, bt, precision, flags, ...)

   + and then the internal function has a (runtime) conditional

if (bt == NULL) {
// convert timo to bt
} else {
// have a bt + precision
}
...

I would instead do the following:

+ create a new API function that takes only bintime+precision+flags, no ticks.
  I am not sure if there is even a need to have an internal name
_cv_timedwait_sig(  )
  or you can just have
cv_timedwait_sig_bt(...)

+ use a macro or an inline function to remap the old API to
  the (single) new one, making the argument conversion immediately.
#define cv_timedwait_sig(...)   cv_timedwait_sig_bt( ...)

  This has the advantage that conversions might be done at compile
  time perhaps with some advantage in terms of code and performance.

+ do not bother creating yet another cv_timedwait_sig_flags() function.
  Since it did not exist before, you have to do the conversion manually
  anyways, at which point you just change the argument to use bintime
  instead of ticks.


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Davide Italiano
On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
 [addressing the various items separately]

 On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
 On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
 ...
  - for several functions the only change is the name of an argument
from busy to us. Can you elaborate the reason for the change,
and whether us means microseconds or the pronoun ?)
 

 Please see r242905 by mav@.

 i see the goal of this patch is to pass along the amount of
 time till the next timer.

 I wonder why the choice is to use (actually, call) the value
 microseconds rather use a bintime or something scaled and with a
 well defined resolution.

 In fact looking at the relevant diff

 http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905

 cpu_idleclock() actually returns a value that is not even microseconds
 but 1/(2^20) seconds. The value seems to be ignored right now
 so it would be a good time to discuss the resolution.

 I am concerned that at some point (5 years from now perhaps ?)
 microseconds might start to become too coarse and we would like
 something with a more fine-grained resolution. On the
 other hand, for the purposes of this change, we can probably
 live with an upper limit of some seconds (waking up the machine
 once per second is not going to kill performance).


I would talk more about power consumption problem rather than performances.
Yes, you're right because now, even with calloutng changes, the CPU is
woken up at least twice per second.
The wheel scan, in case it doesn't find a new callout to schedule in
the next half-second, schedules an interrupt half a second from now
(where now is the time obtained using getbinuptime()/binuptime()).
This is a threshold we set up empirically, and it resulted is good
for now. But in the future someone may raise the threshold to 1
second, 10 seconds, or something like. So, I don't agree with your
statement.

 So i would suggest to make the argument to these functions
 uint_32 or uint_64 (preferably the same for 32- and 64-bit machines),
 rename it to something different from 'us'
 and have at least 28-30 fractional bits to represent a bintime.

 Right now you are using a bintime with 20 fractional and 11 or 43
 bits for the integer part.


 cheers
 luigi

Thanks

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-17 Thread Alexander Motin

Hi.

 I would instead do the following:

I also don't very like the wide API and want to hear fresh ideas, but 
approaches to time measurement there are too different to do what you 
are proposing.  Main problem is that while ticks value is relative, 
bintime is absolute. It is not easy to make conversion between them fast 
and precise. I've managed to do it, but the only function that does it 
now is _callout_reset_on(). All other functions are just passing values 
down. I am not sure I want to duplicate that code in each place, though 
doing it at least for for callout may be a good idea.


Creating sets of three functions I had three different goals:
 - callout_reset() -- it is legacy variant required to keep API 
compatibility;
 - callout_reset_flags() -- it is for cases where custom precision 
specification needs to be added to the existing code, or where direct 
callout execution is needed. Conversion to bintime would additionally 
complicate consumer code, that I would try to avoid.
 - callout_reset_bt() -- API for the new code, which needs high 
precision and doesn't mind to operate bintime. Now there is only three 
such places in kernel now, and I don't think there will be much more.


Respectively, these three options are replicated to other APIs where 
time intervals are used.


PS: Please keep me in CC.

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Luigi Rizzo
On Mon, Dec 17, 2012 at 12:17:54PM -0800, Davide Italiano wrote:
 On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
  [addressing the various items separately]
 
  On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
  On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
  ...
   - for several functions the only change is the name of an argument
 from busy to us. Can you elaborate the reason for the change,
 and whether us means microseconds or the pronoun ?)
  
 
  Please see r242905 by mav@.
 
  i see the goal of this patch is to pass along the amount of
  time till the next timer.
 
  I wonder why the choice is to use (actually, call) the value
  microseconds rather use a bintime or something scaled and with a
  well defined resolution.
 
  In fact looking at the relevant diff
 
  http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905
 
  cpu_idleclock() actually returns a value that is not even microseconds
  but 1/(2^20) seconds. The value seems to be ignored right now
  so it would be a good time to discuss the resolution.
 
  I am concerned that at some point (5 years from now perhaps ?)
  microseconds might start to become too coarse and we would like
  something with a more fine-grained resolution. On the
  other hand, for the purposes of this change, we can probably
  live with an upper limit of some seconds (waking up the machine
  once per second is not going to kill performance).
 
 
 I would talk more about power consumption problem rather than performances.
 Yes, you're right because now, even with calloutng changes, the CPU is
 woken up at least twice per second.
 The wheel scan, in case it doesn't find a new callout to schedule in
 the next half-second, schedules an interrupt half a second from now
 (where now is the time obtained using getbinuptime()/binuptime()).
 This is a threshold we set up empirically, and it resulted is good
 for now. But in the future someone may raise the threshold to 1
 second, 10 seconds, or something like. So, I don't agree with your
 statement.

this is only an issue if we want to use 32 bits.
If we go to 64 bits, there is enogh space for picoseconds
on the fractional part, and a few years in the integer part.
but keep in mind, even powerwise, i doubt the exit from deep
sleep and a callout takes more than 500us so even doing that
once per second gives less than 0.5 per mille increase in
power compared to a machine that is always idle

This is really noise that we should not worry about.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Adrian Chadd
Personally, I'd rather see some consistently used units here..



Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Poul-Henning Kamp

In message 50cf79ad.9040...@freebsd.org, Alexander Motin writes:
Hi.

  I wonder why the choice is to use (actually, call) the value
  microseconds rather use a bintime or something scaled and with a
  well defined resolution.

It was kind of engineering choice. I've chosen microseconds [...]

And that was the wrong choice, the format should be a binary number
so arithmetic and comparisons does not become a nightmare.

If people need milliseconds or microseconds, they can get that
using suitable #defined multiplication factors.

A 64 bit type, with 32bit before and after the binary point would
be sufficient for now, and easily extensible to something larger
should one or more laws of computing nature be changed.

So do the following:

typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 1000)

And stop crapping around with mixed-radix numbers, even the british
changed to decimal coinage to avoid that crap...


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-17 Thread David O'Brien
On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote:
 On 15.12.2012 23:03, Alexander Motin wrote:
  Sorry, it's my fault. I've tried to save some time on patch generation
  and forgot about that change in lib/. We haven't touched user-level in
  our work except that file. Here is patch with that chunk added:
  http://people.freebsd.org/~mav/calloutng_12_15_1.patch
 
 And one more part I've missed is manual pages update, that probably 
 needs more improvements:
 http://people.freebsd.org/~mav/calloutng_12_15.man.patch

Perhaps use the SCM at what its good at?

Sync your branch with HEAD and then do an 'svn diff ^/head' and your
branch.

-- 
-- David  (obr...@freebsd.org)
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-17 Thread Davide Italiano
On Mon, Dec 17, 2012 at 2:39 PM, David O'Brien obr...@freebsd.org wrote:
 On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote:
 On 15.12.2012 23:03, Alexander Motin wrote:
  Sorry, it's my fault. I've tried to save some time on patch generation
  and forgot about that change in lib/. We haven't touched user-level in
  our work except that file. Here is patch with that chunk added:
  http://people.freebsd.org/~mav/calloutng_12_15_1.patch

 And one more part I've missed is manual pages update, that probably
 needs more improvements:
 http://people.freebsd.org/~mav/calloutng_12_15.man.patch

 Perhaps use the SCM at what its good at?

 Sync your branch with HEAD and then do an 'svn diff ^/head' and your
 branch.

 --
 -- David  (obr...@freebsd.org)

Last time I tried doing that the way you describe, I got an output
with tons svn:mergeinfo and I didn't find a way to suppress them if
not manually.
e.g.

Property changes on: usr.bin/procstat
___
Modified: svn:mergeinfo
   Merged /head/usr.bin/procstat:r236314-239017
Index: usr.bin/calendar
===
--- usr.bin/calendar(.../head)  (revision 239166)
+++ usr.bin/calendar(.../projects/calloutng)(revision 239166)


Can you help me in understanding what I did wrong?

Thanks
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-16 Thread Bruce Evans

On Sat, 15 Dec 2012, Garrett Cooper wrote:


On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote:


On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote:

Hi.

I'm sorry to interrupt review, but as usual good ideas came during the
final testing, causing another round. :)  Here is updated patch for
HEAD, that includes several new changes:
http://people.freebsd.org/~mav/calloutng_12_15.patch


This patch breaks the libprocstat build.

Specifically, the OpenSolaris sys/time.h defines the preprocessor
symbols gethrestime and gethrestime_sec. These symbols are also defined
in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h.
libprocstat:zfs.c is compiled using include paths that pick up the
OpenSolaris time.h, and with this patch _callout.h includes sys/time.h.

zfs.c includes taskqueue.h (with _KERNEL defined), which includes
_callout.h, so both time.h and zfs_context.h are included in zfs.c, and
the symbols are thus defined twice.


Gross namespace pollution.  sys/_callout.h exists so that the full
namespace pollution of sys/callout.h doesn't get included nested.  But
sys/time.h is much more polluted than sys/callout.h.

However, sys/time.h is old standard pollution in sys/param.h, and
sys/callout.h is not so old standard pollution in sys/systm.h.  It is
a bug to not include sys/param.h and sys/systm.h in most kernel source
code, so these nested includes are just style bugs -- they have no
effect for correct kernel source code.


The patch below fixes the build for me. Another approach might be to
include sys/_task.h instead of taskqueue.h at the beginning of zfs.c.


Good if it works.


I had a patch open once upon a time to cleanup inclusion of sys/time.h all 
over the tree and deal with the sys/time.h - time.h pollution issue, but it 
got dropped due to lack of interest (20~30 apps/libs were affected IIRC and I only 
really got assistance in fixing the UFS and bsnmpd pieces, and gave up due to lack of 
response from maintainers). dtrace/zfs is a definite instigator in this pollution (I 
remember nasty cddl/... pollution with the compat sys/time.h header).


Please use the unix newline character in mail.  The above is difficult to
quote.

The standard sys/time.h pollution in sys/param.h is only in the kernel,
and there aren't many direct includes of sys/time.h in the kernel.  Userland
is different and many of the direct includes were correct.  But not POSIX
specifies that struct timespec and struct timeval be defined in most places
where they are needed, so the includes of sys/time.h are not necessary
for POSIX or FreeBSD, although FreeBSD man pages still say that they
are necessary.  The sys/time.h - time.h pollution issue is also only
for userland.  Many places depend on one including the other, and include
the wrong one themself.


Bottom line: make sure anything new you're defining isn't already 
defined via POSIX or other OSes, and if so please try to make the 
implementations match (so that eventual POSIX inclusion might be possible) and 
when in doubt I suggest consulting standards@ / brde@.


Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-16 Thread Alexander Motin

Hi.

Here is one more version. Unless something new will be found/reported 
this may be the last one, because me and Davide are quite satisfied with 
the results. If everything will be fine, I think we could commit it to 
HEAD closer to the end of the week:

http://people.freebsd.org/~mav/calloutng_12_16.patch

Changes in this version:
 -- Removed couple of redundant variables in callout implementation, 
that reduced sizeof(struct callout) by two pointers and simplified some 
internal code.
 -- syscons driver was made to schedule only 1-2 callouts per second 
instead of 20-30 before when console is in graphical mode and there are 
few other things to do. Now my laptop has only about 30 interrupts per 
second total during idle periods with X running.
 -- i8254 eventtimer driver was optimized to work faster in disabled by 
default one-shot mode.

 -- Few kernel functions were added to make KPIs more complete.
 -- Man pages were updated.
 -- Some style fixes were made.

On 15.12.2012 18:55, Alexander Motin wrote:

I'm sorry to interrupt review, but as usual good ideas came during the
final testing, causing another round. :)  Here is updated patch for
HEAD, that includes several new changes:
http://people.freebsd.org/~mav/calloutng_12_15.patch

The new changes are:
  -- Precision and event aggregation code was reworked. Instead of
previous -prec/+prec representation, precision is now single-sided --
-0/+prec. It allowed to significantly improve precision on long time
intervals for APIs which imply that event should not happen before the
specified time. Depending on CPU activity, mistake for long time
intervals now will never be more then 1-500ms, even if specified
precision allows more.
  -- Some minor optimizations were made to reduce callout overhead and
latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer
and TSC timecounter usleep(1) call from user-level executes in just
5-6us, instead of 7-8us before. Now it can do 180K cycles per second on
single CPU with only partial CPU load.
  -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd,
setrlimit) were modified to reduce number of interrupts, also with event
aggregation by explicit specification of the acceptable events
precision. Now my Core2Duo test system has only 30 interrupts per second
in idle. If not remaining syscons events, it could easily be 15. My
IvyBridge ultrabook first time in its history shown 5.5 hours of battery
time with full screen brightness and 10 hours with lid closed.
  -- Some kernel functions were added to make KPIs more complete.

I've successfully tested this patch on amd64 and arm.




--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-16 Thread Adrian Chadd
On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote:
 Hi.

 Here is one more version. Unless something new will be found/reported this
 may be the last one, because me and Davide are quite satisfied with the
 results. If everything will be fine, I think we could commit it to HEAD
 closer to the end of the week:
 http://people.freebsd.org/~mav/calloutng_12_16.patch

Hi,

Don't you think one week is a little on the low side for reviewing
something this critical?

Would you mind approaching some of the cluster peeps and seeing if
they'll run this up on the ref10* boxes and VMs, just to get some
further exposure?

Thanks,




Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-16 Thread Garrett Cooper
On Dec 16, 2012, at 5:29 PM, Adrian Chadd adr...@freebsd.org wrote:

 On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote:
 Hi.
 
 Here is one more version. Unless something new will be found/reported this
 may be the last one, because me and Davide are quite satisfied with the
 results. If everything will be fine, I think we could commit it to HEAD
 closer to the end of the week:
 http://people.freebsd.org/~mav/calloutng_12_16.patch
 
 Hi,
 
 Don't you think one week is a little on the low side for reviewing
 something this critical?
 
 Would you mind approaching some of the cluster peeps and seeing if
 they'll run this up on the ref10* boxes and VMs, just to get some
 further exposure?

And maybe tinderbox..?
Thanks,
-Garrett
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFC/RFT] calloutng

2012-12-16 Thread Adrian Chadd
On 16 December 2012 18:31, Garrett Cooper yaneg...@gmail.com wrote:

 Would you mind approaching some of the cluster peeps and seeing if
 they'll run this up on the ref10* boxes and VMs, just to get some
 further exposure?

 And maybe tinderbox..?

Tinderbox is a great idea.

Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too?

What else gets heavily callout /timer driven? Try some computational
workloads that stress the fairness of ULE/4BSD, maybe?


Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


  1   2   >