subject:"\[RFC\/RFT\] calloutng"


On 02.01.2013 12:57, Luigi Rizzo wrote:

On Mon, Dec 31, 2012 at 12:17:27PM +0200, Alexander Motin wrote:

On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:

...

Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at
1us is probably sum of minimal latency of the hardware eventtimer on the
specific platform and some software processing overhead (syscall,
callout, timecouters, scheduler, etc). At later points system starts to
noticeably use precision specified by kern.timecounter.alloweddeviation
sysctl. It affects results from two sides: 1) extending intervals for
specified percent of time to allow event aggregation, and 2) choosing
time base between fast getbinuptime() and precise binuptime(). Extending
interval is needed to aggregate not only callouts with each other, but
also callouts with other system events, which are impossible to schedule
in advance. It gives specified relative error, but no more then one CPU
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks)
it is 1/hz, for completely idle one it can be up to 0.5s. Second point
allows to reduce processing overhead by the cost of error up to 1/hz for
long periods ((100/allowed)*(1/hz)), when it is used.


i am not sure what you mean by extending interval, but i believe the
logic should be the following:

- say user requests a timeout after X seconds and with a tolerance of D second
   (both X and D are fractional, so they can be short).  Interpret this as

the system should do its best to generate an event between X and X+D 
seconds

- convert X to an absolute time, T_X

- if there are any pending events already scheduled between T_X and T_X+D,
   then by definition they are acceptable. Attach the requested timeout
   to the earliest of these events.


All above is true, but not following.


- otherwise, schedule an event at time T_X (because there is no valid
   reason to generate a late event, and it makes no sense from an
   energy saving standpoint, either -- see below).


System may have many interrupts except timer: network, disk, ... WiFi 
cards generate interrupts with AP beacon rate -- dozens times per 
second. It is not very efficient to wake up CPU precisely at T_X time, 
that may be just 100us earlier then next hardware interrupt. That's why 
timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next 
statclock, ...). As result, event will be handled within allowed range, 
but real delay will depends on current environment conditions.



It seems to me that you are instead extending the requested interval

Re: [RFC/RFT] calloutng

2013-01-02 Thread Luigi Rizzo

On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
 On 02.01.2013 12:57, Luigi Rizzo wrote:
...
 i am not sure what you mean by extending interval, but i believe the
 logic should be the following:
 
 - say user requests a timeout after X seconds and with a tolerance of D 
 second
(both X and D are fractional, so they can be short).  Interpret this as
 
 the system should do its best to generate an event between X and X+D 
 seconds
 
 - convert X to an absolute time, T_X
 
 - if there are any pending events already scheduled between T_X and T_X+D,
then by definition they are acceptable. Attach the requested timeout
to the earliest of these events.
 
 All above is true, but not following.
 
 - otherwise, schedule an event at time T_X (because there is no valid
reason to generate a late event, and it makes no sense from an
energy saving standpoint, either -- see below).
 
 System may have many interrupts except timer: network, disk, ... WiFi 
 cards generate interrupts with AP beacon rate -- dozens times per 
 second. It is not very efficient to wake up CPU precisely at T_X time, 
 that may be just 100us earlier then next hardware interrupt. That's why 
 timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next 
 statclock, ...). As result, event will be handled within allowed range, 
 but real delay will depends on current environment conditions.

I don't see why system events (hardclock, statclock, 0.5s,...)
need to be treated specially -- and i am saying this also in
the interest of simplifying the logic of the code.

First of all, if you know that there is already a hardclock/statclock/*
scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
was no event scheduled in [T_X, T_X+D] so you need to generate
a new one.

Surely scheduling the event at T_X+D instead of T_X increases the
chance of merging events. But the saving are smaller and smaller
as the value X increases. This particular client will only
change its request rate from 1/X to 1/(X+D) so in relative terms
the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X

Example: if X = 300ms, and D = 10ms (as in the test case)
you just save one interrupt every 30seconds by scheduling at
T_X+D instead of T_X. Are we actually able to measure the
difference ?

Even at high interrupt rates (e.g. X = 1ms) you are not
going to save a lot unless the tolerance D is very large,
which is generally undesirable for other reasons
(presumably, applications are not going to be happy
if you artificially double their timeouts).
Now, say your application requests timeouts every X = 300ms.

 It seems to me that you are instead extending the requested interval
 upfront, which causes some gratuitous pessimizations in scheduling
 the callout.
 
 Re. energy savings: the gain in extending the timeout cannot exceed
 the value D/X. So while it may make sense to extend a 1us request
 to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
 it is completely pointless from an energy saving standpoint to
 introduce a 10ms error on a 300ms request.
 
 I am not so sure in this. When CPU package is in C7 sleep state with all 
 buses and caches shut down and memory set to self refresh, it consumes 
 very few (some milli-Watts) of power. Wake up from that state takes 
 100us or even more with power consumption much higher then normal 
 operational one. Sure, if we compare it with power consumption of 100% 
 CPU load, difference between 10 and 100 wakeups per second may be small, 
 but when comparing to each other in some low-power environment for 
 mostly idle system it may be much more significant.

see above -- at low rates the difference is not measurable,
at high rates thCe only obvious answer is do not use C7 unless
if the next interrupt is due in less than 2..5 milliseconds

 (even though i hate the idea that a 1us request defaults to
 a 50us delay; but that is hopefully something that can be tuned
 in a platform-specific way and perhaps at runtime).
 
 It is 50us on this ARM. On SandyBridge Core i7 it is only about 2us.

very good, i suspected something similar, just wanted to be sure :)

cheers
luigi

 -- 
 Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал:

 On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
  On 02.01.2013 12:57, Luigi Rizzo wrote:
 ...
  i am not sure what you mean by extending interval, but i believe the
  logic should be the following:
  
  - say user requests a timeout after X seconds and with a tolerance of D
  second
 (both X and D are fractional, so they can be short).  Interpret
this as
  
  the system should do its best to generate an event between X and
X+D
  seconds
  
  - convert X to an absolute time, T_X
  
  - if there are any pending events already scheduled between T_X and
T_X+D,
 then by definition they are acceptable. Attach the requested timeout
 to the earliest of these events.
 
  All above is true, but not following.
 
  - otherwise, schedule an event at time T_X (because there is no valid
 reason to generate a late event, and it makes no sense from an
 energy saving standpoint, either -- see below).
 
  System may have many interrupts except timer: network, disk, ... WiFi
  cards generate interrupts with AP beacon rate -- dozens times per
  second. It is not very efficient to wake up CPU precisely at T_X time,
  that may be just 100us earlier then next hardware interrupt. That's why
  timer interrupts are scheduled at min(T_X+D, 0.5s, next hardclock, next
  statclock, ...). As result, event will be handled within allowed range,
  but real delay will depends on current environment conditions.

 I don't see why system events (hardclock, statclock, 0.5s,...)
 need to be treated specially -- and i am saying this also in
 the interest of simplifying the logic of the code.

Sure. That is mostly for historical reasons. At some point they should
disappear, just not now, as patch is already quite big.

 First of all, if you know that there is already a hardclock/statclock/*
 scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
 was no event scheduled in [T_X, T_X+D] so you need to generate
 a new one.

That is true, but my main point was about merging with external events,
which I can't predict and the only way to merge is increase sleep period,
hoping for better.

 Surely scheduling the event at T_X+D instead of T_X increases the
 chance of merging events. But the saving are smaller and smaller
 as the value X increases. This particular client will only
 change its request rate from 1/X to 1/(X+D) so in relative terms
 the gain is ( 1/X - 1/(X+D) ) / (1/(X+D) ) = D/X

 Example: if X = 300ms, and D = 10ms (as in the test case)
 you just save one interrupt every 30seconds by scheduling at
 T_X+D instead of T_X. Are we actually able to measure the
 difference ?

 Even at high interrupt rates (e.g. X = 1ms) you are not
 going to save a lot unless the tolerance D is very large,
 which is generally undesirable for other reasons
 (presumably, applications are not going to be happy
 if you artificially double their timeouts).
 Now, say your application requests timeouts every X = 300ms.

With default precision set to 5% it will be only 5% save from periods
increase. But that is absolutely not my goal!

Imagine different case: you have NIC interrupts at 1000Hz. Also you have
100 callouts with 100ms period each. If we program timer with absolute
precision, you will get about 2000Hz of total interrupt rate. But if we
allow just 2% deviation, most of callouts will be grouped with NIC
interrupts and total rate will be 1000Hz. Loosing _less_ then 2% of
precision we are reducing interrupt rate in _half_!

  It seems to me that you are instead extending the requested interval
  upfront, which causes some gratuitous pessimizations in scheduling
  the callout.
  
  Re. energy savings: the gain in extending the timeout cannot exceed
  the value D/X. So while it may make sense to extend a 1us request
  to 50us to go (theoretically) from 1M callouts/s to 20K callouts/s,
  it is completely pointless from an energy saving standpoint to
  introduce a 10ms error on a 300ms request.
 
  I am not so sure in this. When CPU package is in C7 sleep state with all
  buses and caches shut down and memory set to self refresh, it consumes
  very few (some milli-Watts) of power. Wake up from that state takes
  100us or even more with power consumption much higher then normal
  operational one. Sure, if we compare it with power consumption of 100%
  CPU load, difference between 10 and 100 wakeups per second may be small,
  but when comparing to each other in some low-power environment for
  mostly idle system it may be much more significant.

 see above -- at low rates the difference is not measurable,
 at high rates thCe only obvious answer is do not use C7 unless
 if the next interrupt is due in less than 2..5 milliseconds

  (even though i hate the idea that a 1us request defaults to
  a 50us delay; but that is hopefully something that can be tuned
  in a platform-specific way and perhaps at runtime).
 
  It is 50us on this ARM.

Re: [RFC/RFT] calloutng

2013-01-02 Thread Ian Lepore

On Wed, 2013-01-02 at 15:11 +0200, Alexander Motin wrote:
 02.01.2013 14:28 пользователь Luigi Rizzo ri...@iet.unipi.it написал:
 
  On Wed, Jan 02, 2013 at 01:24:26PM +0200, Alexander Motin wrote:
   On 02.01.2013 12:57, Luigi Rizzo wrote:

  First of all, if you know that there is already a hardclock/statclock/*
  scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
  was no event scheduled in [T_X, T_X+D] so you need to generate
  a new one.
 
 That is true, but my main point was about merging with external events,
 which I can't predict and the only way to merge is increase sleep period,
 hoping for better.
 

This really is the crux of the problem, because you can't *by default*
dispatch an event earlier than requested because that's just a violation
of the usual rules of precision timing (where you expect to be late but
never early).

Sometimes there is no need for such precision, and an early wakeup is no
more or less detrimental than a late wakeup.  In fact, that may even be
the majority case.  I wonder if it might make sense to allow the
precision specification to indicate whether it needs traditional never
early behavior or whether it can be interpretted as plus or minus this
amount is fine.  Like maybe negative precision is interpretted as plus
or minus abs(precision) or something like that.

Or maybe even the other way around... you get plus or minus precision
by default and the few things that really care about precision timing
have a way of indicating that.  (But in that case the userland sleeps
would have to assume the traditional behavior because that's how they've
always been documented.)

-- Ian


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2013-01-02 Thread Luigi Rizzo

On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote:
...
  First of all, if you know that there is already a hardclock/statclock/*
  scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
  was no event scheduled in [T_X, T_X+D] so you need to generate
  a new one.
 
 That is true, but my main point was about merging with external events,
 which I can't predict and the only way to merge is increase sleep period,
 hoping for better.

ok, now i understand why you want to schedule for T_X+D.

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

cheers
luigi

-+---
  Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
  http://www.iet.unipi.it/~luigi/. Universita` di Pisa
  TEL  +39-050-2211611   . via Diotisalvi 2
  Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2013-01-02 Thread Konstantin Belousov

On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:
 On Wed, Jan 02, 2013 at 03:11:05PM +0200, Alexander Motin wrote:
 ...
   First of all, if you know that there is already a hardclock/statclock/*
   scheduled in [T_X, T_X+D] you just reuse that. This particular bullet
   was no event scheduled in [T_X, T_X+D] so you need to generate
   a new one.
  
  That is true, but my main point was about merging with external events,
  which I can't predict and the only way to merge is increase sleep period,
  hoping for better.
 
 ok, now i understand why you want to schedule for T_X+D.
 
 Probably one way to close this discussion would be to provide
 a sysctl so the sysadmin can decide which point in the interval
 to pick when there is no suitable callout already scheduled.
Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


pgpVV7IP9DZbJ.pgp
Description: PGP signature

Re: [RFC/RFT] calloutng

2013-01-02 Thread Adrian Chadd

.. I'm pretty damned sure we're going to need to enforce a never
earlier than X latency.

Is there a more detailed writeup of calloutng somewhere, besides
David's slides? The wiki page is rather empty.

Eg - I think this work does coalesce wakeups, right? Or it can? So
when in low-power scenarios you can end up with lower-resolution
callout periods, but many less CPU wakeups a second?

(Do we actually _expose_ wakeups-per-second somewhere?)



Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng


On 02.01.2013 19:09, Konstantin Belousov wrote:

On Wed, Jan 02, 2013 at 05:22:06PM +0100, Luigi Rizzo wrote:

Probably one way to close this discussion would be to provide
a sysctl so the sysadmin can decide which point in the interval
to pick when there is no suitable callout already scheduled.

Isn't trying to synchronize to the external events in this way unsafe ?
I remember, but cannot find the reference right now, a scheduler
exploit(s) which completely hide malicious thread from the time
accounting, by making it voluntary yielding right before statclock
should fire. If statistic gathering could be piggy-backed on the
external interrupt, and attacker can control the source of the external
events, wouldn't this give her a handle ?


There are many different kinds of accounting with different 
characteristics. Run time for each thread calculated using high 
resolution per-CPU clocks on each context switch. It is impossible to 
hide from it. System load average updated using callout and aligned with 
hardclock(). Hiding from it was easy before, but it can be made more 
complicated (asynchronous) now. Per-CPU SYSTEM/INTERRUPT/USER/NICE/IDLE 
counters are updated by statcklock(), that is asynchronous to 
hardclock() and hiding from it supposed to be more complicated. But even 
before it was possible, since hardclock() frequency is 8 times higher 
then statclock() one. More important for scheduling fairness thread's 
CPU percentage is also based on hardclock() and hiding from it was 
trivial before, since all sleep primitives were strictly aligned to 
hardclock(). Now it is slightly less trivial, since this alignment was 
removed and user-level APIs provide no easy way to enforce it.


The only way to get really safe accounting is forget about sampling and 
use potentially more expensive other ways. It was always stopped by lack 
of cheap and reliable clocks. But since TSC is P-state invariant on most 
of CPUs present now it could probably be reconsidered.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng


On 02.01.2013 18:08, Adrian Chadd wrote:

.. I'm pretty damned sure we're going to need to enforce a never
earlier than X latency.


Do you mean here that we should never wake up before specified time 
(just as specified by the most of existing APIs), or that we should not 
allow sleep shorter then some value to avoid DoS? At least on x86 
nanosleep(0) doesn't allow to block the system. Also there is already 
present mechanism for specifying minimum timer programming interval in 
eventtimers(9) KPI.



Is there a more detailed writeup of calloutng somewhere, besides
David's slides? The wiki page is rather empty.


There are updated manual pages in the patch. Also Davide written some 
blog during GSoC. Now we are working on papers for the AsiaBSDCon.



Eg - I think this work does coalesce wakeups, right? Or it can? So
when in low-power scenarios you can end up with lower-resolution
callout periods, but many less CPU wakeups a second?


This work does coalesce wakeups out of the box, but also provide ways to 
improve it further, where possible. With additional tuning of some 
kernel subsystems and drivers I was able to drop total idle interrupt 
rate down to 10-15Hz on arm and 20-30Hz on x86.



(Do we actually _expose_ wakeups-per-second somewhere?)


On systems with ACPI there are average per-CPU sleep times exposed via 
sysctls dev.cpu.X.cx_usage. Also cpu_idle() call rate calculated by both 
schedulers for purposes of idle loop optimizations, but it is not 
exposed outside now. Also for idle SMP system enabling COUNT_IPIS should 
give number of interrupts in systat comparable to number of wakeups. I 
am mostly using the last way.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2013-01-02 Thread Kevin Oberman

On Wed, Jan 2, 2013 at 2:06 PM, Alexander Motin m...@freebsd.org wrote:
 On 02.01.2013 18:08, Adrian Chadd wrote:

 .. I'm pretty damned sure we're going to need to enforce a never
 earlier than X latency.


 Do you mean here that we should never wake up before specified time (just as
 specified by the most of existing APIs), or that we should not allow sleep
 shorter then some value to avoid DoS? At least on x86 nanosleep(0) doesn't
 allow to block the system. Also there is already present mechanism for
 specifying minimum timer programming interval in eventtimers(9) KPI.

I can see serious performance issues with some hardware (wireless
comes to mind) if things happen too quickly. Intuition is that it
could also play hob with VMs.

I believe that the proper way is to wake between  T_X and T_X + D.
This assumes that D is max_wake_delay, not deviation, which leaves us
at the original of (T_X) = event_time = (T_X + D).
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6...@gmail.com
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-31 Thread Davide Italiano

On Sun, Dec 30, 2012 at 10:17 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
 On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
 ...
 I grabbed testsleep.c to test an arm event timer implementation, and had
 to fix a couple nits... kqueueto was missing from the names[] array, and
 I had to add a * 1000 to a couple places where usec was stuffed into a
 timespec's tv_nsec.

 I also tested the calloutng_12_17 patches and the kqueue stuff behaved
 very strangely.  Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.

 It's so sweet to be able to do precision sleeps.

 interesting numbers, but there seems to be some problem in computing
 the exact interval; delays are much larger than expected.

 In this test, the original timer code used to round to the next multiple
 of 1 tick and then add another tick (except for the kqueue case),
 which is exactly what you see in the second set of measurements.

 The calloutng code however seems to do something odd:
 in addition to fixed overhead (some 50us, which you can see in
 the tests for 1us and 300us), all delay seem to be ~10% larger
 than what is requested, upper bounded to 10ms (note, the
 numbers are averages so i cannot tell whether all samples are
 the same or there is some distribution of values).

 I am not sure if this error is peculiar of the ARM version or also
 appears on x86/amd64 but I believe it should be fixed.

 If you look at the results below:

 1us possily ok:
 for very short intervals i would expect some kind
 of 'reschedule' without actually firing a timer; maybe
 50us are what it takes to do a round through the scheduler ?

 300us   probably ok
 i guess the extra 50-90us are what it takes to do a round
 through the scheduler

 1000us  borderline (this is the case for poll and kqueue, which are
 rounded to 1ms)
 here intervals seem to be increased by 10%, and i cannot see
 a good reason for this (more below).

 3000us and above: wrong
 here again, the intervals seem to be 10% larger than what is
 requested, perhaps limiting the error to 10-20ms.


 Maybe the 10% extension results from creating a default 'precision'
 for legacy calls, but i do not think this is done correctly.

 First of all, if users do not specify a precision themselves, the
 automatically generated value should never exceed one tick.

 Second, the only point of a 'precision' parameter is to merge
 requests that may be close in time, so if there is already a
 timer scheduled within [Treq, Treq+precision] i will get it;
 but if there no pending timer, then one should schedule it
 for the requested interval.

 Davide/Alexander, any ideas ?


The first question is what couple timecounter/eventtimer is used.
Some hardware is slower than other, and this may affect numbers.

Thanks for testing, BTW.

Davide

 cheers
 luigi

 for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
 done


 With calloutng_12_26.patch...

 HZ=100   HZ=250   HZ=1000
 --   
 select  1 55.79  1 50.96  1 61.32
 poll1   1109.46  1   1107.86  1   1114.38
 usleep  1 56.33  1 72.90  1 62.78
 nanosleep   1 52.66  1 55.23  1 64.23
 kqueue  1   1114.23  1   1113.81  1   1121.21
 kqueueto1 65.44  1 71.00  1 75.01
 syscall 1  4.70  1  4.45  1  4.55
 select300355.79300357.76300362.35
 poll  300   1107.85300   1122.55300   1115.62
 usleep300355.28300357.28300360.79
 nanosleep 300354.49300355.82300360.62
 kqueue300   1112.57300   1118.13300   1117.16
 kqueueto  300375.98300378.62300395.61
 syscall   300  4.41300  4.45300  4.54
 select   3000   3246.75   3000   3246.74   3000   3252.72
 poll 3000   3238.10   3000   3229.12   3000   3250.10
 usleep   3000   3242.47   3000   3237.06   3000   3249.61
 nanosleep3000   3238.79   3000   3231.55   3000   3248.11
 kqueue   3000   3240.01   3000   3236.07   3000   3247.60
 kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
 syscall  3000  4.69   3000  4.44   3000  4.50
 select  3  31714.60  3  31941.17  3  32467.69
 poll3  31522.76  3  31983.00  3  32497.81
 usleep  3

Re: [RFC/RFT] calloutng

2012-12-31 Thread Alexander Motin


On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.


I've rewritten kqueue timeouts at the calloutng_12_26.patch.


Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at 
1us is probably sum of minimal latency of the hardware eventtimer on the 
specific platform and some software processing overhead (syscall, 
callout, timecouters, scheduler, etc). At later points system starts to 
noticeably use precision specified by kern.timecounter.alloweddeviation 
sysctl. It affects results from two sides: 1) extending intervals for 
specified percent of time to allow event aggregation, and 2) choosing 
time base between fast getbinuptime() and precise binuptime(). Extending 
interval is needed to aggregate not only callouts with each other, but 
also callouts with other system events, which are impossible to schedule 
in advance. It gives specified relative error, but no more then one CPU 
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) 
it is 1/hz, for completely idle one it can be up to 0.5s. Second point 
allows to reduce processing overhead by the cost of error up to 1/hz for 
long periods ((100/allowed)*(1/hz)), when it is used.


To get best possible precision kern.timecounter.alloweddeviation sysctl 
can be set to smaller value. Setting it to 0 will effectively disable 
all optimizations, but should give 50us precision in all cases.



for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
done


With calloutng_12_26.patch...

 HZ=100   HZ=250   HZ=1000
--   
select  1 55.79  1 50.96  1 61.32
poll1   1109.46  1   1107.86  1   1114.38
usleep  1 56.33  1 72.90  1 62.78
nanosleep   1 52.66  1 55.23  1 64.23
kqueue  1   1114.23  1   1113.81  1   1121.21
kqueueto1 65.44  1 71.00  1 75.01
syscall 1  4.70  1  4.45  1  4.55
select300355.79300357.76

Re: [RFC/RFT] calloutng

2012-12-31 Thread Ian Lepore

On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote:
 On 31.12.2012 08:17, Luigi Rizzo wrote:
  On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
  ...
  I grabbed testsleep.c to test an arm event timer implementation, and had
  to fix a couple nits... kqueueto was missing from the names[] array, and
  I had to add a * 1000 to a couple places where usec was stuffed into a
  timespec's tv_nsec.
 
  I also tested the calloutng_12_17 patches and the kqueue stuff behaved
  very strangely.
 
 I've rewritten kqueue timeouts at the calloutng_12_26.patch.
 
  Then I noticed you had a 12_26 patchset so I tested
  that (after crudely fixing a couple uninitialized var warnings), and it
  all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
  It's so sweet to be able to do precision sleeps.
 
 Thank you for testing, Ian.
 
  interesting numbers, but there seems to be some problem in computing
  the exact interval; delays are much larger than expected.
 
  In this test, the original timer code used to round to the next multiple
  of 1 tick and then add another tick (except for the kqueue case),
  which is exactly what you see in the second set of measurements.
 
  The calloutng code however seems to do something odd:
  in addition to fixed overhead (some 50us, which you can see in
  the tests for 1us and 300us), all delay seem to be ~10% larger
  than what is requested, upper bounded to 10ms (note, the
  numbers are averages so i cannot tell whether all samples are
  the same or there is some distribution of values).
 
  I am not sure if this error is peculiar of the ARM version or also
  appears on x86/amd64 but I believe it should be fixed.
 
  If you look at the results below:
 
  1us possily ok:
  for very short intervals i would expect some kind
  of 'reschedule' without actually firing a timer; maybe
  50us are what it takes to do a round through the scheduler ?
 
  300us   probably ok
  i guess the extra 50-90us are what it takes to do a round
  through the scheduler
 
  1000us  borderline (this is the case for poll and kqueue, which are
  rounded to 1ms)
  here intervals seem to be increased by 10%, and i cannot see
  a good reason for this (more below).
 
  3000us and above: wrong
  here again, the intervals seem to be 10% larger than what is
  requested, perhaps limiting the error to 10-20ms.
 
 
  Maybe the 10% extension results from creating a default 'precision'
  for legacy calls, but i do not think this is done correctly.
 
  First of all, if users do not specify a precision themselves, the
  automatically generated value should never exceed one tick.
 
  Second, the only point of a 'precision' parameter is to merge
  requests that may be close in time, so if there is already a
  timer scheduled within [Treq, Treq+precision] i will get it;
  but if there no pending timer, then one should schedule it
  for the requested interval.
 
  Davide/Alexander, any ideas ?
 
 All mentioned effects could be explained with implemented logic. 50us at 
 1us is probably sum of minimal latency of the hardware eventtimer on the 
 specific platform and some software processing overhead (syscall, 
 callout, timecouters, scheduler, etc). At later points system starts to 
 noticeably use precision specified by kern.timecounter.alloweddeviation 
 sysctl. It affects results from two sides: 1) extending intervals for 
 specified percent of time to allow event aggregation, and 2) choosing 
 time base between fast getbinuptime() and precise binuptime(). Extending 
 interval is needed to aggregate not only callouts with each other, but 
 also callouts with other system events, which are impossible to schedule 
 in advance. It gives specified relative error, but no more then one CPU 
 wakeup period in absolute: for busy CPU (not skipping hardclock() ticks) 
 it is 1/hz, for completely idle one it can be up to 0.5s. Second point 
 allows to reduce processing overhead by the cost of error up to 1/hz for 
 long periods ((100/allowed)*(1/hz)), when it is used.
 
 To get best possible precision kern.timecounter.alloweddeviation sysctl 
 can be set to smaller value. Setting it to 0 will effectively disable 
 all optimizations, but should give 50us precision in all cases.
 
  for t in 1 300 3000 3 30 ; do
 for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
   ./testsleep $t $m
 done
  done
 
  [test results snipped]
 

I should have posted some information about the test platform...  It's a
single-processor 700mhz arm chip.  There was essentially nothing else
running during the tests other than mostly-idle daemons (sshd, ntpd, the
usual suspects).  Kernel debugging options off (INVARIANTS[_SUPPORT],
DIAGNOSTIC, and WITNESS).  

Some sysctl values of interest...

rpi# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100)

Re: [RFC/RFT] calloutng

2012-12-31 Thread Alexander Motin


On 31.12.2012 17:02, Ian Lepore wrote:

On Mon, 2012-12-31 at 12:17 +0200, Alexander Motin wrote:

On 31.12.2012 08:17, Luigi Rizzo wrote:

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.


I've rewritten kqueue timeouts at the calloutng_12_26.patch.


Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.


Thank you for testing, Ian.


interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?


All mentioned effects could be explained with implemented logic. 50us at
1us is probably sum of minimal latency of the hardware eventtimer on the
specific platform and some software processing overhead (syscall,
callout, timecouters, scheduler, etc). At later points system starts to
noticeably use precision specified by kern.timecounter.alloweddeviation
sysctl. It affects results from two sides: 1) extending intervals for
specified percent of time to allow event aggregation, and 2) choosing
time base between fast getbinuptime() and precise binuptime(). Extending
interval is needed to aggregate not only callouts with each other, but
also callouts with other system events, which are impossible to schedule
in advance. It gives specified relative error, but no more then one CPU
wakeup period in absolute: for busy CPU (not skipping hardclock() ticks)
it is 1/hz, for completely idle one it can be up to 0.5s. Second point
allows to reduce processing overhead by the cost of error up to 1/hz for
long periods ((100/allowed)*(1/hz)), when it is used.

To get best possible precision kern.timecounter.alloweddeviation sysctl
can be set to smaller value. Setting it to 0 will effectively disable
all optimizations, but should give 50us precision in all cases.


for t in 1 300 3000 3 30 ; do
for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
  ./testsleep $t $m
done
done

[test results snipped]




I should have posted some information about the test platform...  It's a
single-processor 700mhz arm chip.  There was essentially nothing else
running during the tests other than mostly-idle daemons (sshd, ntpd, the
usual suspects).  Kernel debugging options off (INVARIANTS[_SUPPORT],
DIAGNOSTIC, and WITNESS).

Some sysctl values of interest...

rpi# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: BCM2835 Timecounter(1000) dummy(-100)
kern.timecounter.hardware: BCM2835 Timecounter
kern.timecounter.alloweddeviation: 5

Re: [RFC/RFT] calloutng

2012-12-30 Thread Ian Lepore

On Wed, 2012-12-26 at 21:24 +0200, Alexander Motin wrote:
 On 26.12.2012 01:21, Marius Strobl wrote:
  On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With several important changes made last time I am going to delay commit
  to HEAD for another week to do more testing. Comments and new test cases
  are welcome. Thanks for staying tuned and commenting.
 
  FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
  try on sparc64 and it at least survives a buildworld there. However,
  with the patched kernels, buildworld times seem to increase slightly but
  reproducible by 1-2% (I only did four runs but typically buildworld
  times are rather stable and don't vary more than a minute for the
  same kernel and source here). Is this an expected trade-off (system
  time as such doesn't seem to increase)?
 
 I don't think build process uses significant number of callouts to 
 affect results directly. I think this additional time could be result of 
 the deeper next event look up, done by the new code, that is practically 
 useless for sparc64, which effectively has no cpu_idle() routine. It 
 wouldn't affect system time and wouldn't show up in any statistics 
 (except PMC or something alike) because it is executed inside timer 
 hardware interrupt handler. If my guess is right, that is a part that 
 probably still could be optimized. I'll look on it. Thanks.
 
  Is there anything specific to test?
 
 Since the most of code is MI, for sparc64 I would mostly look on related 
 MD parts (eventtimers and timecounters) to make sure they are working 
 reliably in more stressful conditions.  I still have some worries about 
 possible deadlock on hardware where IPIs are used to fetch present time 
 from other CPU.
 
 Here is small tool we are using for test correctness and performance of 
 different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
 

I grabbed testsleep.c to test an arm event timer implementation, and had
to fix a couple nits... kqueueto was missing from the names[] array, and
I had to add a * 1000 to a couple places where usec was stuffed into a
timespec's tv_nsec.

I also tested the calloutng_12_17 patches and the kqueue stuff behaved
very strangely.  Then I noticed you had a 12_26 patchset so I tested
that (after crudely fixing a couple uninitialized var warnings), and it
all looks good on this arm (Raspberry Pi).  I'll attach the results.

It's so sweet to be able to do precision sleeps.

-- Ian


for t in 1 300 3000 3 30 ; do
  for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
./testsleep $t $m
  done
done


With calloutng_12_26.patch...

HZ=100   HZ=250   HZ=1000
--   
select  1 55.79  1 50.96  1 61.32
poll1   1109.46  1   1107.86  1   1114.38
usleep  1 56.33  1 72.90  1 62.78
nanosleep   1 52.66  1 55.23  1 64.23
kqueue  1   1114.23  1   1113.81  1   1121.21
kqueueto1 65.44  1 71.00  1 75.01
syscall 1  4.70  1  4.45  1  4.55
select300355.79300357.76300362.35
poll  300   1107.85300   1122.55300   1115.62
usleep300355.28300357.28300360.79
nanosleep 300354.49300355.82300360.62
kqueue300   1112.57300   1118.13300   1117.16
kqueueto  300375.98300378.62300395.61
syscall   300  4.41300  4.45300  4.54
select   3000   3246.75   3000   3246.74   3000   3252.72
poll 3000   3238.10   3000   3229.12   3000   3250.10
usleep   3000   3242.47   3000   3237.06   3000   3249.61
nanosleep3000   3238.79   3000   3231.55   3000   3248.11
kqueue   3000   3240.01   3000   3236.07   3000   3247.60
kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
syscall  3000  4.69   3000  4.44   3000  4.50
select  3  31714.60  3  31941.17  3  32467.69
poll3  31522.76  3  31983.00  3  32497.81
usleep  3  31459.67  3  31980.76  3  32458.71
nanosleep

Re: [RFC/RFT] calloutng

2012-12-30 Thread Luigi Rizzo

On Sun, Dec 30, 2012 at 04:13:43PM -0700, Ian Lepore wrote:
...
 I grabbed testsleep.c to test an arm event timer implementation, and had
 to fix a couple nits... kqueueto was missing from the names[] array, and
 I had to add a * 1000 to a couple places where usec was stuffed into a
 timespec's tv_nsec.
 
 I also tested the calloutng_12_17 patches and the kqueue stuff behaved
 very strangely.  Then I noticed you had a 12_26 patchset so I tested
 that (after crudely fixing a couple uninitialized var warnings), and it
 all looks good on this arm (Raspberry Pi).  I'll attach the results.
 
 It's so sweet to be able to do precision sleeps.

interesting numbers, but there seems to be some problem in computing
the exact interval; delays are much larger than expected.

In this test, the original timer code used to round to the next multiple
of 1 tick and then add another tick (except for the kqueue case),
which is exactly what you see in the second set of measurements.

The calloutng code however seems to do something odd:
in addition to fixed overhead (some 50us, which you can see in
the tests for 1us and 300us), all delay seem to be ~10% larger
than what is requested, upper bounded to 10ms (note, the
numbers are averages so i cannot tell whether all samples are
the same or there is some distribution of values).

I am not sure if this error is peculiar of the ARM version or also
appears on x86/amd64 but I believe it should be fixed.

If you look at the results below:

1us possily ok:
for very short intervals i would expect some kind
of 'reschedule' without actually firing a timer; maybe
50us are what it takes to do a round through the scheduler ?

300us   probably ok
i guess the extra 50-90us are what it takes to do a round
through the scheduler

1000us  borderline (this is the case for poll and kqueue, which are
rounded to 1ms)
here intervals seem to be increased by 10%, and i cannot see
a good reason for this (more below).

3000us and above: wrong
here again, the intervals seem to be 10% larger than what is
requested, perhaps limiting the error to 10-20ms.


Maybe the 10% extension results from creating a default 'precision'
for legacy calls, but i do not think this is done correctly.

First of all, if users do not specify a precision themselves, the
automatically generated value should never exceed one tick.

Second, the only point of a 'precision' parameter is to merge
requests that may be close in time, so if there is already a
timer scheduled within [Treq, Treq+precision] i will get it;
but if there no pending timer, then one should schedule it
for the requested interval.

Davide/Alexander, any ideas ?

cheers
luigi

 for t in 1 300 3000 3 30 ; do
   for m in select poll usleep nanosleep kqueue kqueueto syscall ; do
 ./testsleep $t $m
   done
 done
 
 
 With calloutng_12_26.patch...
 
 HZ=100   HZ=250   HZ=1000
 --   
 select  1 55.79  1 50.96  1 61.32
 poll1   1109.46  1   1107.86  1   1114.38
 usleep  1 56.33  1 72.90  1 62.78
 nanosleep   1 52.66  1 55.23  1 64.23
 kqueue  1   1114.23  1   1113.81  1   1121.21
 kqueueto1 65.44  1 71.00  1 75.01
 syscall 1  4.70  1  4.45  1  4.55
 select300355.79300357.76300362.35
 poll  300   1107.85300   1122.55300   1115.62
 usleep300355.28300357.28300360.79
 nanosleep 300354.49300355.82300360.62
 kqueue300   1112.57300   1118.13300   1117.16
 kqueueto  300375.98300378.62300395.61
 syscall   300  4.41300  4.45300  4.54
 select   3000   3246.75   3000   3246.74   3000   3252.72
 poll 3000   3238.10   3000   3229.12   3000   3250.10
 usleep   3000   3242.47   3000   3237.06   3000   3249.61
 nanosleep3000   3238.79   3000   3231.55   3000   3248.11
 kqueue   3000   3240.01   3000   3236.07   3000   3247.60
 kqueueto 3000   3265.36   3000   3267.22   3000   3274.96
 syscall  3000  4.69   3000  4.44   3000  4.50
 select  3  31714.60  3  31941.17  3  32467.69
 poll3  31522.76  3  31983.00  3  32497.81
 usleep  3  31459.67  3  31980.76  3  32458.71
 nanosleep   3  31431.02  3  31982.22  3  32525.20
 kqueue  3  31466.75  3  31873.90  3  31973.54
 kqueueto3  31564.67  3  32522.35  3  32475.59
 syscall

Re: [RFC/RFT] calloutng

2012-12-26 Thread Alexander Motin


On 26.12.2012 01:21, Marius Strobl wrote:

On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:

Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch

With several important changes made last time I am going to delay commit
to HEAD for another week to do more testing. Comments and new test cases
are welcome. Thanks for staying tuned and commenting.


FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
try on sparc64 and it at least survives a buildworld there. However,
with the patched kernels, buildworld times seem to increase slightly but
reproducible by 1-2% (I only did four runs but typically buildworld
times are rather stable and don't vary more than a minute for the
same kernel and source here). Is this an expected trade-off (system
time as such doesn't seem to increase)?


I don't think build process uses significant number of callouts to 
affect results directly. I think this additional time could be result of 
the deeper next event look up, done by the new code, that is practically 
useless for sparc64, which effectively has no cpu_idle() routine. It 
wouldn't affect system time and wouldn't show up in any statistics 
(except PMC or something alike) because it is executed inside timer 
hardware interrupt handler. If my guess is right, that is a part that 
probably still could be optimized. I'll look on it. Thanks.



Is there anything specific to test?


Since the most of code is MI, for sparc64 I would mostly look on related 
MD parts (eventtimers and timecounters) to make sure they are working 
reliably in more stressful conditions.  I still have some worries about 
possible deadlock on hardware where IPIs are used to fetch present time 
from other CPU.


Here is small tool we are using for test correctness and performance of 
different user-level APIs: http://people.freebsd.org/~mav/testsleep.c


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-25 Thread Marius Strobl

On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
 Experiments with dummynet shown ineffective support for very short 
 tick-based callouts. New version fixes that, allowing to get as many 
 tick-based callout events as hz value permits, while still be able to 
 aggregate events and generating minimum of interrupts.
 
 Also this version modifies system load average calculation to fix some 
 cases existing in HEAD and 9 branches, that could be fixed with new 
 direct callout functionality.
 
 http://people.freebsd.org/~mav/calloutng_12_17.patch
 
 With several important changes made last time I am going to delay commit 
 to HEAD for another week to do more testing. Comments and new test cases 
 are welcome. Thanks for staying tuned and commenting.

FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
try on sparc64 and it at least survives a buildworld there. However,
with the patched kernels, buildworld times seem to increase slightly but
reproducible by 1-2% (I only did four runs but typically buildworld
times are rather stable and don't vary more than a minute for the
same kernel and source here). Is this an expected trade-off (system
time as such doesn't seem to increase)?
Is there anything specific to test?

Marius

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-22 Thread Jilles Tjoelker

On Wed, Dec 19, 2012 at 11:00:06AM +, Poul-Henning Kamp wrote:
 
 In message 50d192e8.3020...@freebsd.org, Alexander Motin writes:

 Linux uses 32.32 format in their eventtimers code.

 (And that is no accident, I know who they got the number from :-)

 But if at some point we want to be able to 
 handle absolute wall time, [...]

 Then you have other problems, including but not limited to clock
 being stepped, leap-seconds, suspend/resume and frequency stability.

 If you want to support callouts of the type (At 14:00 UTC tomorrow)
 (disregarding the time-zone issue), you need to catch all significant
 changes to our UTC estimate and recalibrate your callout based on
 that.

 It is not obvious that we have applications for such an API that
 warrant the complexity.

 Either way, such a facility should be layered on top of the callout
 facility, which should always run in elapsed time[1] with no attention
 paid to what NTPD might do to the UTC estimate.

POSIX specifies functions that assume such a facility exists, although
applications may not care much if we implement them incorrectly.

For example, pthread_mutex_timedlock() and sem_timedwait() shall time
out when the CLOCK_REALTIME clock reaches the given value, and
pthread_cond_timedwait() and clock_nanosleep() (with TIMER_ABSTIME)
shall time out when the specified clock reaches the given value.

 So summary: 32.32 is the right format.

 Poul-Henning

 [1] Notice that elapsed time needs a firm definition with respect
 to suspend/resume, and that this decision has big implications
 for the API use and code duplication.

 I think it prudent to specify a flag to callouts, to tell what
 should happen on suspend/resume, something like:

   SR_CANCEL   /* Cancel the callout on S/R */
   /* no flag* /* Toll this callout only when system is running */
   SR_IGNORE   /* Toll suspended time from callout */

 If you get this right, callouts from device drivers will just DTRT,
 if you get it wrong, all device drivers will need boilerplate code
 to handle S/R

Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC
vs CLOCK_UPTIME.

-- 
Jilles Tjoelker
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-22 Thread Poul-Henning Kamp


In message 2012125025.ga46...@stack.nl, Jilles Tjoelker writes:

 Either way, such a facility should be layered on top of the callout
 facility, which should always run in elapsed time[1] with no attention
 paid to what NTPD might do to the UTC estimate.

POSIX specifies functions that assume such a facility exists, although
applications may not care much if we implement them incorrectly.

It should still be implemented op top of callouts, not as part of:
it is an entirely different thing to try to do right.

 I think it prudent to specify a flag to callouts, to tell what
 should happen on suspend/resume, something like:

  SR_CANCEL   /* Cancel the callout on S/R */
  /* no flag* /* Toll this callout only when system is running */
  SR_IGNORE   /* Toll suspended time from callout */

 If you get this right, callouts from device drivers will just DTRT,
 if you get it wrong, all device drivers will need boilerplate code
 to handle S/R

Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC
vs CLOCK_UPTIME.

I have _no_ idea what you are trying to say here...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-21 Thread Fabian Keil

Fabian Keil freebsd-lis...@fabiankeil.de wrote:

 Alexander Motin m...@freebsd.org wrote:
 
  On 20.12.2012 15:26, Fabian Keil wrote:
   Alexander Motin m...@freebsd.org wrote:
  
   On 20.12.2012 12:56, Fabian Keil wrote:
   Alexander Motin m...@freebsd.org wrote:
  
   Experiments with dummynet shown ineffective support for very short
   tick-based callouts. New version fixes that, allowing to get as many
   tick-based callout events as hz value permits, while still be able to
   aggregate events and generating minimum of interrupts.
  
   Also this version modifies system load average calculation to fix some
   cases existing in HEAD and 9 branches, that could be fixed with new
   direct callout functionality.
  
   http://people.freebsd.org/~mav/calloutng_12_17.patch
  
   With this patch (and the previous one, I didn't test the others)
   my mouse cursor is occasionally not reacting for short amounts of
   time (less than a second, but long enough to be noticeable).
 
   Could you try to revert part of the patch, related to dev/atkbdc? I am
   not strong in details of that hardware, but in comments there mention
   that they are related. May be lost keyboard interrupts (which polling
   rate was increased to 1 second) cause PS/2 mouse delays.
  
   I reverted the changes to sys/dev/atkbdc/* about an hour ago
   and so far it's looking good. I'll report back tomorrow after
   some more testing.

Still looking good.

Fabian


signature.asc
Description: PGP signature

Re: [RFC/RFT] calloutng

2012-12-21 Thread Alexander Motin


On 21.12.2012 14:16, Fabian Keil wrote:

Fabian Keil freebsd-lis...@fabiankeil.de wrote:


Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 15:26, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).



Could you try to revert part of the patch, related to dev/atkbdc? I am
not strong in details of that hardware, but in comments there mention
that they are related. May be lost keyboard interrupts (which polling
rate was increased to 1 second) cause PS/2 mouse delays.


I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.


Still looking good.


Thank you. I think it may be some locking issue in atkbdc code. I'll 
revert that part of the patch until somebody rewrite it properly.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil

Alexander Motin m...@freebsd.org wrote:

 Experiments with dummynet shown ineffective support for very short 
 tick-based callouts. New version fixes that, allowing to get as many 
 tick-based callout events as hz value permits, while still be able to 
 aggregate events and generating minimum of interrupts.
 
 Also this version modifies system load average calculation to fix some 
 cases existing in HEAD and 9 branches, that could be fixed with new 
 direct callout functionality.
 
 http://people.freebsd.org/~mav/calloutng_12_17.patch

With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.

I'm not sure what additional information is necessary to debug this,
so here a bunch of sysctl values that may or may not be relevant:

fk@r500 ~ $sysctl kern.eventtimer kern.timecounter
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 440
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 440
kern.eventtimer.et.HPET3.flags: 3
kern.eventtimer.et.HPET3.frequency: 14318180
kern.eventtimer.et.HPET3.quality: 440
kern.eventtimer.choice: HPET(450) HPET1(440) HPET2(440) HPET3(440) i8254(100)
kern.eventtimer.singlemul: 2
kern.eventtimer.idletick: 0
kern.eventtimer.activetick: 1
kern.eventtimer.timer: HPET
kern.eventtimer.periodic: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 25970
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.HPET.counter: 3963519587
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 7323739
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 454465294
kern.timecounter.tc.TSC.frequency: 1995040520
kern.timecounter.tc.TSC.quality: -1000
kern.timecounter.stepwarnings: 0
kern.timecounter.hardware: HPET
kern.timecounter.choice: TSC(-1000) ACPI-fast(900) HPET(950) i8254(0) 
dummy(-100)
kern.timecounter.tick: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.smp_tsc: 0

The system is a Lenovo R500 with a
Intel(R) Core(TM)2 Duo CPU T5870  @ 2.00GHz (1995.04-MHz K8-class CPU)

Fabian


signature.asc
Description: PGP signature

Re: [RFC/RFT] calloutng

2012-12-20 Thread Alexander Motin


On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.


Could you try to revert part of the patch, related to dev/atkbdc? I am 
not strong in details of that hardware, but in comments there mention 
that they are related. May be lost keyboard interrupts (which polling 
rate was increased to 1 second) cause PS/2 mouse delays.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil

Alexander Motin m...@freebsd.org wrote:

 On 20.12.2012 12:56, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With this patch (and the previous one, I didn't test the others)
  my mouse cursor is occasionally not reacting for short amounts of
  time (less than a second, but long enough to be noticeable).
 
  Every now and then the window manager (i3-wm) changes window focus
  which could be explained by either phantom keyboard or mouse input,
  or terminal lines are marked as if the cursor was moved with the
  left button pressed.
 
  The problems happen a couple of times per hour but I haven't
  been able to intentionally reproduce them. They only seem to
  occur while I'm moving the cursor, but of course I wouldn't
  otherwise notice a unresponsive cursor anyway.
 
  While the cursor is unresponsive, keyboard input and the rest
  of the system works as expected as far as I can tell.
 
  If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
  message once per second when not moving the mouse, however that
  also happens without the patch and thus might be unrelated.
 
  I'm using moused.
 
 Could you try to revert part of the patch, related to dev/atkbdc? I am 
 not strong in details of that hardware, but in comments there mention 
 that they are related. May be lost keyboard interrupts (which polling 
 rate was increased to 1 second) cause PS/2 mouse delays.
 
I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.

Fabian


signature.asc
Description: PGP signature

Re: [RFC/RFT] calloutng

2012-12-20 Thread Alexander Motin


On 20.12.2012 15:26, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


On 20.12.2012 12:56, Fabian Keil wrote:

Alexander Motin m...@freebsd.org wrote:


Experiments with dummynet shown ineffective support for very short
tick-based callouts. New version fixes that, allowing to get as many
tick-based callout events as hz value permits, while still be able to
aggregate events and generating minimum of interrupts.

Also this version modifies system load average calculation to fix some
cases existing in HEAD and 9 branches, that could be fixed with new
direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch


With this patch (and the previous one, I didn't test the others)
my mouse cursor is occasionally not reacting for short amounts of
time (less than a second, but long enough to be noticeable).

Every now and then the window manager (i3-wm) changes window focus
which could be explained by either phantom keyboard or mouse input,
or terminal lines are marked as if the cursor was moved with the
left button pressed.

The problems happen a couple of times per hour but I haven't
been able to intentionally reproduce them. They only seem to
occur while I'm moving the cursor, but of course I wouldn't
otherwise notice a unresponsive cursor anyway.

While the cursor is unresponsive, keyboard input and the rest
of the system works as expected as far as I can tell.

If I set debug.psm.loglevel=4 I get a psm0: lost interrupt?
message once per second when not moving the mouse, however that
also happens without the patch and thus might be unrelated.

I'm using moused.


Could you try to revert part of the patch, related to dev/atkbdc? I am
not strong in details of that hardware, but in comments there mention
that they are related. May be lost keyboard interrupts (which polling
rate was increased to 1 second) cause PS/2 mouse delays.


I reverted the changes to sys/dev/atkbdc/* about an hour ago
and so far it's looking good. I'll report back tomorrow after
some more testing.


Thank you for the report. If it will be fine. you can try to reapply 
that part of the patch, just changing line:

callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0));
to the:
callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0));

It should about to restore original polling interval, but still make it 
more flexible then original.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-20 Thread Fabian Keil

Alexander Motin m...@freebsd.org wrote:

 On 20.12.2012 15:26, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  On 20.12.2012 12:56, Fabian Keil wrote:
  Alexander Motin m...@freebsd.org wrote:
 
  Experiments with dummynet shown ineffective support for very short
  tick-based callouts. New version fixes that, allowing to get as many
  tick-based callout events as hz value permits, while still be able to
  aggregate events and generating minimum of interrupts.
 
  Also this version modifies system load average calculation to fix some
  cases existing in HEAD and 9 branches, that could be fixed with new
  direct callout functionality.
 
  http://people.freebsd.org/~mav/calloutng_12_17.patch
 
  With this patch (and the previous one, I didn't test the others)
  my mouse cursor is occasionally not reacting for short amounts of
  time (less than a second, but long enough to be noticeable).

  Could you try to revert part of the patch, related to dev/atkbdc? I am
  not strong in details of that hardware, but in comments there mention
  that they are related. May be lost keyboard interrupts (which polling
  rate was increased to 1 second) cause PS/2 mouse delays.
 
  I reverted the changes to sys/dev/atkbdc/* about an hour ago
  and so far it's looking good. I'll report back tomorrow after
  some more testing.

 Thank you for the report. If it will be fine. you can try to reapply 
 that part of the patch, just changing line:
 callout_reset_flags(sc-callout, hz, atkbdtimeout, dev, C_PRELSET(0));
 to the:
 callout_reset_flags(sc-callout, hz/10, atkbdtimeout, dev, C_PRELSET(0));
 
 It should about to restore original polling interval, but still make it 
 more flexible then original.

With this change the mouse stops working completely after moving
the cursor a couple of pixels, at which point dmesg shows:
 
kbdc: TEST_AUX_PORT status:
kbdc: RESET_AUX return code:00fa
kbdc: RESET_AUX status:00aa
kbdc: RESET_AUX ID:

Using the keyboard frequently causes duplicated characters.

I'm aware of a problem in vanilla CURRENT where the mouse stops working
with similar looking messages when the cursor is being moved at the wrong
moment, but this happens less than once a week and I haven't been able to
track it down yet. It's possible that it's the same bug and your patch just
triggers it more reliable (two times in a row). I haven't seen the
duplicated-characters issue before, though.

Just for kicks I also tried:

callout_reset_flags(sc-callout, hz/100, atkbdtimeout, dev, C_PRELSET(0));

but with this modification I can't even enter my login password
and thus can't tell if the mouse would work.

Fabian


signature.asc
Description: PGP signature

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin


On 19.12.2012 01:37, Ian Lepore wrote:

On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:

On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:

On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.


...

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.  It would
typical to want a 500uS timeout but be willing to late by up to 250uS if


i said k is signed so negative values represent fractions of a
second. 2^-128 is pretty short :)


Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
further review, then could precision be encoded in 8 bits of the flags,
eliminating another parm?


Those who tracked the branch could see that actually was our first 
approach to handle precision. Unfortunately, it appeared not very 
convenient when you need to convert relative precision in percents or 
fraction of interval to the absolute precision in power of 2. We were 
worried that using some ffsll() for it can be inconvenient or expensive. 
But since we are now talking about passing relative bintime as an 
argument, that may be more viable option. I'll make another try.


Thanks for the input. Pity it didn't happen couple of months ago.

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes
:
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.

It is a bad idea, and it is physically pointless, given the stabilities
of the timebases available for computers in general.

Please just take my word as a time-nut, and use a 32.32 binary format
in seconds (see previous email) and be done with it.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano

On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote:
 
 In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore 
 writes
 :
On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.

 It is a bad idea, and it is physically pointless, given the stabilities
 of the timebases available for computers in general.

 Please just take my word as a time-nut, and use a 32.32 binary format
 in seconds (see previous email) and be done with it.

 --
 Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
 p...@freebsd.org | TCP/IP since RFC 956
 FreeBSD committer   | BSD since 4.3-tahoe
 Never attribute to malice what can adequately be explained by incompetence.

Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.
I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code
duplication for all the basic conversion/math operation. On the other
hand, 32.32 may not be enough in the long future.
What do you think about that?

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin


On 19.12.2012 12:03, Davide Italiano wrote:

On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp p...@phk.freebsd.dk wrote:


In message 1355873265.1198.183.ca...@revolution.hippie.lan, Ian Lepore writes
:

On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:



I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.


It is a bad idea, and it is physically pointless, given the stabilities
of the timebases available for computers in general.

Please just take my word as a time-nut, and use a 32.32 binary format
in seconds (see previous email) and be done with it.


Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.
I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code
duplication for all the basic conversion/math operation. On the other
hand, 32.32 may not be enough in the long future.
What do you think about that?


Linux uses 32.32 format in their eventtimers code. Respecting that now 
we have no any timer hardware with frequency about 4GHz, that precision 
is probably sufficient. But if at some point we want to be able to 
handle absolute wall time, then 32bit integer part may be not enough. 
Then we return to the question: how many different data types do we 
want to see in one subsystem? Sure, using single 64bit value would be 
much easier then struct bintime from many perspectives, but what's about 
the edge cases?


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
, Davide Italiano writes:

Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.

And that is way overkill for specifying a callout, at best your clock
has short term stabilities approaching 1e-8, but likely as bad as 1e-6.

(The reason why bintime is important for timekeeping is that we
accumulate timeintervals approx 1e3 times a second, so the rounding
error has to be much smaller than the short term stability in order
to not dominate)

I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code

No, that was exactly my point:  It should be an integer so that
comparisons and arithmetic is trivial.   A 32.32 format fits
nicely into a int64_t which is readily available in the language.

As I said in my previous email:


typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or 
callout_foo(500 * DURNSEC)

With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 50d192e8.3020...@freebsd.org, Alexander Motin writes:

Linux uses 32.32 format in their eventtimers code.

(And that is no accident, I know who they got the number from :-)

But if at some point we want to be able to 
handle absolute wall time, [...]

Then you have other problems, including but not limited to clock
being stepped, leap-seconds, suspend/resume and frequency stability.

If you want to support callouts of the type (At 14:00 UTC tomorrow)
(disregarding the time-zone issue), you need to catch all significant
changes to our UTC estimate and recalibrate your callout based on
that.

It is not obvious that we have applications for such an API that
warrant the complexity.

Either way, such a facility should be layered on top of the callout
facility, which should always run in elapsed time[1] with no attention
paid to what NTPD might do to the UTC estimate.

So summary: 32.32 is the right format.

Poul-Henning

[1] Notice that elapsed time needs a firm definition with respect
to suspend/resume, and that this decision has big implications
for the API use and code duplication.

I think it prudent to specify a flag to callouts, to tell what
should happen on suspend/resume, something like:

SR_CANCEL   /* Cancel the callout on S/R */
/* no flag* /* Toll this callout only when system is running */
SR_IGNORE   /* Toll suspended time from callout */

If you get this right, callouts from device drivers will just DTRT,
if you get it wrong, all device drivers will need boilerplate code
to handle S/R

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
, Davide Italiano writes:


Right now -- the precision is specified in 'bintime', which is a binary number.
It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
the specific platform.


And that is way overkill for specifying a callout, at best your clock
has short term stabilities approaching 1e-8, but likely as bad as 1e-6.


So you always agreed with me that bintimes are unsuitable for almost
everything, and especially unsuitable for timeouts? :-)


(The reason why bintime is important for timekeeping is that we
accumulate timeintervals approx 1e3 times a second, so the rounding
error has to be much smaller than the short term stability in order
to not dominate)


bintimes are not unsuitable for timekeeping, but they a painful to use
for other APIs.  You have to either put bintimes in layers in the other
APIs, or convert them to a more suitable format, and there is a problem
placing the conversion at points where it is efficient.  This thread
seems to be mostly about putting the conversion in wrong places.  My
original objection was about using bintimes for almost everything at
the implementation level.


I do not really think it worth to create another structure for
handling time (e.g. struct bintime32), as it will lead to code


No, that was exactly my point:  It should be an integer so that
comparisons and arithmetic is trivial.   A 32.32 format fits
nicely into a int64_t which is readily available in the language.


I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


As I said in my previous email:

   typedef dur_t   int64_t;/* signed for bug catching */
   #define DURSEC  ((dur_t)1  32)
   #define DURMIN  (DURSEC * 60)
   #define DURMSEC (DURSEC / 1000)
   #define DURUSEC (DURSEC / 1000)
   #define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


Constructing the cookie for my special case would not be so easy.


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes:

 With this format you can specify callouts 68 years into the future
 with quarter nanosecond resolution, and you can trivially and
 efficiently compare dur_t's with
  if (d1  d2)

This would make a better general format than timevals, timespecs and
of course bintimes :-).

Except that for absolute timescales, we're running out of the 32 bits
integer part.

Bintimes is a necessary superset of the 32.32 which tries to work
around the necessary but missing int96_t or int128_t[1].

Poul-Henning


[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano

On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:
 On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:

 
 In message
 CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
 , Davide Italiano writes:

 Right now -- the precision is specified in 'bintime', which is a binary
 number.
 It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
 the specific platform.


 And that is way overkill for specifying a callout, at best your clock
 has short term stabilities approaching 1e-8, but likely as bad as 1e-6.


 So you always agreed with me that bintimes are unsuitable for almost
 everything, and especially unsuitable for timeouts? :-)


 (The reason why bintime is important for timekeeping is that we
 accumulate timeintervals approx 1e3 times a second, so the rounding
 error has to be much smaller than the short term stability in order
 to not dominate)


 bintimes are not unsuitable for timekeeping, but they a painful to use
 for other APIs.  You have to either put bintimes in layers in the other
 APIs, or convert them to a more suitable format, and there is a problem
 placing the conversion at points where it is efficient.  This thread
 seems to be mostly about putting the conversion in wrong places.  My
 original objection was about using bintimes for almost everything at
 the implementation level.


 I do not really think it worth to create another structure for
 handling time (e.g. struct bintime32), as it will lead to code


 No, that was exactly my point:  It should be an integer so that
 comparisons and arithmetic is trivial.   A 32.32 format fits
 nicely into a int64_t which is readily available in the language.


 I would have tried a 32 bit format with a variable named 'ticks'.
 Something like:
 - ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
   this.  The tick period would be constant but for virtual ticks and
   not too small.  hz = 1000 now makes the period too small, and not a
   power of 2.  So make the period 1/128 second.  This gives a 1.24.7
   binary format.  2**24 seconds is 194 days.
 - ticks  0.  The 31 value bits are now a cookie (descriptor) referring
   to a bintime or whatever.  This case should rarely be used.  I don't
   like it that a tickless kernel, which is needed mainly for power
   saving, has expanded into complications to support short timeouts
   which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


 As I said in my previous email:

typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

 (Bikeshed the names at your convenience)

 Then you can say

 callout_foo(34 * DURSEC)
 callout_foo(2400 * DURMSEC)
 or
 callout_foo(500 * DURNSEC)


 Constructing the cookie for my special case would not be so easy.


 With this format you can specify callouts 68 years into the future
 with quarter nanosecond resolution, and you can trivially and
 efficiently compare dur_t's with
 if (d1  d2)


 This would make a better general format than timevals, timespecs and
 of course bintimes :-).  It is a bit wasteful for timeouts since
 its extremes are rarely used.  Malicious and broken callers can
 still cause overflow at 68 years, so you have to check for it and
 handle it.  The limit of 194 days is just as good for timeouts.

 Bruce

I think the phk's proposal  is better. About your overflow objection,
I think is really unlikely to happen, but better safe than sorry.

Thanks

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes:
On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:

 Except that for absolute timescales, we're running out of the 32 bits
 integer part.

Except 32 bit time_t works until 2106 if it is unsigned.

That's sort of not an option.

The real problem was that time_t was not defined as a floating
point number.

 [1] A good addition to C would be a general multi-word integer type
 where you could ask for any int%d_t or uint%d_t you cared for, and
 have the compiler DTRT.  In difference from using a multiword-library,
 this would still give these types their natural integer behaviour.

That would be convenient, but bad for efficiency if it were actually
used much.

You can say that about anything but CPU-native operations, and I doubt
it would be as inefficient as struct bintime, which does not have access
to the carry bit.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121219221518.e1...@besplex.bde.org, Bruce Evans writes:


With this format you can specify callouts 68 years into the future
with quarter nanosecond resolution, and you can trivially and
efficiently compare dur_t's with
if (d1  d2)


This would make a better general format than timevals, timespecs and
of course bintimes :-).


Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


Bintimes is a necessary superset of the 32.32 which tries to work
around the necessary but missing int96_t or int128_t[1].

[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Luigi Rizzo

On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote:
 
 In message 
 CACYV=-eg542ihm9kfujpvczzra4tqepebva8rzt1yohncgf...@mail.gmail.com
 , Davide Italiano writes:
 
 Right now -- the precision is specified in 'bintime', which is a binary 
 number.
 It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in
 the specific platform.
 
 And that is way overkill for specifying a callout, at best your clock
 has short term stabilities approaching 1e-8, but likely as bad as 1e-6.
 
 (The reason why bintime is important for timekeeping is that we
 accumulate timeintervals approx 1e3 times a second, so the rounding
 error has to be much smaller than the short term stability in order
 to not dominate)
 
 I do not really think it worth to create another structure for
 handling time (e.g. struct bintime32), as it will lead to code
 
 No, that was exactly my point:  It should be an integer so that
 comparisons and arithmetic is trivial.   A 32.32 format fits
 nicely into a int64_t which is readily available in the language.
 
 As I said in my previous email:
 
 
 typedef dur_t   int64_t;/* signed for bug catching */
 #define DURSEC  ((dur_t)1  32)
 #define DURMIN  (DURSEC * 60)
 #define DURMSEC (DURSEC / 1000)
 #define DURUSEC (DURSEC / 1000)
 #define DURNSEC (DURSEC / 100)
 
 (Bikeshed the names at your convenience)
 
 Then you can say
 
   callout_foo(34 * DURSEC)
   callout_foo(2400 * DURMSEC)
 or 
   callout_foo(500 * DURNSEC)

only thing, we must be careful with the parentheses

For instance, in your macro, DURNSEC evaluates to 0 and so
does any multiple of it.
We should define them as

#define DURNSEC DURSEC / 100
...

so DURNSEC is still 0 and 500*DURNSEC gives 214

I am curious that Bruce did not mention this :)

(btw the typedef is swapped, should be typedef int64_t dur_t)

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 20121219150809.ga98...@onelab2.iet.unipi.it, Luigi Rizzo writes:

 typedef dur_t   int64_t;/* signed for bug catching */
 #define DURSEC  ((dur_t)1  32)
 #define DURMIN  (DURSEC * 60)
 #define DURMSEC (DURSEC / 1000)
 #define DURUSEC (DURSEC / 1000)
 #define DURNSEC (DURSEC / 100)


only thing, we must be careful with the parentheses

Actually, it's more impportant to be careful with zeros, if you
adjust the above to the correct number of zeros, DURNSEC is 4,
which is within seven percent of the correct value.

(btw the typedef is swapped, should be typedef int64_t dur_t)

Yes, I'm trying to find out of people even listen to me :-)


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Alexander Motin


On 19.12.2012 16:20, Bruce Evans wrote:

On Wed, 19 Dec 2012, Davide Italiano wrote:


On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au
wrote:



I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


In the old API, it is an error to pass ticks  0, so only broken old
callers are affected.  Of course, if there are any then it would be
hard to detect their garbage cookies.

Anywy, it's too later to change to this, and maybe also to a 32.32
format.


It would be late to change this after committing. I would definitely 
like it to be done earlier to not redo all the tests, but I think we 
could convert callout and eventtimers code to 32.32 format in several 
days. The only two questions are: do we really want it (won't there be 
any reasons to regret about it) and how do we want it to look?


For the first question my personal showstopper since eventtimers 
creation always was the wish to keep consistency. But benefits of 32.32 
format are clear, and if there are more votes for it, let's consider. If 
now it will be decided that full range will never be useful for callout 
subsystem, then it is obviously not needed for eventtimers also.


About the second question, what do you think about such prototypes:

typedef int64_t sbintime_t

static __inline sbintime_t
bintime_shrink(struct bintime *bt)
{}

static __inline struct bintime
bintime_expand(sbintime_t sbt)
{}

...

int
callout_reset_bt(struct callout *, sbintime_t sbt, sbintime_t pr,
void (*fn)(void *), void *arg, int flags);

, where pr used only for absolute precision, and flags includes: direct 
execution, absolute/relative time in argument, relative precision in 
case of relative sbt, flag for aligning to hardclock() to emulate legacy 
behavior, and potentially flags for reaction on suspend/resume.


Another option is to move absolute precision also to flags, using log2() 
representation, as we tried and as was proposed before. With possibility 
to use precise relative time there will be few cases requiring absolute 
value of precision, that should depend on period. Then there will be no 
extra arguments in the most usual cases.


Wrapper for existing API compatibility will look just like this:

#define callout_reset(c, ticks, fn, arg)\
callout_reset_bt(c, ticks2sbintime(ticks), -1,  \
(fn), (arg), C_HARDCLOCK)

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-19 Thread Davide Italiano

dropping phk _AT_ onelab2 _DOT_ something from CC as long as it
doesn't seem a valid mail address and I'm annoyed mails bounce back.

On Wed, Dec 19, 2012 at 6:20 AM, Bruce Evans b...@optusnet.com.au wrote:
 On Wed, 19 Dec 2012, Davide Italiano wrote:

 On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:


 I would have tried a 32 bit format with a variable named 'ticks'.
 Something like:
 - ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
   this.  The tick period would be constant but for virtual ticks and
   not too small.  hz = 1000 now makes the period too small, and not a
   power of 2.  So make the period 1/128 second.  This gives a 1.24.7
   binary format.  2**24 seconds is 194 days.
 - ticks  0.  The 31 value bits are now a cookie (descriptor) referring
   to a bintime or whatever.  This case should rarely be used.  I don't
   like it that a tickless kernel, which is needed mainly for power
   saving, has expanded into complications to support short timeouts
   which should rarely be used.


 Bruce, I don't really agree with this.
 The data addressed by cookie should be still stored somewhere, and KBI
 will result broken. This, indeed, is not real problem as long as
 current calloutng code heavily breaks KBI, but if that was your point,
 I don't see how your proposed change could help.


 In the old API, it is an error to pass ticks  0, so only broken old
 callers are affected.  Of course, if there are any then it would be
 hard to detect their garbage cookies.

 Anywy, it's too later to change to this, and maybe also to a 32.32
 format.

 [32.32 format]

It's not too late. What I'd like to do, right now people got
interested in the problem is agreeing on the interface used.
Following this thread, as I've already discussed to mav@, we would
like to decide what of the two is better:
- specify precision as additional argument (as we're doing right now)
- use 'flags' argument
If we allow time argument to be relative and not absolute, as
suggested by luigi@, we can easily use relative precision where we had
to use ffl() before.



 This would make a better general format than timevals, timespecs and
 of course bintimes :-).  It is a bit wasteful for timeouts since
 its extremes are rarely used.  Malicious and broken callers can
 still cause overflow at 68 years, so you have to check for it and
 handle it.  The limit of 194 days is just as good for timeouts.


 I think the phk's proposal  is better. About your overflow objection,
 I think is really unlikely to happen, but better safe than sorry.


 It's very easy for applications to cause kernel overflow using valid
 syscall args like tv_sec = TIME_T_MAX for a relative time in
 nanosleep().  Adding TIME_T_MAX to the current time in seconds overflow
 for all current times except for the first second after the Epoch.
 There is no difference between the overflow for 32-bit and 64-bit
 time_t's for this.  This is now mostly handled so that the behaviour is
 harmless although wrong.  E.g., the timeout might become negative,
 and then since it is not a cookie it is silently replaced by a timeout
 of 1 tick.  In nanosleep(), IIRC there are further overflows that result
 in returning early instead of retrying the 1-tick timeouts endlessly.

 Bruce

Thanks,

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


On Wed, 19 Dec 2012, Davide Italiano wrote:


On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans b...@optusnet.com.au wrote:



I would have tried a 32 bit format with a variable named 'ticks'.
Something like:
- ticks = 0.  Same meaning as now.  No changes in ABIs or APIs to use
  this.  The tick period would be constant but for virtual ticks and
  not too small.  hz = 1000 now makes the period too small, and not a
  power of 2.  So make the period 1/128 second.  This gives a 1.24.7
  binary format.  2**24 seconds is 194 days.
- ticks  0.  The 31 value bits are now a cookie (descriptor) referring
  to a bintime or whatever.  This case should rarely be used.  I don't
  like it that a tickless kernel, which is needed mainly for power
  saving, has expanded into complications to support short timeouts
  which should rarely be used.


Bruce, I don't really agree with this.
The data addressed by cookie should be still stored somewhere, and KBI
will result broken. This, indeed, is not real problem as long as
current calloutng code heavily breaks KBI, but if that was your point,
I don't see how your proposed change could help.


In the old API, it is an error to pass ticks  0, so only broken old
callers are affected.  Of course, if there are any then it would be
hard to detect their garbage cookies.

Anywy, it's too later to change to this, and maybe also to a 32.32
format.

[32.32 format]

This would make a better general format than timevals, timespecs and
of course bintimes :-).  It is a bit wasteful for timeouts since
its extremes are rarely used.  Malicious and broken callers can
still cause overflow at 68 years, so you have to check for it and
handle it.  The limit of 194 days is just as good for timeouts.


I think the phk's proposal  is better. About your overflow objection,
I think is really unlikely to happen, but better safe than sorry.


It's very easy for applications to cause kernel overflow using valid
syscall args like tv_sec = TIME_T_MAX for a relative time in
nanosleep().  Adding TIME_T_MAX to the current time in seconds overflow
for all current times except for the first second after the Epoch.
There is no difference between the overflow for 32-bit and 64-bit
time_t's for this.  This is now mostly handled so that the behaviour is
harmless although wrong.  E.g., the timeout might become negative,
and then since it is not a cookie it is silently replaced by a timeout
of 1 tick.  In nanosleep(), IIRC there are further overflows that result
in returning early instead of retrying the 1-tick timeouts endlessly.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



In message 20121220005706.i1...@besplex.bde.org, Bruce Evans writes:

On Wed, 19 Dec 2012, Poul-Henning Kamp wrote:



Except that for absolute timescales, we're running out of the 32 bits
integer part.


Except 32 bit time_t works until 2106 if it is unsigned.


That's sort of not an option.


I think it is.  It is just probably not necessary since 32-bit systems
will go away before 2038.


The real problem was that time_t was not defined as a floating
point number.


That would be convenient too, but bad for efficiency on some systems.
Kernels might not be able to use it, and then would have to use an
alternative representation, which they should have done all along.


[1] A good addition to C would be a general multi-word integer type
where you could ask for any int%d_t or uint%d_t you cared for, and
have the compiler DTRT.  In difference from using a multiword-library,
this would still give these types their natural integer behaviour.


That would be convenient, but bad for efficiency if it were actually
used much.


You can say that about anything but CPU-native operations, and I doubt
it would be as inefficient as struct bintime, which does not have access
to the carry bit.


Yes, I would say that about non-native.  It goes against the spirit of C.

OTOH, compilers are getting closer to giving full access to the carry
bit.  I just checked what clang does in a home-made 128-bit add function:

% static void __noinline
% uadd(struct u *xup, struct u *yup)
% {
%   unsigned long long t;
% 
% 	t = xup-w[0] + yup-w[0];

%   if (t  xup-w[0])
%   xup-w[1]++;
%   xup-w[0] = t;
%   xup-w[1] += yup-w[1];
% }
% 
% 	.align	16, 0x90

%   .type   uadd,@function
% uadd:   # @uadd
%   .cfi_startproc
% # BB#0: # %entry
%   movq(%rdi), %rcx
%   movq8(%rdi), %rax
%   addq(%rsi), %rcx

gcc generates an additional cmpq instruction here.

%   jae .LBB2_2

clang uses the carry bit set by the first addition to avoid the comparison,
but still branches.

% # BB#1: # %if.then
%   incq%rax
%   movq%rax, 8(%rdi)

This adds 1 explicitly instead of using adcq, but this is the slow path.

% .LBB2_2:# %if.end
%   movq%rcx, (%rdi)
%   addq8(%rsi), %rax

This is as efficient as possible except for the extra branch, and the
branch is almost perfectly predictable.

%   movq%rax, 8(%rdi)
%   ret
% .Ltmp22:
%   .size   uadd, .Ltmp22-uadd
%   .cfi_endproc

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


In message 50d1e0d8.9070...@freebsd.org, Alexander Motin writes:

It would be late to change this after committing. I would definitely 
like it to be done earlier to not redo all the tests, but I think we 
could convert callout and eventtimers code to 32.32 format in several 
days. The only two questions are: do we really want it (won't there be 
any reasons to regret about it) and how do we want it to look?

As much as it pains me to raise this point, we would regret it
if we did not use 32.32, because Linux already went that way.

As much as there is to be said for doing things right, we should
also try to avoid pointless incompatibilities which will make it
needlessly hard for people to move code, particular device drivers
forth and back.


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


I finally remembered to remove the .it phk :-).

On Wed, 19 Dec 2012, Luigi Rizzo wrote:


On Wed, Dec 19, 2012 at 10:51:48AM +, Poul-Henning Kamp wrote:

...
As I said in my previous email:


typedef dur_t   int64_t;/* signed for bug catching */
#define DURSEC  ((dur_t)1  32)
#define DURMIN  (DURSEC * 60)
#define DURMSEC (DURSEC / 1000)
#define DURUSEC (DURSEC / 1000)
#define DURNSEC (DURSEC / 100)

(Bikeshed the names at your convenience)

Then you can say

callout_foo(34 * DURSEC)
callout_foo(2400 * DURMSEC)
or
callout_foo(500 * DURNSEC)


only thing, we must be careful with the parentheses

For instance, in your macro, DURNSEC evaluates to 0 and so
does any multiple of it.
We should define them as

#define DURNSEC DURSEC / 100
...

so DURNSEC is still 0 and 500*DURNSEC gives 214

I am curious that Bruce did not mention this :)


Er, he was careful.  DURNSEC gives 4, not 0.  This is not very accurate,
but probably good enough.

Your version without parentheses is not so careful and depends on
a magic order of operations and no overflow from this.  E.g.:

500*DURNSEC = 500*DURSEC / 10 = 500*((dur_t)1  32) / 10

This is very accurate and happens not to overflow.  But 5 seconds represented
a little strangely in nanoseconds would overflow:

50*DURNSEC = 50*((dur_t)1  32) / 10

So would 5 billion times DURSEC, but 5 billion seconds is more unreasobable
than 5 billion nanoseconds and the format just can't represent that.



(btw the typedef is swapped, should be typedef int64_t dur_t)


Didn't notice this.

Bruce
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-18 Thread Alexander Motin

Experiments with dummynet shown ineffective support for very short 
tick-based callouts. New version fixes that, allowing to get as many 
tick-based callout events as hz value permits, while still be able to 
aggregate events and generating minimum of interrupts.


Also this version modifies system load average calculation to fix some 
cases existing in HEAD and 9 branches, that could be fixed with new 
direct callout functionality.


http://people.freebsd.org/~mav/calloutng_12_17.patch

With several important changes made last time I am going to delay commit 
to HEAD for another week to do more testing. Comments and new test cases 
are welcome. Thanks for staying tuned and commenting.


On 17.12.2012 01:37, Alexander Motin wrote:

Here is one more version. Unless something new will be found/reported
this may be the last one, because me and Davide are quite satisfied with
the results. If everything will be fine, I think we could commit it to
HEAD closer to the end of the week:
http://people.freebsd.org/~mav/calloutng_12_16.patch

Changes in this version:
  -- Removed couple of redundant variables in callout implementation,
that reduced sizeof(struct callout) by two pointers and simplified some
internal code.
  -- syscons driver was made to schedule only 1-2 callouts per second
instead of 20-30 before when console is in graphical mode and there are
few other things to do. Now my laptop has only about 30 interrupts per
second total during idle periods with X running.
  -- i8254 eventtimer driver was optimized to work faster in disabled by
default one-shot mode.
  -- Few kernel functions were added to make KPIs more complete.
  -- Man pages were updated.
  -- Some style fixes were made.

On 15.12.2012 18:55, Alexander Motin wrote:

I'm sorry to interrupt review, but as usual good ideas came during the
final testing, causing another round. :)  Here is updated patch for
HEAD, that includes several new changes:
http://people.freebsd.org/~mav/calloutng_12_15.patch

The new changes are:
  -- Precision and event aggregation code was reworked. Instead of
previous -prec/+prec representation, precision is now single-sided --
-0/+prec. It allowed to significantly improve precision on long time
intervals for APIs which imply that event should not happen before the
specified time. Depending on CPU activity, mistake for long time
intervals now will never be more then 1-500ms, even if specified
precision allows more.
  -- Some minor optimizations were made to reduce callout overhead and
latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer
and TSC timecounter usleep(1) call from user-level executes in just
5-6us, instead of 7-8us before. Now it can do 180K cycles per second on
single CPU with only partial CPU load.
  -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd,
setrlimit) were modified to reduce number of interrupts, also with event
aggregation by explicit specification of the acceptable events
precision. Now my Core2Duo test system has only 30 interrupts per second
in idle. If not remaining syscons events, it could easily be 15. My
IvyBridge ultrabook first time in its history shown 5.5 hours of battery
time with full screen brightness and 10 hours with lid closed.
  -- Some kernel functions were added to make KPIs more complete.

I've successfully tested this patch on amd64 and arm.



--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

On Mon, Dec 17, 2012 at 01:22:59PM -0800, Adrian Chadd wrote:
 Personally, I'd rather see some consistently used units here..

bintime (or something similar) is the correct choice here.
If we are concerned about the size (128 bit) then we
can map it to a shorter, fixed point format, such
as sign+31+32 as phk was suggesting.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
 Hi.
 
  I would instead do the following:
 
 I also don't very like the wide API and want to hear fresh ideas, but 
 approaches to time measurement there are too different to do what you 
 are proposing.  Main problem is that while ticks value is relative, 
 bintime is absolute. It is not easy to make conversion between them fast 
 and precise. I've managed to do it, but the only function that does it 
 now is _callout_reset_on(). All other functions are just passing values 
 down. I am not sure I want to duplicate that code in each place, though 
 doing it at least for for callout may be a good idea.

I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+   if (bt == NULL)
+   callout_reset_flags_on(td-td_slpcallout, timo,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+   else
+   callout_reset_bt_on(td-td_slpcallout, bt, precision,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
   that callouts use internally is a bintime+precision, define one
   single function (instead of two or three or six) that implements
   the blessed API, and implement the others with macros or
   inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
   It can be implemented through the *_bt() variant, and
   being a new function the only places where you introduce it
   require manual modifications so you can directly invoke
   the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it

cheers
luigi

 
 Creating sets of three functions I had three different goals:
  - callout_reset() -- it is legacy variant required to keep API 
 compatibility;
  - callout_reset_flags() -- it is for cases where custom precision 
 specification needs to be added to the existing code, or where direct 
 callout execution is needed. Conversion to bintime would additionally 
 complicate consumer code, that I would try to avoid.
  - callout_reset_bt() -- API for the new code, which needs high 
 precision and doesn't mind to operate bintime. Now there is only three 
 such places in kernel now, and I don't think there will be much more.
 
 Respectively, these three options are replicated to other APIs where 
 time intervals are used.
 
 PS: Please keep me in CC.
 
 -- 
 Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

On Tue, Dec 18, 2012 at 06:36:43PM +0100, Luigi Rizzo wrote:
 On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
...
 So I believe my proposal would give large simplifications in
 the code and lead to a much cleaner implementation of what
 you have designed:
 
 1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;
 
 2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

to clarify: i am not sure if now the *_bt() variant takes flags too,
but my point is that the main API function should take all
supported arguments (including flags) and others should
simply be regarded as simplified versions. More or less what
we have for sockets, with send() and sendmsg() and friend

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Alexander Motin


On 18.12.2012 19:36, Luigi Rizzo wrote:

On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:

I would instead do the following:


I also don't very like the wide API and want to hear fresh ideas, but
approaches to time measurement there are too different to do what you
are proposing.  Main problem is that while ticks value is relative,
bintime is absolute. It is not easy to make conversion between them fast
and precise. I've managed to do it, but the only function that does it
now is _callout_reset_on(). All other functions are just passing values
down. I am not sure I want to duplicate that code in each place, though
doing it at least for for callout may be a good idea.


I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+   if (bt == NULL)
+   callout_reset_flags_on(td-td_slpcallout, timo,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+   else
+   callout_reset_bt_on(td-td_slpcallout, bt, precision,
+   sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it


Your words about useless cascaded ifs touched me. Actually I've looked 
on _callout_reset_bt_on() yesterday, thinking about moving tick to bt 
conversion to separate function or wrapper. The only thing we would save 
in such case is single integer argument (ticks), as all others (bt, 
prec, flags) are used in the new world order. From the other side, to 
make the conversion process really effective and correct, I've used 
quite specific way of obtaining time, that may change in the future. I 
would not really like it to be inlined in every consumer function and 
become an ABI. So I see two possible ways: make that conversion a 
separate non-inline function (that will require two temporary variables 
to return results and will consume some time on call/return), or make 
callout_reset_bt_on() to have extra ticks argument, allowing to use it 
in one or another way without external ifs and macros. In last case all 
_bt functions in other APIs will also obtain ticks, bt, pr and flags 
arguments. Actually flags there could be used to specify time scale 
(monotonic or wall) and time base (relative or absolute), if we

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Alexander Motin


On 18.12.2012 20:03, Alexander Motin wrote:

On 18.12.2012 19:36, Luigi Rizzo wrote:

On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:

I would instead do the following:


I also don't very like the wide API and want to hear fresh ideas, but
approaches to time measurement there are too different to do what you
are proposing.  Main problem is that while ticks value is relative,
bintime is absolute. It is not easy to make conversion between them fast
and precise. I've managed to do it, but the only function that does it
now is _callout_reset_on(). All other functions are just passing values
down. I am not sure I want to duplicate that code in each place, though
doing it at least for for callout may be a good idea.


I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in a haste to make things work.

I will give you another example to show how convoluted
is the code now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig() )

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+if (bt == NULL)
+sleepq_set_timeout_flags(cvp, timo, flags);
+else
+sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+if (bt == NULL)
+callout_reset_flags_on(td-td_slpcallout, timo,
+sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+else
+callout_reset_bt_on(td-td_slpcallout, bt, precision,
+sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are again remapped through
macros to _callout_reset_on(), so another useless if (bt == NULL)
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is down
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
that callouts use internally is a bintime+precision, define one
single function (instead of two or three or six) that implements
the blessed API, and implement the others with macros or
inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
It can be implemented through the *_bt() variant, and
being a new function the only places where you introduce it
require manual modifications so you can directly invoke
the new function.

Again, please take this as constructive criticism, as i really
like the work you have been doing and appreciate the time and
effort you are putting on it


Your words about useless cascaded ifs touched me. Actually I've looked
on _callout_reset_bt_on() yesterday, thinking about moving tick to bt
conversion to separate function or wrapper. The only thing we would save
in such case is single integer argument (ticks), as all others (bt,
prec, flags) are used in the new world order. From the other side, to
make the conversion process really effective and correct, I've used
quite specific way of obtaining time, that may change in the future. I
would not really like it to be inlined in every consumer function and
become an ABI. So I see two possible ways: make that conversion a
separate non-inline function (that will require two temporary variables
to return results and will consume some time on call/return), or make
callout_reset_bt_on() to have extra ticks argument, allowing to use it
in one or another way without external ifs and macros. In last case all
_bt functions in other APIs will also obtain ticks, bt, pr and flags
arguments. Actually flags there could be used to specify time scale
(monotonic or wall) and time base (relative or absolute), if we decide
to implement all

Re: API explosion (Re: [RFC/RFT] calloutng)

[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.

The key difference between 'ticks' and bintimes (and the main
difficulty in the conversion) is that ticks are relative and bintimes
are interpreted as absolute. This could be easily solved by using
a flag to specify if the 'bt' argument is absolute or relative, and
passing the argument by value.

So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the
precision, and another 64 or 128 bit field contains the bintime.

How does this look ?

cheers
luigi

 On 18.12.2012 20:03, Alexander Motin wrote:
 On 18.12.2012 19:36, Luigi Rizzo wrote:
 On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
 I would instead do the following:
 
 I also don't very like the wide API and want to hear fresh ideas, but
 approaches to time measurement there are too different to do what you
 are proposing.  Main problem is that while ticks value is relative,
 bintime is absolute. It is not easy to make conversion between them fast
 and precise. I've managed to do it, but the only function that does it
 now is _callout_reset_on(). All other functions are just passing values
 down. I am not sure I want to duplicate that code in each place, though
 doing it at least for for callout may be a good idea.
 
 I am afraid the above is not convincing.
 
 Most/all of the APIs i mentioned still have the conversion from
 ticks to bintime, and the code in your patch is just
 building multiple parallel paths (one for each of the
 three versions of the same function) to some final
 piece of code where the conversion takes place.
 
 The problem is that all of this goes through a set of obfuscating
 macros and the end result is horrible.
 
 To be clear, i believe the work you have been doing on cleaning up
 callout is great, i am just saying that this is the time to look
 at the code from a few steps away and clean up all those design
 decisions that perhaps were made in a haste to make things work.
 
 I will give you another example to show how convoluted
 is the code now:
 
 cv_timedwait() and cv_timedwait_sig() now have three
 versions each (plain, bt, flags).
 
 These six are remapped through macros to two functions, _cv_timedwait()
 and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
 maps to _cv_timedwait_sig() )
 
 These two _cv_timedwait*() take both ticks and bintimes,
 and contain this sequence:
 
 +if (bt == NULL)
 +sleepq_set_timeout_flags(cvp, timo, flags);
 +else
 +sleepq_set_timeout_bt(cvp, bt, precision);
 
 Guess what, both sleepq_* are macros that remap to the same
 _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.
 
 But then if you dig into _sleepq_set_timeout() you'll see
 
 +if (bt == NULL)
 +callout_reset_flags_on(td-td_slpcallout, timo,
 +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
 +else
 +callout_reset_bt_on(td-td_slpcallout, bt, precision,
 +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
 
 and again both callout_reset*() are again remapped through
 macros to _callout_reset_on(), so another useless if (bt == NULL)
 And in the end you have the conversion from ticks to bintime.
 
 So basically the code path for cv_timedwait() has those
 two useless switches and one useless extra argument,
 and the conversion from ticks to bintime is down
 deep down in _callout_reset_on() where it can only
 be resolved at runtime, whereas by doing the conversion
 at the beginning the decision could have been made at compile
 time.
 
 So I believe my proposal would give large simplifications in
 the code and lead to a much cleaner implementation of what
 you have designed:
 
 1. acknowledge the fact that the only representation of time
 that callouts use internally is a bintime+precision, define one
 single function (instead of two or three or six) that implements
 the blessed API, and implement the others with macros or
 inline functions doing the appropriate conversions;
 
 2. specifically, the *_flags() variant has no reason to exist.
 It can be implemented through the *_bt() variant, and
 being a new function the only places where you introduce it
 require manual modifications so you can

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Ian Lepore

On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
 [top posting for readability;
 in summary we were discussing the new callout API trying to avoid
 an explosion of methods and arguments while at the same time
 supporting the old API and the new one]
 (I am also Cc-ing phk as he might have better insight
 on the topic).
 
 I think the patch you propose is a step in the right direction,
 but i still remain concerned by having to pass two bintimes
 (by reference, but they should really go by value)
 and one 'ticks' value to all these functions.
 
 I am also dubious that we need a full 128 bits to specify
 the 'precision': there would be absolutely no loss of functionality
 if we decided to specify the precision in powers of 2, so a precision
 'k' (signed) means 2^k seconds. This way 8 bits are enough to
 represent any precision we want.
 
 The key difference between 'ticks' and bintimes (and the main
 difficulty in the conversion) is that ticks are relative and bintimes
 are interpreted as absolute. This could be easily solved by using
 a flag to specify if the 'bt' argument is absolute or relative, and
 passing the argument by value.
 
 So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the
 precision, and another 64 or 128 bit field contains the bintime.
 
 How does this look ?
 
 cheers
 luigi
 

I tend to agree that the bintime should be passed by value instead of
reference.  That would allow an inline tickstobintime() that converts
relative ticks to an absolute bintime returned by value and passed right
along in one tidy line/clump of code without any temporary variables
cluttering things up.  

While the 1980s C programmer in me still wants to avoid returning
complex objects by value, the reality is that modern compilers tend to
generate really nice code for such constructs, usually without any
copying of the return value at all.

I'm not so sure about the 2^k precision.  You speak of seconds, but I
would be worrying about sub-second precision in my work.  It would
typical to want a 500uS timeout but be willing to late by up to 250uS if
that helped scheduling and performance.  Also, my idea of precision
would virtually always be I'm willing to be late by this much, but
never early by any amount.

To reinforce the point of that last paragraph... the way I'm looking at
these changes has nothing to do with power saving (I've never owned a
battery-operated computer, probably never will) and everything to do
with performance and being able to sleep accurately for less than a
tick.

-- Ian

  On 18.12.2012 20:03, Alexander Motin wrote:
  On 18.12.2012 19:36, Luigi Rizzo wrote:
  On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote:
  I would instead do the following:
  
  I also don't very like the wide API and want to hear fresh ideas, but
  approaches to time measurement there are too different to do what you
  are proposing.  Main problem is that while ticks value is relative,
  bintime is absolute. It is not easy to make conversion between them fast
  and precise. I've managed to do it, but the only function that does it
  now is _callout_reset_on(). All other functions are just passing values
  down. I am not sure I want to duplicate that code in each place, though
  doing it at least for for callout may be a good idea.
  
  I am afraid the above is not convincing.
  
  Most/all of the APIs i mentioned still have the conversion from
  ticks to bintime, and the code in your patch is just
  building multiple parallel paths (one for each of the
  three versions of the same function) to some final
  piece of code where the conversion takes place.
  
  The problem is that all of this goes through a set of obfuscating
  macros and the end result is horrible.
  
  To be clear, i believe the work you have been doing on cleaning up
  callout is great, i am just saying that this is the time to look
  at the code from a few steps away and clean up all those design
  decisions that perhaps were made in a haste to make things work.
  
  I will give you another example to show how convoluted
  is the code now:
  
  cv_timedwait() and cv_timedwait_sig() now have three
  versions each (plain, bt, flags).
  
  These six are remapped through macros to two functions, _cv_timedwait()
  and _cv_timedwait_sig(),  with a possible bug (cv_timedwait_bt()
  maps to _cv_timedwait_sig() )
  
  These two _cv_timedwait*() take both ticks and bintimes,
  and contain this sequence:
  
  +if (bt == NULL)
  +sleepq_set_timeout_flags(cvp, timo, flags);
  +else
  +sleepq_set_timeout_bt(cvp, bt, precision);
  
  Guess what, both sleepq_* are macros that remap to the same
  _sleepq_set_timeout(...) . So the above if (bt == NULL) is useless.
  
  But then if you dig into _sleepq_set_timeout() you'll see
  
  +if (bt == NULL)
  +callout_reset_flags_on(td-td_slpcallout, timo,
  +sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
  +else
  +

Re: API explosion (Re: [RFC/RFT] calloutng)

On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
 On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
  [top posting for readability;
  in summary we were discussing the new callout API trying to avoid
  an explosion of methods and arguments while at the same time
  supporting the old API and the new one]
  (I am also Cc-ing phk as he might have better insight
  on the topic).
  
  I think the patch you propose is a step in the right direction,
  but i still remain concerned by having to pass two bintimes
  (by reference, but they should really go by value)
  and one 'ticks' value to all these functions.
  
  I am also dubious that we need a full 128 bits to specify
  the 'precision': there would be absolutely no loss of functionality
  if we decided to specify the precision in powers of 2, so a precision
  'k' (signed) means 2^k seconds. This way 8 bits are enough to
  represent any precision we want.

...
 I'm not so sure about the 2^k precision.  You speak of seconds, but I
 would be worrying about sub-second precision in my work.  It would
 typical to want a 500uS timeout but be willing to late by up to 250uS if

i said k is signed so negative values represent fractions of a
second. 2^-128 is pretty short :)

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

2012-12-18 Thread Ian Lepore

On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:
 On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
  On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
   [top posting for readability;
   in summary we were discussing the new callout API trying to avoid
   an explosion of methods and arguments while at the same time
   supporting the old API and the new one]
   (I am also Cc-ing phk as he might have better insight
   on the topic).
   
   I think the patch you propose is a step in the right direction,
   but i still remain concerned by having to pass two bintimes
   (by reference, but they should really go by value)
   and one 'ticks' value to all these functions.
   
   I am also dubious that we need a full 128 bits to specify
   the 'precision': there would be absolutely no loss of functionality
   if we decided to specify the precision in powers of 2, so a precision
   'k' (signed) means 2^k seconds. This way 8 bits are enough to
   represent any precision we want.
 
 ...
  I'm not so sure about the 2^k precision.  You speak of seconds, but I
  would be worrying about sub-second precision in my work.  It would
  typical to want a 500uS timeout but be willing to late by up to 250uS if
 
 i said k is signed so negative values represent fractions of a
 second. 2^-128 is pretty short :)
 
 cheers
 luigi

Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
further review, then could precision be encoded in 8 bits of the flags,
eliminating another parm?

-- Ian


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)

On Tue, Dec 18, 2012 at 04:37:10PM -0700, Ian Lepore wrote:
 On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote:
  On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote:
   On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote:
[top posting for readability;
in summary we were discussing the new callout API trying to avoid
an explosion of methods and arguments while at the same time
supporting the old API and the new one]
(I am also Cc-ing phk as he might have better insight
on the topic).

I think the patch you propose is a step in the right direction,
but i still remain concerned by having to pass two bintimes
(by reference, but they should really go by value)
and one 'ticks' value to all these functions.

I am also dubious that we need a full 128 bits to specify
the 'precision': there would be absolutely no loss of functionality
if we decided to specify the precision in powers of 2, so a precision
'k' (signed) means 2^k seconds. This way 8 bits are enough to
represent any precision we want.
  
  ...
   I'm not so sure about the 2^k precision.  You speak of seconds, but I
   would be worrying about sub-second precision in my work.  It would
   typical to want a 500uS timeout but be willing to late by up to 250uS if
  
  i said k is signed so negative values represent fractions of a
  second. 2^-128 is pretty short :)
  
  cheers
  luigi
 
 Ahh, I missed that.  Good enough then!  Hmmm, if that ideas survives
 further review, then could precision be encoded in 8 bits of the flags,
 eliminating another parm?

that was also what i wrote later in the message :)

now we should figure out some use for the remaining 22 bits of the flags

cheers
luigi

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng


On 17.12.2012 03:29, Adrian Chadd wrote:

On 16 December 2012 15:37, Alexander Motin m...@freebsd.org wrote:

Here is one more version. Unless something new will be found/reported this
may be the last one, because me and Davide are quite satisfied with the
results. If everything will be fine, I think we could commit it to HEAD
closer to the end of the week:
http://people.freebsd.org/~mav/calloutng_12_16.patch


Don't you think one week is a little on the low side for reviewing
something this critical?


It was in public development for the last half a year. IIRC, previous 
announce by Davide few months ago caused no any feedback. If you say you 
will review it in two weeks, I will wait for two weeks. But I don't want 
to wait without a purpose.



Would you mind approaching some of the cluster peeps and seeing if
they'll run this up on the ref10* boxes and VMs, just to get some
further exposure?


Are the ref* system have any load to see anything?

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng


On 17.12.2012 05:38, Adrian Chadd wrote:

On 16 December 2012 18:31, Garrett Cooper yaneg...@gmail.com wrote:


Would you mind approaching some of the cluster peeps and seeing if
they'll run this up on the ref10* boxes and VMs, just to get some
further exposure?


And maybe tinderbox..?


Tinderbox is a great idea.

Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too?


It would be good to test, though I know that at least dummynet is 
written awful from the point of this project with its

callout_reset(dn_timeout, 1, dummynet, NULL);
It should work, but kill most of power benefits. I was promised it will 
be fixed after this project end.



What else gets heavily callout /timer driven? Try some computational
workloads that stress the fairness of ULE/4BSD, maybe?


Schedulers are driven directly by hardclock()/statclock(), so fairness 
is not affected here. If CPU is not idle, it will receive full set of 
required events with maximum available precision.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: [RFC/RFT] calloutng

2012-12-17 Thread Adrian Chadd

On 16 December 2012 23:57, Andriy Gapon a...@freebsd.org wrote:

 Thank god that this feature was developed in a branch, it was developed for a
 long period of time and there were people who routinely reviewed and tested 
 (and
 really used) it.  And yeah, its design was announced and discussed well in
 advance too.

I can see that; I was even there at David's presentation at BSDCan
2012 about it.

Now that it's finished though, it would be nice to get some more
stress testing before committing it. Just because it's been developed
in a branch doesn't at all imply that it's had wide testing. It's now
imminently going into the tree, so that may scare (heh!) a few people
into testing it.

I'm sure it'll mostly just work, that it'll not really break anything.
It just to me feels that a week of warning before committing something
like this is a little soon. It's _just_ been finished and the authors
have been doing some last minute changes as they get better ideas on
implementing things. I think it's great work, I'd just like to see
some wider testing.

So to put my money where my mouth is, I bought a new hard disk for my
T60 last week. I'll install -HEAD on it tomorrow and give it a whirl.

Thanks,


Adrian
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

calloutng and dummynet (Re: [RFC/RFT] calloutng)

On Mon, Dec 17, 2012 at 10:14:29AM +0200, Alexander Motin wrote:
 On 17.12.2012 05:38, Adrian Chadd wrote:
...
 Maybe hit up the altq/pf using crowd and see if they'll test this stuff 
 out too?
 
 It would be good to test, though I know that at least dummynet is 
 written awful from the point of this project with its
   callout_reset(dn_timeout, 1, dummynet, NULL);
 It should work, but kill most of power benefits. I was promised it will 
 be fixed after this project end.

never trust italians :)

but it is good that you are reminding me this, hopefully
i will be able to give it a shot after the holidays

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

[addressing the various items separately]

On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
...
- for several functions the only change is the name of an argument
from busy to us. Can you elaborate the reason for the change,
and whether us means microseconds or the pronoun ?)

Please see r242905 by mav@.

i see the goal of this patch is to pass along the amount of
time till the next timer.

I wonder why the choice is to use (actually, call) the value
microseconds rather use a bintime or something scaled and with a
well defined resolution.

In fact looking at the relevant diff

http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905

cpu_idleclock() actually returns a value that is not even microseconds
but 1/(2^20) seconds. The value seems to be ignored right now
so it would be a good time to discuss the resolution.

I am concerned that at some point (5 years from now perhaps ?)
microseconds might start to become too coarse and we would like
something with a more fine-grained resolution. On the
other hand, for the purposes of this change, we can probably
live with an upper limit of some seconds (waking up the machine
once per second is not going to kill performance).

So i would suggest to make the argument to these functions
uint_32 or uint_64 (preferably the same for 32- and 64-bit machines),
rename it to something different from 'us'
and have at least 28-30 fractional bits to represent a bintime.

Right now you are using a bintime with 20 fractional and 11 or 43
bits for the integer part.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng


Hi.

 I wonder why the choice is to use (actually, call) the value
 microseconds rather use a bintime or something scaled and with a
 well defined resolution.

It was kind of engineering choice. I've chosen microseconds, following 
values used by ACPI to represent CPU sleep states exit latencies.  Now 
that is the only usage for that value.  If CPUs so much reduce wakeup 
latencies to make this scale too coarse, this type will be the smallest 
of our optimization tasks. On the other side, I have some doubts that we 
will be able to reach supported 2048 seconds limit on the integer side. 
Now even completely empty idle system has about 30 interrupts per 
second, that is far from 0.0005. From the other side, I don't know any 
system where CPUs have 2048 seconds wakeup latency.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

API explosion (Re: [RFC/RFT] calloutng)

[again, response to another issue i raised]

On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
 On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
...
  Finally, a more substantial comment:
  - a lot of functions which formerly had only a timo argument
now have timo, bt, precision, flags. Take seltdwait() as an example.
 
 
 seltdwait() is not part of the public KPI. It has been modified to
 avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e.
 two separate functions, even though we could share most of the code is
 not a clever approach, IMHO.
 As I told before, seltdwait() is not exposed so we can modify its
 argument without breaking anything.
 
It seems that you have been undecided between two approaches:
for some of these functions you have preserved the original function
that deals with ticks and introduced a new one that deals with the
  bintime,
whereas in other cases you have modified the original function to add
bt, precision, flags.
 
 
 I'm not. All the functions which are part of the public KPI (e.g.
 condvar(9), sleepq(9), *) are still available.  *_flags variants have
 been introduced so that consumers can take advantage of the new
 'precision tolerance mechanism' implemented. Also, *_bt variants have
 been introduced. I don't see any undecision between the two
 approaches.
 Please note that now the callout backend deals with bintime, so every
 time callout_reset_on() is called, the 'tick' argument passed is
 silently converted to bintime.

I will try to give more specific example, using the latest patch
from mav

http://people.freebsd.org/~mav/calloutng_12_16.patch

In the manpage, for instance, the existing functions
now are extended with two more variants (sometimes;
msleep_spin() for instance is missing msleep_spin_bt()
but perhaps that is just an oversight).

 .Nm sleepq_set_timeout ,
+.Nm sleepq_set_timeout_flags ,
+.Nm sleepq_set_timeout_bt ,

 .Nm msleep ,
+.Nm msleep_flags ,
+.Nm msleep_bt ,
 .Nm msleep_spin ,
+.Nm msleep_spin_flags ,
 .Nm pause ,
+.Nm pause_flags ,
+.Nm pause_bt ,
 .Nm tsleep ,
+.Nm tsleep_flags ,
+.Nm tsleep_bt ,

 .Nm cv_timedwait ,
+.Nm cv_timedwait_bt ,
+.Nm cv_timedwait_flags ,
 .Nm cv_timedwait_sig ,
+.Nm cv_timedwait_sig_bt ,
+.Nm cv_timedwait_sig_flags ,

 .Nm callout_reset ,
+.Nm callout_reset_flags ,
 .Nm callout_reset_on ,
+.Nm callout_reset_flags_on ,
+.Nm callout_reset_bt_on ,

If you look at the backends, they take both a timo and a bintime.

-int_cv_timedwait(struct cv *cvp, struct lock_object *lock, int 
timo);
-int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, int 
timo);
+int_cv_timedwait(struct cv *cvp, struct lock_object *lock,
+   struct bintime *bt, struct bintime *precision, int timo,
+   int flags);
+int_cv_timedwait_sig(struct cv *cvp, struct lock_object *lock,
+   struct bintime *bt, struct bintime *precision, int timo,
+   int flags);

and then internally they call the 'timo' or the 'bt' version
depending on the case

+   if (bt == NULL)
+   sleepq_set_timeout_flags(cvp, timo, flags);
+   else
+   sleepq_set_timeout_bt(cvp, bt, precision);

So basically you are doing the following:
   + create two new variant for each existing function

foo(, ... timo, ... )   old method
foo_flags(, ... timo, ... ) new method
foo_bt(... , bt, precision, ...)new method

 (the 'API explosion' i am mentioning in the Subject)

   + the variants are mapped to the same internal function
_foo_internal(..., timo, bt, precision, flags, ...)

   + and then the internal function has a (runtime) conditional

if (bt == NULL) {
// convert timo to bt
} else {
// have a bt + precision
}
...

I would instead do the following:

+ create a new API function that takes only bintime+precision+flags, no ticks.
  I am not sure if there is even a need to have an internal name
_cv_timedwait_sig(  )
  or you can just have
cv_timedwait_sig_bt(...)

+ use a macro or an inline function to remap the old API to
  the (single) new one, making the argument conversion immediately.
#define cv_timedwait_sig(...)   cv_timedwait_sig_bt( ...)

  This has the advantage that conversions might be done at compile
  time perhaps with some advantage in terms of code and performance.

+ do not bother creating yet another cv_timedwait_sig_flags() function.
  Since it did not exist before, you have to do the conversion manually
  anyways, at which point you just change the argument to use bintime
  instead of ticks.

Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng

2012-12-17 Thread Davide Italiano

On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo ri...@iet.unipi.it wrote:
[addressing the various items separately]

Please see r242905 by mav@.

i see the goal of this patch is to pass along the amount of
time till the next timer.

I wonder why the choice is to use (actually, call) the value
microseconds rather use a bintime or something scaled and with a
well defined resolution.

In fact looking at the relevant diff

http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905r2=242904pathrev=242905

cpu_idleclock() actually returns a value that is not even microseconds
but 1/(2^20) seconds. The value seems to be ignored right now
so it would be a good time to discuss the resolution.

I would talk more about power consumption problem rather than performances.
Yes, you're right because now, even with calloutng changes, the CPU is
woken up at least twice per second.
The wheel scan, in case it doesn't find a new callout to schedule in
the next half-second, schedules an interrupt half a second from now
(where now is the time obtained using getbinuptime()/binuptime()).
This is a threshold we set up empirically, and it resulted is good
for now. But in the future someone may raise the threshold to 1
second, 10 seconds, or something like. So, I don't agree with your
statement.

Right now you are using a bintime with 20 fractional and 11 or 43
bits for the integer part.

cheers
luigi

Thanks

Davide
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: API explosion (Re: [RFC/RFT] calloutng)


Hi.

 I would instead do the following:

I also don't very like the wide API and want to hear fresh ideas, but 
approaches to time measurement there are too different to do what you 
are proposing.  Main problem is that while ticks value is relative, 
bintime is absolute. It is not easy to make conversion between them fast 
and precise. I've managed to do it, but the only function that does it 
now is _callout_reset_on(). All other functions are just passing values 
down. I am not sure I want to duplicate that code in each place, though 
doing it at least for for callout may be a good idea.


Creating sets of three functions I had three different goals:
 - callout_reset() -- it is legacy variant required to keep API 
compatibility;
 - callout_reset_flags() -- it is for cases where custom precision 
specification needs to be added to the existing code, or where direct 
callout execution is needed. Conversion to bintime would additionally 
complicate consumer code, that I would try to avoid.
 - callout_reset_bt() -- API for the new code, which needs high 
precision and doesn't mind to operate bintime. Now there is only three 
such places in kernel now, and I don't think there will be much more.


Respectively, these three options are replicated to other APIs where 
time intervals are used.


PS: Please keep me in CC.

--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org

Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng