Re: [patch V2 00/20] timer: Refactor the timer wheel
Hi!

On Sun 2016-06-26 12:21:46, Arjan van de Ven wrote:
> On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek wrote:
> >
> > Umm. I'm not sure if you should be designing kernel...
> >
> > I have alarm clock application. It does sleep(60) many times till its
> > time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> > without some very, very good reason.
>
> I'm fairly sure you shouldn't be designing alarm clock applications!
> Because on busy systems you get random (scheduler) delays added to
> your timer.

I'm pretty sure I should not be designing alarm clock applications, after
looking at the timezone stuff. But the alarm clock from MATE eats 3% CPU
on my cellphone, so I kind of had to.

And yes, I'm aware that scheduler delays would add up. But if it is 79
seconds before the alarm, I do sleep(79), and it would be strange to have
the alarm fire 5 seconds too late.

> Having said that, your example is completely crooked here, sleep()
> does not use these kernel timers, it uses hrtimers instead.
> (hrtimers also have slack, but an alarm clock application that is this
> broken would have the choice to set such slack to 0)
>
> What happened here is that these sigtimedwait timers were actually not
> great; it is just about the only application-visible interface that's
> still in jiffies/HZ, and in the follow-on patch set, Thomas converted
> them properly to hrtimers as well to make them both accurate and
> CONFIG_HZ independent.

So it is going to be fixed, good.

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek wrote:
>
> Umm. I'm not sure if you should be designing kernel...
>
> I have alarm clock application. It does sleep(60) many times till its
> time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> without some very, very good reason.

I'm fairly sure you shouldn't be designing alarm clock applications!
Because on busy systems you get random (scheduler) delays added to
your timer.

Having said that, your example is completely crooked here: sleep()
does not use these kernel timers, it uses hrtimers instead.
(hrtimers also have slack, but an alarm clock application that is this
broken would have the choice to set such slack to 0.)

What happened here is that these sigtimedwait timers were actually not
great; it is just about the only application-visible interface that's
still in jiffies/HZ, and in the follow-on patch set, Thomas converted
them properly to hrtimers as well to make them both accurate and
CONFIG_HZ independent.
Re: [patch V2 00/20] timer: Refactor the timer wheel
Hi!

> > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > to be intermittent, but the testcase mostly fails.
>
> You forgot to cc the LTP folks ...
>
> > rtbox:~ #
> > /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> > Test FAILED: sigtimedwait() did not return in the required time
> > time_elapsed: 1.197057
> > ...come on, you can do it...
> > rtbox:~ #
> > /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> > Test PASSED
> >
> > #define ERRORMARGIN 0.1
> > ...
> > if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
> >     || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
> >         printf("Test FAILED: sigtimedwait() did not return in "
> >                "the required time\n");
> >         printf("time_elapsed: %lf\n", time_elapsed);
> >         return PTS_FAIL;
> > }
> >
> > Looks hohum to me, but gripe did arrive with patch set, so you get a note.
>
> hohum is a euphemism. That's completely bogus.
>
> The only guarantee a syscall with timers has is: timer does not fire
> early.

Umm. I'm not sure if you should be designing kernel...

I have alarm clock application. It does sleep(60) many times till its
time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
without some very, very good reason.

So yes, the man page says this is the only requirement (and did not you
break it earlier in the patchset, with the sleep for 5 days? :-) ), but
no, it is not really the only requirement you have.

You may argue LTP's ERRORMARGIN is too strict. But you can't argue LTP is
completely bogus, and I'd say an error margin of 0.1 second is completely
reasonable (*). If I update my alarm clock application to display seconds,
I really want them with better precision than 0.1 second (*), because 0.1
seconds is already visible with the naked eye.

Best regards,
Pavel

(*) on a reasonably idle system.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 2016-06-22 at 11:06 +0200, Mike Galbraith wrote:
> On Wed, 2016-06-22 at 10:44 +0200, Thomas Gleixner wrote:
> > On Wed, 22 Jun 2016, Mike Galbraith wrote:
> > > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > > to be intermittent, but the testcase mostly fails.
> >
> > You forgot to cc the LTP folks ...
>
> This ain't the only one, it's just new. I'll mention it.
>
> File under FYI/FWIW: I also plugged the set into RT, and nothing fell
> out of local boxen. The below is falling out of my 8 socket box
> though.. maybe a portage booboo.
>
> [ 1503.988863] clocksource: timekeeping watchdog on CPU42: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [ 1504.203800] clocksource: 'hpet' wd_now: 38b55bb wd_last: 8303f269 mask:
> [ 1504.296111] clocksource: 'tsc' cs_now: 3a3aa717794 cs_last: 354624eea7b mask:
> [ 1504.402329] clocksource: Switched to clocksource hpet

Nope, not RT portage booboo. Virgin x86-tip/WIP.timers..

vogelweide:~/:[130]# dmesg|grep clocksource:
[0.00] clocksource: refined-jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1910969940391419 ns
[0.00] clocksource: hpet: mask: 0x max_cycles: 0x, max_idle_ns: 133484882848 ns
[5.608205] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275000 ns
[9.151208] clocksource: Switched to clocksource hpet
[9.485907] clocksource: acpi_pm: mask: 0xff max_cycles: 0xff, max_idle_ns: 2085701024 ns
[ 11.947226] clocksource: tsc: mask: 0x max_cycles: 0x20974986637, max_idle_ns: 440795286310 ns
[ 13.012145] clocksource: Switched to clocksource tsc
[ 434.868215] clocksource: timekeeping watchdog on CPU59: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 434.982251] clocksource: 'hpet' wd_now: 732cdf37 wd_last: df3d99d8 mask:
[ 435.085875] clocksource: 'tsc' cs_now: 16e6780d1eb cs_last: 11326fa576e mask:
[ 435.211249] clocksource: Switched to clocksource hpet
vogelweide:~/:[0]#
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 2016-06-22 at 10:44 +0200, Thomas Gleixner wrote:
> On Wed, 22 Jun 2016, Mike Galbraith wrote:
> > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > to be intermittent, but the testcase mostly fails.
>
> You forgot to cc the LTP folks ...

This ain't the only one, it's just new. I'll mention it.

File under FYI/FWIW: I also plugged the set into RT, and nothing fell
out of local boxen. The below is falling out of my 8 socket box
though.. maybe a portage booboo.

[ 1503.988863] clocksource: timekeeping watchdog on CPU42: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 1504.203800] clocksource: 'hpet' wd_now: 38b55bb wd_last: 8303f269 mask:
[ 1504.296111] clocksource: 'tsc' cs_now: 3a3aa717794 cs_last: 354624eea7b mask:
[ 1504.402329] clocksource: Switched to clocksource hpet
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 22 Jun 2016, Mike Galbraith wrote:
> FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> to be intermittent, but the testcase mostly fails.

You forgot to cc the LTP folks ...

> rtbox:~ #
> /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> Test FAILED: sigtimedwait() did not return in the required time
> time_elapsed: 1.197057
> ...come on, you can do it...
> rtbox:~ #
> /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> Test PASSED
>
> #define ERRORMARGIN 0.1
> ...
> if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
>     || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
>         printf("Test FAILED: sigtimedwait() did not return in "
>                "the required time\n");
>         printf("time_elapsed: %lf\n", time_elapsed);
>         return PTS_FAIL;
> }
>
> Looks hohum to me, but gripe did arrive with patch set, so you get a note.

hohum is a euphemism. That's completely bogus.

The only guarantee a syscall with timers has is: the timer does not fire
early.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 2016-06-17 at 13:26 +, Thomas Gleixner wrote:
> This is the second version of the timer wheel rework series. The first
> series can be found here:
>
>    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
>
> The series is also available in git:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers

FWIW, testing with ltp, I noticed a new failure in logs. It turns out
to be intermittent, but the testcase mostly fails.

rtbox:~ # /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
Test FAILED: sigtimedwait() did not return in the required time
time_elapsed: 1.197057
...come on, you can do it...
rtbox:~ # /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
Test PASSED

#define ERRORMARGIN 0.1
...
if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
    || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
        printf("Test FAILED: sigtimedwait() did not return in "
               "the required time\n");
        printf("time_elapsed: %lf\n", time_elapsed);
        return PTS_FAIL;
}

Looks hohum to me, but gripe did arrive with patch set, so you get a note.

	-Mike
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 12:03 PM, Rik van Riel wrote:
> On Mon, 2016-06-20 at 15:56 +0200, Thomas Gleixner wrote:
> >
> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> >    1000HZ option so datacenter folks can use this and people who don't
> >    care and want better batching for power can use the 4ms thingy.
>
> It might be easy enough to simply re-queue a timer that
> has not expired yet after 37 hours.
>
> How many 37 hour timers will there be outstanding at any
> one time, that expire around the same time?
>
> Chances are, not many at all. In fact, the vast majority
> of them are likely to be deleted long before they ever
> expire.
>
> Timers lasting longer than 37 hours do not seem like
> something worth optimizing for.

I totally agree that these long timers should probably be handled (if
someone really needs them) using an additional set of helpers able to
rearm the timer if it expires 'too soon'.
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 2016-06-20 at 15:56 +0200, Thomas Gleixner wrote:
>
> 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
>    1000HZ option so datacenter folks can use this and people who don't
>    care and want better batching for power can use the 4ms thingy.

It might be easy enough to simply re-queue a timer that
has not expired yet after 37 hours.

How many 37 hour timers will there be outstanding at any
one time, that expire around the same time?

Chances are, not many at all. In fact, the vast majority
of them are likely to be deleted long before they ever
expire.

Timers lasting longer than 37 hours do not seem like
something worth optimizing for.

--
All Rights Reversed.
Re: [patch V2 00/20] timer: Refactor the timer wheel
so is there really an issue? sounds like the KISS principle can apply

On Mon, Jun 20, 2016 at 7:46 AM, Thomas Gleixner wrote:
> On Mon, 20 Jun 2016, Arjan van de Ven wrote:
> > On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
> > >
> > > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> > >    1000HZ option so datacenter folks can use this and people who don't
> > >    care and want better batching for power can use the 4ms thingy.
> >
> > if there really is one user of such long timers... could we possibly
> > make that one robust against early fire of the timer?
> >
> > eg rule is: if you set timers > 37 hours, you need to cope with early
> > timer fire
>
> The only user I found is networking conntrack (5 days). Eric thought it's
> not a big problem if it fires earlier.
>
> Thanks,
>
> 	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 05:13:41PM +0200, Thomas Gleixner wrote:
> On Mon, 20 Jun 2016, Paul E. McKenney wrote:
> > On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> > > This is the second version of the timer wheel rework series. The first
> > > series can be found here:
> > >
> > >    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
> > >
> > > The series is also available in git:
> > >
> > >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers
> >
> > Ran some longer rcutorture tests, and the scripting complained about
> > hangs. This turned out to be due to the 12.5% uncertainty, so I fixed
>
> Is that stuff so sensitive? I'm surprised, because the old slack stuff
> got you 6.25% already.

But didn't you have to ask for slack?

Anyway, rcutorture allows three minutes longer than the duration, and then
kills the test (unless it is actively dumping the ftrace buffer). A
30-minute test does fine either way, but a 60-minute test gets killed with
high probability. Changing to hrtimers makes things work nicely (other
than SRCU), even for 60-minute runs.

I have run ten-hour rcutorture runs with normal completion with the old
timers. Might well be that this switch to hrtimer is needed in some
situations for the old setup. Given that it happens only once per run, it
clearly has little or no performance downside, so I am queueing it
regardless. Well, I will do so once I take care of the arithmetic
limitations that are causing link-time errors on 32-bit systems.

							Thanx, Paul
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> This is the second version of the timer wheel rework series. The first
> series can be found here:
>
>    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
>
> The series is also available in git:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers

Ran some longer rcutorture tests, and the scripting complained about
hangs. This turned out to be due to the 12.5% uncertainty, so I fixed
this by switching the rcutorture stop-test timer to hrtimers. Things are
now working as well as before, with the exception of SRCU, for which I am
getting lots of grace-period stall complaints. This came as a bit of a
surprise. Anyway, I will be reviewing SRCU for timing dependencies.

							Thanx, Paul

> Changes vs. V1:
>
>  - Addressed the review comments of V1
>
>    - Fixed the fallout in tty/metag (noticed by Arjan)
>    - Renamed the hlist helper (noticed by Paolo/George)
>    - Used the proper mask in get_timer_base() (noticed by Richard)
>    - Fixed the inverse state check in internal_add_timer() (noticed by
>      Richard)
>    - Simplified the macro maze, removed wrapper (noticed by George)
>    - Reordered data retrieval in run_timer() (noticed by George)
>
>  - Removed cascading completely
>
>    We have a hard cutoff of expiry times at the capacity of the last
>    wheel level now. Timers which insist on timeouts longer than that,
>    i.e. ~6 days, will expire at the cutoff, i.e. ~6 days. From our data
>    gathering the largest timeouts are 5 days (networking conntrack),
>    which are well in the capacity.
>
>    To achieve this capacity with HZ=1000 without increasing the storage
>    size by another level, we reduced the granularity of the first wheel
>    level from 1ms to 4ms. According to our data, there is no user which
>    relies on that 1ms granularity and 99% of those timers are canceled
>    before expiry.
>
>    As a side effect there is the benefit of better batching in the first
>    level which helps networking to avoid rearming timers in the hotpath.
>
> We gathered more data about performance and batching. Compared to
> mainline the following changes have been observed:
>
>    - The bad outliers in mainline when the timer wheel needs to be
>      forwarded after a long idle sleep are completely gone.
>
>    - The total cpu time used for timer softirq processing is
>      significantly reduced. Depending on the HZ setting and workload
>      this ranges from factor 2 to 6.
>
>    - The average invocation period of the timer softirq on an idle
>      system increases significantly. Depending on the HZ settings and
>      workload this ranges from factor 1.5 to 5. That means that the
>      residency in deep c-states should be improved. Have not yet had
>      time to verify this with the power tools.
>
> Thanks,
>
> 	tglx
>
> ---
>  arch/x86/kernel/apic/x2apic_uv_x.c  |    4
>  arch/x86/kernel/cpu/mcheck/mce.c    |    4
>  block/genhd.c                       |    5
>  drivers/cpufreq/powernv-cpufreq.c   |    5
>  drivers/mmc/host/jz4740_mmc.c       |    2
>  drivers/net/ethernet/tile/tilepro.c |    4
>  drivers/power/bq27xxx_battery.c     |    5
>  drivers/tty/metag_da.c              |    4
>  drivers/tty/mips_ejtag_fdc.c        |    4
>  drivers/usb/host/ohci-hcd.c         |    1
>  drivers/usb/host/xhci.c             |    2
>  include/linux/list.h                |   10
>  include/linux/timer.h               |   30
>  kernel/time/tick-internal.h         |    1
>  kernel/time/tick-sched.c            |   46
>  kernel/time/timer.c                 | 1099 +---
>  lib/random32.c                      |    1
>  net/ipv4/inet_connection_sock.c     |    7
>  net/ipv4/inet_timewait_sock.c       |    5
>  19 files changed, 725 insertions(+), 514 deletions(-)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote: > This is the second version of the timer wheel rework series. The first series > can be found here: > >http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de > > The series is also available in git: > >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers Ran some longer rcutorture tests, and the scripting complained about hangs. This turned out to be due to the 12.5% uncertainty, so I fixed this by switching the rcutorture stop-test timer to hrtimers. Things are now working as well as before, with the exception of SRCU, for which I am getting lots of grace-period stall complaints. This came as a bit of a surprise. Anyway, I will be reviewing SRCU for timing dependencies. Thanx, Paul > Changes vs. V1: > > - Addressed the review comments of V1 > > - Fixed the fallout in tty/metag (noticed by Arjan) > - Renamed the hlist helper (noticed by Paolo/George) > - Used the proper mask in get_timer_base() (noticed by Richard) > - Fixed the inverse state check in internal_add_timer() (noticed by > Richard) > - Simplified the macro maze, removed wrapper (noticed by George) > - Reordered data retrieval in run_timer() (noticed by George) > > - Removed cascading completely > >We have a hard cutoff of expiry times at the capacity of the last wheel >level now. Timers which insist on timeouts longer than that, i.e. ~6days, >will expire at the cutoff, i.e. ~6 days. From our data gathering the >largest timeouts are 5 days (networking contrack), which are well in the >capacity. > >To achieve this capacity with HZ=1000 without increasing the storage size >by another level, we reduced the granularity of the first wheel level from >1ms to 4ms. According to our data, there is no user which relies on that >1ms granularity and 99% of those timers are canceled before expiry. 
>
>   As a side effect there is the benefit of better batching in the first
>   level which helps networking to avoid rearming timers in the hotpath.
>
> We gathered more data about performance and batching. Compared to mainline
> the following changes have been observed:
>
>   - The bad outliers in mainline when the timer wheel needs to be forwarded
>     after a long idle sleep are completely gone.
>
>   - The total cpu time used for timer softirq processing is significantly
>     reduced. Depending on the HZ setting and workload this ranges from
>     factor 2 to 6.
>
>   - The average invocation period of the timer softirq on an idle system
>     increases significantly. Depending on the HZ settings and workload this
>     ranges from factor 1.5 to 5. That means that the residency in deep
>     c-states should be improved. Have not yet had time to verify this with
>     the power tools.
>
> Thanks,
>
>       tglx
>
> ---
>  arch/x86/kernel/apic/x2apic_uv_x.c  |    4
>  arch/x86/kernel/cpu/mcheck/mce.c    |    4
>  block/genhd.c                       |    5
>  drivers/cpufreq/powernv-cpufreq.c   |    5
>  drivers/mmc/host/jz4740_mmc.c       |    2
>  drivers/net/ethernet/tile/tilepro.c |    4
>  drivers/power/bq27xxx_battery.c     |    5
>  drivers/tty/metag_da.c              |    4
>  drivers/tty/mips_ejtag_fdc.c        |    4
>  drivers/usb/host/ohci-hcd.c         |    1
>  drivers/usb/host/xhci.c             |    2
>  include/linux/list.h                |   10
>  include/linux/timer.h               |   30
>  kernel/time/tick-internal.h         |    1
>  kernel/time/tick-sched.c            |   46 -
>  kernel/time/timer.c                 | 1099 +---
>  lib/random32.c                      |    1
>  net/ipv4/inet_connection_sock.c     |    7
>  net/ipv4/inet_timewait_sock.c       |    5
>  19 files changed, 725 insertions(+), 514 deletions(-)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 20 Jun 2016, Paul E. McKenney wrote:
> On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> > This is the second version of the timer wheel rework series. The first
> > series can be found here:
> >
> >    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
> >
> > The series is also available in git:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers
>
> Ran some longer rcutorture tests, and the scripting complained about
> hangs. This turned out to be due to the 12.5% uncertainty, so I fixed

Is that stuff so sensitive? I'm surprised, because the old slack stuff got
you 6.25% already.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 20 Jun 2016, Arjan van de Ven wrote:
> On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
> >
> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> >    1000HZ option so datacenter folks can use this and people who don't
> >    care and want better batching for power can use the 4ms thingy.
>
> if there really is one user of such long timers... could we possibly
> make that one robust against early fire of the timer?
>
> eg rule is: if you set timers > 37 hours, you need to cope with early
> timer fire

The only user I found is networking conntrack (5 days). Eric thought it's
not a big problem if it fires earlier.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
>
> 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
>    1000HZ option so datacenter folks can use this and people who don't
>    care and want better batching for power can use the 4ms thingy.

if there really is one user of such long timers... could we possibly
make that one robust against early fire of the timer?

eg rule is: if you set timers > 37 hours, you need to cope with early
timer fire
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 17 Jun 2016, Eric Dumazet wrote:
> To avoid increasing the probability of such events we would need to have
> at least 4 ms difference between the RTO timer and the delack timer.
>
> Meaning we have to increase both of them and increase P99 latencies of
> RPC workloads.
>
> Maybe a switch to hrtimer would be less risky.
> But I do not know yet if it is doable without a big performance penalty.

That will be a big performance issue. So we have the following choices:

1) Increase the wheel size for HZ=1000. Doable, but an utter waste of space
   and obviously more pointless work when collecting expired timers.

2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
   1000HZ option so datacenter folks can use this, and people who don't
   care and want better batching for power can use the 4ms thingy.

3) Split the wheel granularities. That would leave the first wheel with
   tick granularity and the next 3 with 12.5% worst case, and then for the
   further out timers we'd switch to 25%.

Thoughts?

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
> To achieve this capacity with HZ=1000 without increasing the storage size
> by another level, we reduced the granularity of the first wheel level from
> 1ms to 4ms. According to our data, there is no user which relies on that
> 1ms granularity and 99% of those timers are canceled before expiry.

the only likely problem cases are msleep(1) uses... but we could just map
those to usleep_range(1000, 2000)

(imo we should anyway)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 6:57 AM, Thomas Gleixner wrote:
> On Fri, 17 Jun 2016, Eric Dumazet wrote:
>> > To achieve this capacity with HZ=1000 without increasing the storage
>> > size by another level, we reduced the granularity of the first wheel
>> > level from 1ms to 4ms. According to our data, there is no user which
>> > relies on that 1ms granularity and 99% of those timers are canceled
>> > before expiry.
>>
>> Ah... This might be a problem for people using small TCP RTO timers in
>> datacenters (order of 5 ms)
>> (and small delay ack timers as well, in the order of 4 ms)
>>
>> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>>
>> If we arm a timer for 5 ms, what are the exact consequences ?
>
> The worst case expiry time is 8ms on HZ=1000, as it is on HZ=250
>
>> I fear we might trigger a lot more spurious retransmits.
>>
>> Or maybe I should read the patch series. I'll take some time today.
>
> Maybe just throw it at such a workload and see what happens :)

Well, when network congestion happens in a cluster, and hundreds of
millions of RTO timers fire, adding fuel to the fire, it is a nightmare
already ;)

To avoid increasing the probability of such events we would need to have
at least 4 ms difference between the RTO timer and the delack timer.

Meaning we have to increase both of them and increase P99 latencies of
RPC workloads.

Maybe a switch to hrtimer would be less risky.
But I do not know yet if it is doable without a big performance penalty.
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 17 Jun 2016, Eric Dumazet wrote:
> > To achieve this capacity with HZ=1000 without increasing the storage
> > size by another level, we reduced the granularity of the first wheel
> > level from 1ms to 4ms. According to our data, there is no user which
> > relies on that 1ms granularity and 99% of those timers are canceled
> > before expiry.
>
> Ah... This might be a problem for people using small TCP RTO timers in
> datacenters (order of 5 ms)
> (and small delay ack timers as well, in the order of 4 ms)
>
> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>
> If we arm a timer for 5 ms, what are the exact consequences ?

The worst case expiry time is 8ms on HZ=1000, as it is on HZ=250

> I fear we might trigger a lot more spurious retransmits.
>
> Or maybe I should read the patch series. I'll take some time today.

Maybe just throw it at such a workload and see what happens :)

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
> To achieve this capacity with HZ=1000 without increasing the storage size
> by another level, we reduced the granularity of the first wheel level from
> 1ms to 4ms. According to our data, there is no user which relies on that
> 1ms granularity and 99% of those timers are canceled before expiry.

Ah... This might be a problem for people using small TCP RTO timers in
datacenters (order of 5 ms)
(and small delay ack timers as well, in the order of 4 ms)

TCP/pacing uses high resolution timer in sch_fq.c so no problem there.

If we arm a timer for 5 ms, what are the exact consequences ?

I fear we might trigger a lot more spurious retransmits.

Or maybe I should read the patch series. I'll take some time today.

Thanks !