Re: [patch V2 00/20] timer: Refactor the timer wheel
Hi!

On Sun 2016-06-26 12:21:46, Arjan van de Ven wrote:
> On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek wrote:
> >
> > Umm. I'm not sure if you should be designing kernel...
> >
> > I have alarm clock application. It does sleep(60) many times till its
> > time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> > without some very, very good reason.
>
> I'm fairly sure you shouldn't be designing alarm clock applications!
> Because on busy systems you get random (scheduler) delays added to
> your timer.

I'm pretty sure I should not be designing alarm clock applications, after
looking at the timezone stuff. But the alarm clock from MATE eats 3% CPU
on my cellphone, so I kind of had to.

And yes, I'm aware that scheduler delays would add up. But if it is 79
seconds before the alarm, I do sleep(79), and it would be strange to have
the alarm fire 5 seconds too late.

> Having said that, your example is completely crooked here, sleep()
> does not use these kernel timers, it uses hrtimers instead.
> (hrtimers also have slack, but an alarm clock application that is this
> broken would have the choice to set such slack to 0)
>
> What happened here is that these sigtimedwait timers were actually not
> great; it is just about the only application-visible interface that's
> still in jiffies/HZ, and in the follow-on patch set, Thomas converted
> them properly to hrtimers as well to make them both accurate and
> CONFIG_HZ independent.

So it is going to be fixed, good.

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Sun, Jun 26, 2016 at 12:00 PM, Pavel Machek wrote:
>
> Umm. I'm not sure if you should be designing kernel...
>
> I have alarm clock application. It does sleep(60) many times till its
> time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
> without some very, very good reason.

I'm fairly sure you shouldn't be designing alarm clock applications!
Because on busy systems you get random (scheduler) delays added to
your timer.

Having said that, your example is completely crooked here: sleep()
does not use these kernel timers, it uses hrtimers instead.
(hrtimers also have slack, but an alarm clock application that is this
broken would have the choice to set such slack to 0.)

What happened here is that these sigtimedwait timers were actually not
great; it is just about the only application-visible interface that's
still in jiffies/HZ, and in the follow-on patch set, Thomas converted
them properly to hrtimers as well to make them both accurate and
CONFIG_HZ independent.
Re: [patch V2 00/20] timer: Refactor the timer wheel
Hi!

> > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > to be intermittent, but the testcase mostly fails.
>
> You forgot to cc the LTP folks ...
>
> > rtbox:~ #
> > /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> > Test FAILED: sigtimedwait() did not return in the required time
> > time_elapsed: 1.197057
> > ...come on, you can do it...
> > rtbox:~ #
> > /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> > Test PASSED
> >
> > #define ERRORMARGIN 0.1
> > ...
> > if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
> >     || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
> >         printf("Test FAILED: sigtimedwait() did not return in "
> >                "the required time\n");
> >         printf("time_elapsed: %lf\n", time_elapsed);
> >         return PTS_FAIL;
> > }
> >
> > Looks hohum to me, but gripe did arrive with patch set, so you get a note.
>
> hohum is a euphemism. That's completely bogus.
>
> The only guarantee a syscall with timers has is: timer does not fire
> early.

Umm. I'm not sure if you should be designing kernel...

I have alarm clock application. It does sleep(60) many times till its
time to wake me up. I'll be very angry if sleep(60) takes 65 seconds
without some very, very good reason.

So yes, the man page says this is the only requirement (and did not you
break it earlier in the patchset, with the sleep for 5 days? :-) ), but
no, it is not really the only requirement you have.

You may argue LTP's ERRORMARGIN is too strict. But you can't argue LTP is
completely bogus, and I'd say an error margin of 0.1 second is completely
reasonable (*). If I update my alarm clock application to display seconds,
I really want them with better precision than 0.1 second (*), because 0.1
seconds is already visible with the naked eye.

Best regards,
Pavel

(*) on a reasonably idle system.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 2016-06-22 at 11:06 +0200, Mike Galbraith wrote:
> On Wed, 2016-06-22 at 10:44 +0200, Thomas Gleixner wrote:
> > On Wed, 22 Jun 2016, Mike Galbraith wrote:
> > > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > > to be intermittent, but the testcase mostly fails.
> >
> > You forgot to cc the LTP folks ...
>
> This ain't the only one, it's just new. I'll mention it.
>
> File under FYI/FWIW: I also plugged the set into RT, and nothing fell
> out of local boxen. The below is falling out of my 8 socket box
> though.. maybe a portage booboo.
>
> [ 1503.988863] clocksource: timekeeping watchdog on CPU42: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [ 1504.203800] clocksource: 'hpet' wd_now: 38b55bb wd_last: 8303f269 mask:
> [ 1504.296111] clocksource: 'tsc' cs_now: 3a3aa717794 cs_last: 354624eea7b mask:
> [ 1504.402329] clocksource: Switched to clocksource hpet

Nope, not RT portage booboo. Virgin x86-tip/WIP.timers..

vogelweide:~/:[130]# dmesg|grep clocksource:
[0.00] clocksource: refined-jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1910969940391419 ns
[0.00] clocksource: hpet: mask: 0x max_cycles: 0x, max_idle_ns: 133484882848 ns
[5.608205] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275000 ns
[9.151208] clocksource: Switched to clocksource hpet
[9.485907] clocksource: acpi_pm: mask: 0xff max_cycles: 0xff, max_idle_ns: 2085701024 ns
[ 11.947226] clocksource: tsc: mask: 0x max_cycles: 0x20974986637, max_idle_ns: 440795286310 ns
[ 13.012145] clocksource: Switched to clocksource tsc
[ 434.868215] clocksource: timekeeping watchdog on CPU59: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 434.982251] clocksource: 'hpet' wd_now: 732cdf37 wd_last: df3d99d8 mask:
[ 435.085875] clocksource: 'tsc' cs_now: 16e6780d1eb cs_last: 11326fa576e mask:
[ 435.211249] clocksource: Switched to clocksource hpet
vogelweide:~/:[0]#
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 2016-06-22 at 10:44 +0200, Thomas Gleixner wrote:
> On Wed, 22 Jun 2016, Mike Galbraith wrote:
> > FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> > to be intermittent, but the testcase mostly fails.
>
> You forgot to cc the LTP folks ...

This ain't the only one, it's just new. I'll mention it.

File under FYI/FWIW: I also plugged the set into RT, and nothing fell
out of local boxen. The below is falling out of my 8 socket box
though.. maybe a portage booboo.

[ 1503.988863] clocksource: timekeeping watchdog on CPU42: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 1504.203800] clocksource: 'hpet' wd_now: 38b55bb wd_last: 8303f269 mask:
[ 1504.296111] clocksource: 'tsc' cs_now: 3a3aa717794 cs_last: 354624eea7b mask:
[ 1504.402329] clocksource: Switched to clocksource hpet
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Wed, 22 Jun 2016, Mike Galbraith wrote:
> FWIW, testing with ltp, I noticed a new failure in logs. It turns out
> to be intermittent, but the testcase mostly fails.

You forgot to cc the LTP folks ...

> rtbox:~ #
> /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> Test FAILED: sigtimedwait() did not return in the required time
> time_elapsed: 1.197057
> ...come on, you can do it...
> rtbox:~ #
> /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
> Test PASSED
>
> #define ERRORMARGIN 0.1
> ...
> if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
>     || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
>         printf("Test FAILED: sigtimedwait() did not return in "
>                "the required time\n");
>         printf("time_elapsed: %lf\n", time_elapsed);
>         return PTS_FAIL;
> }
>
> Looks hohum to me, but gripe did arrive with patch set, so you get a note.

hohum is a euphemism. That's completely bogus.

The only guarantee a syscall with timers has is: the timer does not fire
early.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 2016-06-17 at 13:26 +, Thomas Gleixner wrote:
> This is the second version of the timer wheel rework series. The first
> series can be found here:
>
>    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
>
> The series is also available in git:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers

FWIW, testing with ltp, I noticed a new failure in logs. It turns out
to be intermittent, but the testcase mostly fails.

rtbox:~ # /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
Test FAILED: sigtimedwait() did not return in the required time
time_elapsed: 1.197057
...come on, you can do it...
rtbox:~ # /usr/local/ltp/conformance/interfaces/sigtimedwait/sigtimedwait_1-1.run-test
Test PASSED

#define ERRORMARGIN 0.1
...
if ((time_elapsed > SIGTIMEDWAITSEC + ERRORMARGIN)
    || (time_elapsed < SIGTIMEDWAITSEC - ERRORMARGIN)) {
        printf("Test FAILED: sigtimedwait() did not return in "
               "the required time\n");
        printf("time_elapsed: %lf\n", time_elapsed);
        return PTS_FAIL;
}

Looks hohum to me, but gripe did arrive with patch set, so you get a note.

	-Mike
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 12:03 PM, Rik van Riel wrote:
> On Mon, 2016-06-20 at 15:56 +0200, Thomas Gleixner wrote:
> >
> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> >    1000HZ option so datacenter folks can use this and people who don't
> >    care and want better batching for power can use the 4ms thingy.
>
> It might be easy enough to simply re-queue a timer that
> has not expired yet after 37 hours.
>
> How many 37 hour timers will there be outstanding at any
> one time, that expire around the same time?
>
> Chances are, not many at all. In fact, the vast majority
> of them are likely to be deleted long before they ever
> expire.
>
> Timers lasting longer than 37 hours do not seem like
> something worth optimizing for.

I totally agree that these long timers should probably be handled (if
someone really needs them) using an additional set of helpers able to
rearm the timer if it expires 'too soon'.
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 2016-06-20 at 15:56 +0200, Thomas Gleixner wrote:
>
> 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
>    1000HZ option so datacenter folks can use this and people who don't
>    care and want better batching for power can use the 4ms thingy.

It might be easy enough to simply re-queue a timer that
has not expired yet after 37 hours.

How many 37 hour timers will there be outstanding at any
one time, that expire around the same time?

Chances are, not many at all. In fact, the vast majority
of them are likely to be deleted long before they ever
expire.

Timers lasting longer than 37 hours do not seem like
something worth optimizing for.

--
All Rights Reversed.
Re: [patch V2 00/20] timer: Refactor the timer wheel
so is there really an issue? sounds like the KISS principle can apply

On Mon, Jun 20, 2016 at 7:46 AM, Thomas Gleixner wrote:
> On Mon, 20 Jun 2016, Arjan van de Ven wrote:
> > On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
> > >
> > > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> > >    1000HZ option so datacenter folks can use this and people who don't
> > >    care and want better batching for power can use the 4ms thingy.
> >
> > if there really is one user of such long timers... could we possibly
> > make that one robust against early fire of the timer?
> >
> > eg rule is: if you set timers > 37 hours, you need to cope with early
> > timer fire
>
> The only user I found is networking conntrack (5 days). Eric thought it's
> not a big problem if it fires earlier.
>
> Thanks,
>
> 	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 05:13:41PM +0200, Thomas Gleixner wrote:
> On Mon, 20 Jun 2016, Paul E. McKenney wrote:
> > On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> > > This is the second version of the timer wheel rework series. The first
> > > series can be found here:
> > >
> > >    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
> > >
> > > The series is also available in git:
> > >
> > >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers
> >
> > Ran some longer rcutorture tests, and the scripting complained about
> > hangs. This turned out to be due to the 12.5% uncertainty, so I fixed
>
> Is that stuff so sensitive? I'm surprised, because the old slack stuff
> got you 6.25% already.

But didn't you have to ask for slack?

Anyway, rcutorture allows three minutes longer than the duration, and then
kills the test (unless it is actively dumping the ftrace buffer). A
30-minute test does fine either way, but a 60-minute test gets killed with
high probability. Changing to hrtimers makes things work nicely (other
than SRCU), even for 60-minute runs.

I have run ten-hour rcutorture runs with normal completion with the old
timers. Might well be that this switch to hrtimer is needed in some
situations for the old setup. Given that it happens only once per run, it
clearly has little or no performance downside, so I am queueing it
regardless. Well, I will do so once I take care of the arithmetic
limitations that are causing link-time errors on 32-bit systems.

							Thanx, Paul
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> This is the second version of the timer wheel rework series. The first
> series can be found here:
>
>    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
>
> The series is also available in git:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers

Ran some longer rcutorture tests, and the scripting complained about
hangs. This turned out to be due to the 12.5% uncertainty, so I fixed
this by switching the rcutorture stop-test timer to hrtimers. Things are
now working as well as before, with the exception of SRCU, for which I am
getting lots of grace-period stall complaints. This came as a bit of a
surprise. Anyway, I will be reviewing SRCU for timing dependencies.

							Thanx, Paul

> Changes vs. V1:
>
>  - Addressed the review comments of V1
>
>    - Fixed the fallout in tty/metag (noticed by Arjan)
>    - Renamed the hlist helper (noticed by Paolo/George)
>    - Used the proper mask in get_timer_base() (noticed by Richard)
>    - Fixed the inverse state check in internal_add_timer() (noticed by
>      Richard)
>    - Simplified the macro maze, removed wrapper (noticed by George)
>    - Reordered data retrieval in run_timer() (noticed by George)
>
>  - Removed cascading completely
>
>    We have a hard cutoff of expiry times at the capacity of the last
>    wheel level now. Timers which insist on timeouts longer than that,
>    i.e. ~6 days, will expire at the cutoff, i.e. ~6 days. From our data
>    gathering the largest timeouts are 5 days (networking conntrack),
>    which are well in the capacity.
>
>    To achieve this capacity with HZ=1000 without increasing the storage
>    size by another level, we reduced the granularity of the first wheel
>    level from 1ms to 4ms. According to our data, there is no user which
>    relies on that 1ms granularity and 99% of those timers are canceled
>    before expiry.
>
>    As a side effect there is the benefit of better batching in the first
>    level which helps networking to avoid rearming timers in the hotpath.
>
> We gathered more data about performance and batching. Compared to
> mainline the following changes have been observed:
>
>    - The bad outliers in mainline when the timer wheel needs to be
>      forwarded after a long idle sleep are completely gone.
>
>    - The total cpu time used for timer softirq processing is
>      significantly reduced. Depending on the HZ setting and workload
>      this ranges from factor 2 to 6.
>
>    - The average invocation period of the timer softirq on an idle
>      system increases significantly. Depending on the HZ settings and
>      workload this ranges from factor 1.5 to 5. That means that the
>      residency in deep c-states should be improved. Have not yet had
>      time to verify this with the power tools.
>
> Thanks,
>
> 	tglx
>
> ---
>  arch/x86/kernel/apic/x2apic_uv_x.c  |    4
>  arch/x86/kernel/cpu/mcheck/mce.c    |    4
>  block/genhd.c                       |    5
>  drivers/cpufreq/powernv-cpufreq.c   |    5
>  drivers/mmc/host/jz4740_mmc.c       |    2
>  drivers/net/ethernet/tile/tilepro.c |    4
>  drivers/power/bq27xxx_battery.c     |    5
>  drivers/tty/metag_da.c              |    4
>  drivers/tty/mips_ejtag_fdc.c        |    4
>  drivers/usb/host/ohci-hcd.c         |    1
>  drivers/usb/host/xhci.c             |    2
>  include/linux/list.h                |   10
>  include/linux/timer.h               |   30
>  kernel/time/tick-internal.h         |    1
>  kernel/time/tick-sched.c            |   46
>  kernel/time/timer.c                 | 1099 +---
>  lib/random32.c                      |    1
>  net/ipv4/inet_connection_sock.c     |    7
>  net/ipv4/inet_timewait_sock.c       |    5
>  19 files changed, 725 insertions(+), 514 deletions(-)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote: > This is the second version of the timer wheel rework series. The first series > can be found here: > >http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de > > The series is also available in git: > >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers Ran some longer rcutorture tests, and the scripting complained about hangs. This turned out to be due to the 12.5% uncertainty, so I fixed this by switching the rcutorture stop-test timer to hrtimers. Things are now working as well as before, with the exception of SRCU, for which I am getting lots of grace-period stall complaints. This came as a bit of a surprise. Anyway, I will be reviewing SRCU for timing dependencies. Thanx, Paul > Changes vs. V1: > > - Addressed the review comments of V1 > > - Fixed the fallout in tty/metag (noticed by Arjan) > - Renamed the hlist helper (noticed by Paolo/George) > - Used the proper mask in get_timer_base() (noticed by Richard) > - Fixed the inverse state check in internal_add_timer() (noticed by > Richard) > - Simplified the macro maze, removed wrapper (noticed by George) > - Reordered data retrieval in run_timer() (noticed by George) > > - Removed cascading completely > >We have a hard cutoff of expiry times at the capacity of the last wheel >level now. Timers which insist on timeouts longer than that, i.e. ~6days, >will expire at the cutoff, i.e. ~6 days. From our data gathering the >largest timeouts are 5 days (networking contrack), which are well in the >capacity. > >To achieve this capacity with HZ=1000 without increasing the storage size >by another level, we reduced the granularity of the first wheel level from >1ms to 4ms. According to our data, there is no user which relies on that >1ms granularity and 99% of those timers are canceled before expiry. 
>
>   As a side effect there is the benefit of better batching in the first
>   level which helps networking to avoid rearming timers in the hotpath.
>
> We gathered more data about performance and batching. Compared to mainline
> the following changes have been observed:
>
>   - The bad outliers in mainline when the timer wheel needs to be forwarded
>     after a long idle sleep are completely gone.
>
>   - The total cpu time used for timer softirq processing is significantly
>     reduced. Depending on the HZ setting and workload this ranges from
>     factor 2 to 6.
>
>   - The average invocation period of the timer softirq on an idle system
>     increases significantly. Depending on the HZ settings and workload this
>     ranges from factor 1.5 to 5. That means that the residency in deep
>     c-states should be improved. Have not yet had time to verify this with
>     the power tools.
>
> Thanks,
>
>       tglx
>
> ---
>  arch/x86/kernel/apic/x2apic_uv_x.c  |    4
>  arch/x86/kernel/cpu/mcheck/mce.c    |    4
>  block/genhd.c                       |    5
>  drivers/cpufreq/powernv-cpufreq.c   |    5
>  drivers/mmc/host/jz4740_mmc.c       |    2
>  drivers/net/ethernet/tile/tilepro.c |    4
>  drivers/power/bq27xxx_battery.c     |    5
>  drivers/tty/metag_da.c              |    4
>  drivers/tty/mips_ejtag_fdc.c        |    4
>  drivers/usb/host/ohci-hcd.c         |    1
>  drivers/usb/host/xhci.c             |    2
>  include/linux/list.h                |   10
>  include/linux/timer.h               |   30
>  kernel/time/tick-internal.h         |    1
>  kernel/time/tick-sched.c            |   46 -
>  kernel/time/timer.c                 | 1099 +---
>  lib/random32.c                      |    1
>  net/ipv4/inet_connection_sock.c     |    7
>  net/ipv4/inet_timewait_sock.c       |    5
>  19 files changed, 725 insertions(+), 514 deletions(-)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 20 Jun 2016, Paul E. McKenney wrote:
> On Fri, Jun 17, 2016 at 01:26:28PM -, Thomas Gleixner wrote:
> > This is the second version of the timer wheel rework series. The first
> > series can be found here:
> >
> >    http://lkml.kernel.org/r/20160613070440.950649...@linutronix.de
> >
> > The series is also available in git:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.timers
>
> Ran some longer rcutorture tests, and the scripting complained about
> hangs. This turned out to be due to the 12.5% uncertainty, so I fixed

Is that stuff so sensitive? I'm surprised, because the old slack stuff got
you 6.25% already.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, 20 Jun 2016, Arjan van de Ven wrote:
> On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
> >
> > 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
> >    1000HZ option so datacenter folks can use this and people who don't
> >    care and want better batching for power can use the 4ms thingy.
>
> if there really is one user of such long timers... could we possibly
> make that one robust against early fire of the timer?
>
> eg rule is: if you set timers > 37 hours, you need to cope with early
> timer fire

The only user I found is networking conntrack (5 days). Eric thought it's
not a big problem if it fires earlier.

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Mon, Jun 20, 2016 at 6:56 AM, Thomas Gleixner wrote:
>
> 2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
>    1000HZ option so datacenter folks can use this and people who don't
>    care and want better batching for power can use the 4ms thingy.

if there really is one user of such long timers... could we possibly
make that one robust against early fire of the timer?

eg rule is: if you set timers > 37 hours, you need to cope with early
timer fire
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 17 Jun 2016, Eric Dumazet wrote:
> To avoid increasing the probability of such events we would need to have
> at least 4 ms difference between the RTO timer and the delack timer.
>
> Meaning we have to increase both of them and increase P99 latencies of
> RPC workloads.
>
> Maybe a switch to hrtimer would be less risky.
> But I do not know yet if it is doable without a big performance penalty.

That will be a big performance issue. So we have the following choices:

1) Increase the wheel size for HZ=1000. Doable, but an utter waste of space
   and obviously more pointless work when collecting expired timers.

2) Cut off at 37hrs for HZ=1000. We could make this configurable as a
   1000HZ option so datacenter folks can use this, and people who don't
   care and want better batching for power can use the 4ms thingy.

3) Split the wheel granularities. That would leave the first wheel with
   tick granularity and the next 3 with 12.5% worst case, and then for the
   further out timers we'd switch to 25%.

Thoughts?

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
> To achieve this capacity with HZ=1000 without increasing the storage size
> by another level, we reduced the granularity of the first wheel level from
> 1ms to 4ms. According to our data, there is no user which relies on that
> 1ms granularity and 99% of those timers are canceled before expiry.

the only likely problem cases are msleep(1) uses... but we could just map
those to usleep_range(1000, 2000)

(imo we should anyway)
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, Jun 17, 2016 at 6:57 AM, Thomas Gleixner wrote:
> On Fri, 17 Jun 2016, Eric Dumazet wrote:
>> > To achieve this capacity with HZ=1000 without increasing the storage
>> > size by another level, we reduced the granularity of the first wheel
>> > level from 1ms to 4ms. According to our data, there is no user which
>> > relies on that 1ms granularity and 99% of those timers are canceled
>> > before expiry.
>>
>> Ah... This might be a problem for people using small TCP RTO timers in
>> datacenters (order of 5 ms)
>> (and small delay ack timers as well, in the order of 4 ms)
>>
>> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>>
>> If we arm a timer for 5 ms, what are the exact consequences ?
>
> The worst case expiry time is 8ms on HZ=1000, as it is on HZ=250
>
>> I fear we might trigger a lot more spurious retransmits.
>>
>> Or maybe I should read the patch series. I'll take some time today.
>
> Maybe just throw it at such a workload and see what happens :)

Well, when network congestion happens in a cluster, and hundreds of
millions of RTO timers fire, adding fuel to the fire, it is a nightmare
already ;)

To avoid increasing the probability of such events we would need to have
at least 4 ms difference between the RTO timer and the delack timer.

Meaning we have to increase both of them and increase P99 latencies of
RPC workloads.

Maybe a switch to hrtimer would be less risky.
But I do not know yet if it is doable without a big performance penalty.
Re: [patch V2 00/20] timer: Refactor the timer wheel
On Fri, 17 Jun 2016, Eric Dumazet wrote:
> > To achieve this capacity with HZ=1000 without increasing the storage
> > size by another level, we reduced the granularity of the first wheel
> > level from 1ms to 4ms. According to our data, there is no user which
> > relies on that 1ms granularity and 99% of those timers are canceled
> > before expiry.
>
> Ah... This might be a problem for people using small TCP RTO timers in
> datacenters (order of 5 ms)
> (and small delay ack timers as well, in the order of 4 ms)
>
> TCP/pacing uses high resolution timer in sch_fq.c so no problem there.
>
> If we arm a timer for 5 ms, what are the exact consequences ?

The worst case expiry time is 8ms on HZ=1000, as it is on HZ=250

> I fear we might trigger a lot more spurious retransmits.
>
> Or maybe I should read the patch series. I'll take some time today.

Maybe just throw it at such a workload and see what happens :)

Thanks,

	tglx
Re: [patch V2 00/20] timer: Refactor the timer wheel
> To achieve this capacity with HZ=1000 without increasing the storage size
> by another level, we reduced the granularity of the first wheel level from
> 1ms to 4ms. According to our data, there is no user which relies on that
> 1ms granularity and 99% of those timers are canceled before expiry.

Ah... This might be a problem for people using small TCP RTO timers in
datacenters (order of 5 ms)
(and small delay ack timers as well, in the order of 4 ms)

TCP/pacing uses high resolution timer in sch_fq.c so no problem there.

If we arm a timer for 5 ms, what are the exact consequences ?

I fear we might trigger a lot more spurious retransmits.

Or maybe I should read the patch series. I'll take some time today.

Thanks !