Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-02 Thread David Gibson
On Fri, Dec 02, 2016 at 09:36:42AM +0100, Thomas Gleixner wrote:
> On Fri, 2 Dec 2016, David Gibson wrote:
> > On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> > > So I assume that you are talking about a VM which was not scheduled by the
> > > host due to overcommitment (who ever thought that this is a good idea) or
> > > whatever other reason (yes, people were complaining about wreckage caused
> > > by stopping kernels with debuggers) for a long enough time to trigger that
> > > overflow situation. If that's the case then the unsigned conversion will
> > > just make it more unlikely but it still will happen.
> > 
> > It was essentially the stopped by debugger case.  I forget exactly
> > why, but the guest was being explicitly stopped from outside, it
> > wasn't just scheduling lag.  I think it was something in the vicinity
> > of 10 minutes stopped.
> 
> Ok. Debuggers stopping stuff is one issue, but if I understood Liav
> correctly, then he is seing the issue on a heavy loaded machine.

Right.  I can't speak to other situations which might trigger this.

> Liav, can you please describe the scenario in detail? Are you observing
> this on bare metal or in a VM which gets scheduled out long enough or was
> there debugging/hypervisor intervention involved?
> 
> > It's long enough ago that I can't be sure, but I thought we'd tried
> > various different stoppage periods, which should have also triggered
> > the unsigned overflow you're describing, and didn't observe the crash
> > once the change was applied.  Note that there have been other changes
> > to the timekeeping code since then, which might have made a
> > difference.
> > 
> > I agree that it's not reasonable for the guest to be entirely
> > unaffected by such a large stoppage: I'd have no complaints if the
> > guest time was messed up, and/or it spewed warnings.  But complete
> > guest death seems a rather more fragile response to the situation than
> > we'd like.
> 
> Guests death? Is it really dead/crashed or just stuck in that endless loop
> trying to add that huge negative value piecewise?

Well, I don't know.  But the point was it was unusable from the
console, and didn't come back any time soon.

> That's at least what Liav was describing as he mentioned
> __iter_div_u64_rem() explicitely.
> 
> While I'm less worried about debuggers, I worry about the real thing.
> 
> I agree that we should not starve after resume from a debug stop, but in
> that case the least of my worries is time going backwards.
> 
> Though if the signed mult overrun is observable in a live system, then we
> need to worry about time going backwards even with the unsigned
> conversion. Simply because once we fixed the starvation issue people with
> insane enough setups will trigger the unsigned overrun and complain about
> time going backwards.
> 
> Thanks,
> 
>   tglx
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-02 Thread David Gibson
On Fri, Dec 02, 2016 at 09:36:42AM +0100, Thomas Gleixner wrote:
> On Fri, 2 Dec 2016, David Gibson wrote:
> > On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> > > So I assume that you are talking about a VM which was not scheduled by the
> > > host due to overcommitment (who ever thought that this is a good idea) or
> > > whatever other reason (yes, people were complaining about wreckage caused
> > > by stopping kernels with debuggers) for a long enough time to trigger that
> > > overflow situation. If that's the case then the unsigned conversion will
> > > just make it more unlikely but it still will happen.
> > 
> > It was essentially the stopped by debugger case.  I forget exactly
> > why, but the guest was being explicitly stopped from outside, it
> > wasn't just scheduling lag.  I think it was something in the vicinity
> > of 10 minutes stopped.
> 
> Ok. Debuggers stopping stuff is one issue, but if I understood Liav
> correctly, then he is seing the issue on a heavy loaded machine.

Right.  I can't speak to other situations which might trigger this.

> Liav, can you please describe the scenario in detail? Are you observing
> this on bare metal or in a VM which gets scheduled out long enough or was
> there debugging/hypervisor intervention involved?
> 
> > It's long enough ago that I can't be sure, but I thought we'd tried
> > various different stoppage periods, which should have also triggered
> > the unsigned overflow you're describing, and didn't observe the crash
> > once the change was applied.  Note that there have been other changes
> > to the timekeeping code since then, which might have made a
> > difference.
> > 
> > I agree that it's not reasonable for the guest to be entirely
> > unaffected by such a large stoppage: I'd have no complaints if the
> > guest time was messed up, and/or it spewed warnings.  But complete
> > guest death seems a rather more fragile response to the situation than
> > we'd like.
> 
> Guests death? Is it really dead/crashed or just stuck in that endless loop
> trying to add that huge negative value piecewise?

Well, I don't know.  But the point was it was unusable from the
console, and didn't come back any time soon.

> That's at least what Liav was describing as he mentioned
> __iter_div_u64_rem() explicitely.
> 
> While I'm less worried about debuggers, I worry about the real thing.
> 
> I agree that we should not starve after resume from a debug stop, but in
> that case the least of my worries is time going backwards.
> 
> Though if the signed mult overrun is observable in a live system, then we
> need to worry about time going backwards even with the unsigned
> conversion. Simply because once we fixed the starvation issue people with
> insane enough setups will trigger the unsigned overrun and complain about
> time going backwards.
> 
> Thanks,
> 
>   tglx
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-02 Thread Thomas Gleixner
On Fri, 2 Dec 2016, David Gibson wrote:
> On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> > So I assume that you are talking about a VM which was not scheduled by the
> > host due to overcommitment (who ever thought that this is a good idea) or
> > whatever other reason (yes, people were complaining about wreckage caused
> > by stopping kernels with debuggers) for a long enough time to trigger that
> > overflow situation. If that's the case then the unsigned conversion will
> > just make it more unlikely but it still will happen.
> 
> It was essentially the stopped by debugger case.  I forget exactly
> why, but the guest was being explicitly stopped from outside, it
> wasn't just scheduling lag.  I think it was something in the vicinity
> of 10 minutes stopped.

Ok. Debuggers stopping stuff is one issue, but if I understood Liav
correctly, then he is seing the issue on a heavy loaded machine.

Liav, can you please describe the scenario in detail? Are you observing
this on bare metal or in a VM which gets scheduled out long enough or was
there debugging/hypervisor intervention involved?

> It's long enough ago that I can't be sure, but I thought we'd tried
> various different stoppage periods, which should have also triggered
> the unsigned overflow you're describing, and didn't observe the crash
> once the change was applied.  Note that there have been other changes
> to the timekeeping code since then, which might have made a
> difference.
> 
> I agree that it's not reasonable for the guest to be entirely
> unaffected by such a large stoppage: I'd have no complaints if the
> guest time was messed up, and/or it spewed warnings.  But complete
> guest death seems a rather more fragile response to the situation than
> we'd like.

Guests death? Is it really dead/crashed or just stuck in that endless loop
trying to add that huge negative value piecewise?

That's at least what Liav was describing as he mentioned
__iter_div_u64_rem() explicitely.

While I'm less worried about debuggers, I worry about the real thing.

I agree that we should not starve after resume from a debug stop, but in
that case the least of my worries is time going backwards.

Though if the signed mult overrun is observable in a live system, then we
need to worry about time going backwards even with the unsigned
conversion. Simply because once we fixed the starvation issue people with
insane enough setups will trigger the unsigned overrun and complain about
time going backwards.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-02 Thread Thomas Gleixner
On Fri, 2 Dec 2016, David Gibson wrote:
> On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> > So I assume that you are talking about a VM which was not scheduled by the
> > host due to overcommitment (who ever thought that this is a good idea) or
> > whatever other reason (yes, people were complaining about wreckage caused
> > by stopping kernels with debuggers) for a long enough time to trigger that
> > overflow situation. If that's the case then the unsigned conversion will
> > just make it more unlikely but it still will happen.
> 
> It was essentially the stopped by debugger case.  I forget exactly
> why, but the guest was being explicitly stopped from outside, it
> wasn't just scheduling lag.  I think it was something in the vicinity
> of 10 minutes stopped.

Ok. Debuggers stopping stuff is one issue, but if I understood Liav
correctly, then he is seing the issue on a heavy loaded machine.

Liav, can you please describe the scenario in detail? Are you observing
this on bare metal or in a VM which gets scheduled out long enough or was
there debugging/hypervisor intervention involved?

> It's long enough ago that I can't be sure, but I thought we'd tried
> various different stoppage periods, which should have also triggered
> the unsigned overflow you're describing, and didn't observe the crash
> once the change was applied.  Note that there have been other changes
> to the timekeeping code since then, which might have made a
> difference.
> 
> I agree that it's not reasonable for the guest to be entirely
> unaffected by such a large stoppage: I'd have no complaints if the
> guest time was messed up, and/or it spewed warnings.  But complete
> guest death seems a rather more fragile response to the situation than
> we'd like.

Guests death? Is it really dead/crashed or just stuck in that endless loop
trying to add that huge negative value piecewise?

That's at least what Liav was describing as he mentioned
__iter_div_u64_rem() explicitely.

While I'm less worried about debuggers, I worry about the real thing.

I agree that we should not starve after resume from a debug stop, but in
that case the least of my worries is time going backwards.

Though if the signed mult overrun is observable in a live system, then we
need to worry about time going backwards even with the unsigned
conversion. Simply because once we fixed the starvation issue people with
insane enough setups will trigger the unsigned overrun and complain about
time going backwards.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread David Gibson
On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> On Thu, 1 Dec 2016, David Gibson wrote:
> > On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> > > On Wed, 30 Nov 2016, David Gibson wrote:
> > > > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > > > If we have legitimate use cases with a negative delta, then this patch
> > > > > breaks them no matter what. See the basic C course section in the 
> > > > > second
> > > > > link.
> > > > 
> > > > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > > > every case - just to make the consequences less bad if something goes
> > > > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > > > time going backwards tends to cause things to go horribly, horribly
> > > > wrong - which was why I spotted this in the first place.
> > > 
> > > I completely understand the intention.
> > > 
> > > We _cannot_ make that whole thing unsigned when it is not 100% clear
> > > that there is no legitimate caller which hands in a negative delta and
> > > rightfully expects to get a negative nanoseconds value handed back.
> > 
> > But.. delta is a cycle_t, which is typedef'd to u64, so how could it
> > be negative?
> 
> Indeed. To be honest I did not bother to look that up and for some reason I
> was assuming that it's a s64. :(
> 
> So yes, we can make all this unsigned and all worries about negative deltas
> are moot.
> 
> But we really should get rid of that cycle_t typedef and simply use u64 and
> be done with it. All this typedeffery for no value is just causing
> confusion. I'm very well able to confuse myself, so I don't need extra
> stimulus.
> 
> > This is why I believed my original version (35a4933) to be safe - it
> > was merely removing a signed intermediate from what was essentially an
> > unsigned calculation (technically the output was signed, but the right
> > shift means that's not relevant).
> > 
> > > If someone sits down and proves that this cannot happen there is no reason
> > > to hold that off.
> > > 
> > > But that still does not solve the underlying root cause. Assume the
> > > following:
> > > 
> > >   T1 = base + to_nsec(delta1)
> > > 
> > >where delta1 is big, but the multiplication does not overflow 64bit
> > > 
> > >   Now wait a bit and do:
> > >   
> > >   T2 = base + to_nsec(delta2)
> > > 
> > >now delta2 is big enough, so the  multiplication does overflow 
> > > 64bit
> > >now delta2 is big enough to overflow 64bit with the multiplication.
> > > 
> > >   The result is T2 < T1, i.e. time goes backwards.
> > 
> > Hm, I see.  Do we ever actually update time that way (at least core
> > system time), rather than using the last result as a base?
> 
> It's:
> 
> delta = read(clocksoure) - last_update;
> T = base + to_nsec(delta)
> 
> and in the update function we do:
> 
> now = read(clocksoure);
> delta = now - last_update;
> last_update = now;
> base += to_ns(delta);
> 
> Usually the update happens once per tick and if all cpus are idle we have a
> safe guard which makes sure that the update happens _before_ we run into
> the overflow situation or into a wraparound of the clocksource, which ever
> comes first.

Ah, so base won't go backwards, but T could.  I guess that's what John
means about going backwards only within one interval.  I don't know
the timekeeping code well enough to know how bad the consequences of
that will be.

> > It does seem like the safer approach might be to clamp the result in
> > case of overflow, though.
> 
> Right, but clamping has its own issues. See below.

Doubtless :/.

> > > All what the unsigned conversion does is to procrastinate the problem by a
> > > factor of 2. So instead of failing after 10 seconds we fail after 20
> > > seconds. And just because you never observed the 20 seconds problem it 
> > > does
> > > not go away magically.
> > 
> > At least in the case I was observing I'm pretty sure we weren't
> > updating time that way - we always used a delta from the last value,
> > so to_nsec() returning always positive was enough to make time not go
> > backwards.
> 
> See above. But in case of failure the delta to the last update value was
> big enough to overflow into negative space. Making it unsigned just
> happened to 'work' because the stop time of the VM was not large enough to
> trigger the unsigned mult overflow.
> 
> > > The proper solution is to figure out WHY we are running into that 
> > > situation
> > > at all. So far all I have seen are symptom reports and fairy tales about
> > > ftp connections, but no real root cause analysis.
> > 
> > In the case I hit, it was due to running in a VM that had been stopped
> > for a substantial amount of time, so nothing that's actually under the
> > guest kernel's control.  The bug-as-reported was that if the VM was
> > suspended for too long it would blow up immediately upon resume.
> 
> Suspended as in suspend/resume? 

Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread David Gibson
On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> On Thu, 1 Dec 2016, David Gibson wrote:
> > On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> > > On Wed, 30 Nov 2016, David Gibson wrote:
> > > > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > > > If we have legitimate use cases with a negative delta, then this patch
> > > > > breaks them no matter what. See the basic C course section in the 
> > > > > second
> > > > > link.
> > > > 
> > > > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > > > every case - just to make the consequences less bad if something goes
> > > > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > > > time going backwards tends to cause things to go horribly, horribly
> > > > wrong - which was why I spotted this in the first place.
> > > 
> > > I completely understand the intention.
> > > 
> > > We _cannot_ make that whole thing unsigned when it is not 100% clear
> > > that there is no legitimate caller which hands in a negative delta and
> > > rightfully expects to get a negative nanoseconds value handed back.
> > 
> > But.. delta is a cycle_t, which is typedef'd to u64, so how could it
> > be negative?
> 
> Indeed. To be honest I did not bother to look that up and for some reason I
> was assuming that it's a s64. :(
> 
> So yes, we can make all this unsigned and all worries about negative deltas
> are moot.
> 
> But we really should get rid of that cycle_t typedef and simply use u64 and
> be done with it. All this typedeffery for no value is just causing
> confusion. I'm very well able to confuse myself, so I don't need extra
> stimulus.
> 
> > This is why I believed my original version (35a4933) to be safe - it
> > was merely removing a signed intermediate from what was essentially an
> > unsigned calculation (technically the output was signed, but the right
> > shift means that's not relevant).
> > 
> > > If someone sits down and proves that this cannot happen there is no reason
> > > to hold that off.
> > > 
> > > But that still does not solve the underlying root cause. Assume the
> > > following:
> > > 
> > >   T1 = base + to_nsec(delta1)
> > > 
> > >where delta1 is big, but the multiplication does not overflow 64bit
> > > 
> > >   Now wait a bit and do:
> > >   
> > >   T2 = base + to_nsec(delta2)
> > > 
> > >now delta2 is big enough, so the  multiplication does overflow 
> > > 64bit
> > >now delta2 is big enough to overflow 64bit with the multiplication.
> > > 
> > >   The result is T2 < T1, i.e. time goes backwards.
> > 
> > Hm, I see.  Do we ever actually update time that way (at least core
> > system time), rather than using the last result as a base?
> 
> It's:
> 
> delta = read(clocksoure) - last_update;
> T = base + to_nsec(delta)
> 
> and in the update function we do:
> 
> now = read(clocksoure);
> delta = now - last_update;
> last_update = now;
> base += to_ns(delta);
> 
> Usually the update happens once per tick and if all cpus are idle we have a
> safe guard which makes sure that the update happens _before_ we run into
> the overflow situation or into a wraparound of the clocksource, which ever
> comes first.

Ah, so base won't go backwards, but T could.  I guess that's what John
means about going backwards only within one interval.  I don't know
the timekeeping code well enough to know how bad the consequences of
that will be.

> > It does seem like the safer approach might be to clamp the result in
> > case of overflow, though.
> 
> Right, but clamping has its own issues. See below.

Doubtless :/.

> > > All what the unsigned conversion does is to procrastinate the problem by a
> > > factor of 2. So instead of failing after 10 seconds we fail after 20
> > > seconds. And just because you never observed the 20 seconds problem it 
> > > does
> > > not go away magically.
> > 
> > At least in the case I was observing I'm pretty sure we weren't
> > updating time that way - we always used a delta from the last value,
> > so to_nsec() returning always positive was enough to make time not go
> > backwards.
> 
> See above. But in case of failure the delta to the last update value was
> big enough to overflow into negative space. Making it unsigned just
> happened to 'work' because the stop time of the VM was not large enough to
> trigger the unsigned mult overflow.
> 
> > > The proper solution is to figure out WHY we are running into that 
> > > situation
> > > at all. So far all I have seen are symptom reports and fairy tales about
> > > ftp connections, but no real root cause analysis.
> > 
> > In the case I hit, it was due to running in a VM that had been stopped
> > for a substantial amount of time, so nothing that's actually under the
> > guest kernel's control.  The bug-as-reported was that if the VM was
> > suspended for too long it would blow up immediately upon resume.
> 
> Suspended as in suspend/resume? 

Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:
> Yes, you're right here and apologies, as I wasn't being precise.  In
> this case time does go backward, but its limited to within the current
> interval (just as it would be with a narrow clocksource wrapping
> fully). But without this patch, when the overflow occurs, if the
> signed bit is set, the signed shift pulls the sign bits down, the time
> can go backwards far beyond the current interval, which causes major
> wreckage.

Backwards is backwards, no matter how much. Depending on the computation
which sees the timejump you can end up with major crap as well. Think NTP,
PTP, PPS or whatever is able to tweak timekeeping in really bad ways.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:
> Yes, you're right here and apologies, as I wasn't being precise.  In
> this case time does go backward, but its limited to within the current
> interval (just as it would be with a narrow clocksource wrapping
> fully). But without this patch, when the overflow occurs, if the
> signed bit is set, the signed shift pulls the sign bits down, the time
> can go backwards far beyond the current interval, which causes major
> wreckage.

Backwards is backwards, no matter how much. Depending on the computation
which sees the timejump you can end up with major crap as well. Think NTP,
PTP, PPS or whatever is able to tweak timekeeping in really bad ways.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 2:44 PM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, John Stultz wrote:
>
>> On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
>> > On Thu, 1 Dec 2016, John Stultz wrote:
>> >> I would also suggest:
>> >> 3) If the systems are halted for longer then the timekeeping core
>> >> expects, the system will "miss" or "lose" some portion of that halted
>> >> time, but otherwise the system will function properly.  Which is the
>> >> result with this patch.
>> >
>> > Wrong. This is not the result with this patch.
>> >
>> > If the time advances enough to overflow the unsigned mult, which is
>> > entirely possible as it takes just twice the time of the negative overflow,
>> > then time will go backwards again and that's not 'miss' or 'lose', that's
>> > just broken.
>>
>> Eh? If you overflow the 64bits on the mult, the shift (which is likely
>> large if you're actually hitting the overflow) brings the value back
>> down to a smaller value. Time doesn't go backwards, its just smaller
>> then it ought to be (since the high bits were lost).
>
> WTF?
>
> If the mult overflows, what on earth gurantees that any of the upper bits
> is set?
>
> A very simple example:
>
> T1:
>u64 delta = 0x10 - 1;
>u64 mult  = 0x1000;
>u64 res;
>
>res = delta * mult;
>
> ==> res == 0xf000
>
> T2:
>u64 delta = 0x10;
>u64 mult  = 0x1000;
>u64 res;
>
>res = delta * mult;
>
> ==> res == 0
>
> because delta * mult == 1 << 64
>
> Ergo: T2 < T1, AKA: Time goes backwards.

Yes, you're right here and apologies, as I wasn't being precise.  In
this case time does go backward, but its limited to within the current
interval (just as it would be with a narrow clocksource wrapping
fully). But without this patch, when the overflow occurs, if the
signed bit is set, the signed shift pulls the sign bits down, the time
can go backwards far beyond the current interval, which causes major
wreckage.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 2:44 PM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, John Stultz wrote:
>
>> On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
>> > On Thu, 1 Dec 2016, John Stultz wrote:
>> >> I would also suggest:
>> >> 3) If the systems are halted for longer then the timekeeping core
>> >> expects, the system will "miss" or "lose" some portion of that halted
>> >> time, but otherwise the system will function properly.  Which is the
>> >> result with this patch.
>> >
>> > Wrong. This is not the result with this patch.
>> >
>> > If the time advances enough to overflow the unsigned mult, which is
>> > entirely possible as it takes just twice the time of the negative overflow,
>> > then time will go backwards again and that's not 'miss' or 'lose', that's
>> > just broken.
>>
>> Eh? If you overflow the 64bits on the mult, the shift (which is likely
>> large if you're actually hitting the overflow) brings the value back
>> down to a smaller value. Time doesn't go backwards, its just smaller
>> then it ought to be (since the high bits were lost).
>
> WTF?
>
> If the mult overflows, what on earth gurantees that any of the upper bits
> is set?
>
> A very simple example:
>
> T1:
>u64 delta = 0x10 - 1;
>u64 mult  = 0x1000;
>u64 res;
>
>res = delta * mult;
>
> ==> res == 0xf000
>
> T2:
>u64 delta = 0x10;
>u64 mult  = 0x1000;
>u64 res;
>
>res = delta * mult;
>
> ==> res == 0
>
> because delta * mult == 1 << 64
>
> Ergo: T2 < T1, AKA: Time goes backwards.

Yes, you're right here and apologies, as I wasn't being precise.  In
this case time does go backward, but its limited to within the current
interval (just as it would be with a narrow clocksource wrapping
fully). But without this patch, when the overflow occurs, if the
signed bit is set, the signed shift pulls the sign bits down, the time
can go backwards far beyond the current interval, which causes major
wreckage.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:

> On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
> > On Thu, 1 Dec 2016, John Stultz wrote:
> >> I would also suggest:
> >> 3) If the systems are halted for longer then the timekeeping core
> >> expects, the system will "miss" or "lose" some portion of that halted
> >> time, but otherwise the system will function properly.  Which is the
> >> result with this patch.
> >
> > Wrong. This is not the result with this patch.
> >
> > If the time advances enough to overflow the unsigned mult, which is
> > entirely possible as it takes just twice the time of the negative overflow,
> > then time will go backwards again and that's not 'miss' or 'lose', that's
> > just broken.
> 
> Eh? If you overflow the 64bits on the mult, the shift (which is likely
> large if you're actually hitting the overflow) brings the value back
> down to a smaller value. Time doesn't go backwards, its just smaller
> then it ought to be (since the high bits were lost).

WTF?

If the mult overflows, what on earth gurantees that any of the upper bits
is set?

A very simple example:

T1:
   u64 delta = 0x10 - 1;
   u64 mult  = 0x1000;
   u64 res;

   res = delta * mult;

==> res == 0xf000

T2:
   u64 delta = 0x10;
   u64 mult  = 0x1000;
   u64 res;

   res = delta * mult;

==> res == 0

because delta * mult == 1 << 64

Ergo: T2 < T1, AKA: Time goes backwards.

Maybe it's just me not understanding how the bits are set by the following
shift

> > If we want to prevent that, then we either have to clamp the delta value,
> > which is the worst choice or use 128bit math to avoid the overflow.
> 
> I'm not convinced yet either of these approaches are really needed.

Then please explain how you solve the issue without time going backwards
and not impacting the fast path.

> >> I'm not sure if its really worth trying to recover that time or be
> >> perfect in those situations. Especially since on narrow clocksources
> >> you'll have the same result.
> >
> > We can deal with the 64bit overflow at least for wide clocksources which
> > all virtualizaton infected architectures provide in a sane way.
> 
> Another approach would be to push back on the virtualization
> environments to step in and virtualize a solution if they've idled a
> host for too long. They could do like the old tick-based
> virtualization environments used to and trigger a few timer interrupts
> while slowly removing a fake negative clocksource offset to allow time
> to catch up more normally after a long stall.

And that's going to happen after we retired, right?

Aside of that it's just silly hackery and wont ever work reliably because
there is no guarantee that the guest can handle the interrupts _before_ it
trips over the time going backwards issue. You can call ktime_get() in
interrupt disabled code.

> Or they could require clocksources that have smaller shift values to
> allow longer idle periods.

Could require? You have to do that in the guest kernel for the price of
less accuracy. The hypervisor wont help with that.

> > For bare metal systems with narrow clocksources the whole issue is non
> > existant we can make the 128bit math depend on both a config switch and a
> > static key, so bare metal will not have to take the burden.
> 
> Bare metal machines also sometimes run virtualization. I'm not sure
> the two are usefully exclusive.

Bare metal does not have the problem, whether the system is used as a
hypervisor or not. The guests CANNOT prevent the host from running the tick
interrupt, but the host very well can prevent the guest from running.

If you are talking about S390/PPC style hypervisors which pretend that
Linux is running on bare metal, then yes Linux is still a guest and prone
to the same issue, if that hypervisor supports overcommitment and is silly
enough to keep the guests scheduled out long enough.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:

> On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
> > On Thu, 1 Dec 2016, John Stultz wrote:
> >> I would also suggest:
> >> 3) If the systems are halted for longer then the timekeeping core
> >> expects, the system will "miss" or "lose" some portion of that halted
> >> time, but otherwise the system will function properly.  Which is the
> >> result with this patch.
> >
> > Wrong. This is not the result with this patch.
> >
> > If the time advances enough to overflow the unsigned mult, which is
> > entirely possible as it takes just twice the time of the negative overflow,
> > then time will go backwards again and that's not 'miss' or 'lose', that's
> > just broken.
> 
> Eh? If you overflow the 64bits on the mult, the shift (which is likely
> large if you're actually hitting the overflow) brings the value back
> down to a smaller value. Time doesn't go backwards, its just smaller
> then it ought to be (since the high bits were lost).

WTF?

If the mult overflows, what on earth gurantees that any of the upper bits
is set?

A very simple example:

T1:
   u64 delta = 0x10 - 1;
   u64 mult  = 0x1000;
   u64 res;

   res = delta * mult;

==> res == 0xf000

T2:
   u64 delta = 0x10;
   u64 mult  = 0x1000;
   u64 res;

   res = delta * mult;

==> res == 0

because delta * mult == 1 << 64

Ergo: T2 < T1, AKA: Time goes backwards.

Maybe it's just me not understanding how the bits are set by the following
shift

> > If we want to prevent that, then we either have to clamp the delta value,
> > which is the worst choice or use 128bit math to avoid the overflow.
> 
> I'm not convinced yet either of these approaches are really needed.

Then please explain how you solve the issue without time going backwards
and not impacting the fast path.

> >> I'm not sure if its really worth trying to recover that time or be
> >> perfect in those situations. Especially since on narrow clocksources
> >> you'll have the same result.
> >
> > We can deal with the 64bit overflow at least for wide clocksources which
> > all virtualizaton infected architectures provide in a sane way.
> 
> Another approach would be to push back on the virtualization
> environments to step in and virtualize a solution if they've idled a
> host for too long. They could do like the old tick-based
> virtualization environments used to and trigger a few timer interrupts
> while slowly removing a fake negative clocksource offset to allow time
> to catch up more normally after a long stall.

And that's going to happen after we retired, right?

Aside of that it's just silly hackery and wont ever work reliably because
there is no guarantee that the guest can handle the interrupts _before_ it
trips over the time going backwards issue. You can call ktime_get() in
interrupt disabled code.

> Or they could require clocksources that have smaller shift values to
> allow longer idle periods.

Could require? You have to do that in the guest kernel for the price of
less accuracy. The hypervisor wont help with that.

> > For bare metal systems with narrow clocksources the whole issue is non
> > existant we can make the 128bit math depend on both a config switch and a
> > static key, so bare metal will not have to take the burden.
> 
> Bare metal machines also sometimes run virtualization. I'm not sure
> the two are usefully exclusive.

Bare metal does not have the problem, whether the system is used as a
hypervisor or not. The guests CANNOT prevent the host from running the tick
interrupt, but the host very well can prevent the guest from running.

If you are talking about S390/PPC style hypervisors which pretend that
Linux is running on bare metal, then yes Linux is still a guest and prone
to the same issue, if that hypervisor supports overcommitment and is silly
enough to keep the guests scheduled out long enough.

Thanks,

tglx




Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, John Stultz wrote:
>> I would also suggest:
>> 3) If the systems are halted for longer then the timekeeping core
>> expects, the system will "miss" or "lose" some portion of that halted
>> time, but otherwise the system will function properly.  Which is the
>> result with this patch.
>
> Wrong. This is not the result with this patch.
>
> If the time advances enough to overflow the unsigned mult, which is
> entirely possible as it takes just twice the time of the negative overflow,
> then time will go backwards again and that's not 'miss' or 'lose', that's
> just broken.

Eh? If you overflow the 64bits on the mult, the shift (which is likely
large if you're actually hitting the overflow) brings the value back
down to a smaller value. Time doesn't go backwards, its just smaller
then it ought to be (since the high bits were lost).

> If we want to prevent that, then we either have to clamp the delta value,
> which is the worst choice or use 128bit math to avoid the overflow.

I'm not convinced yet either of these approaches are really needed.

>> I'm not sure if its really worth trying to recover that time or be
>> perfect in those situations. Especially since on narrow clocksources
>> you'll have the same result.
>
> We can deal with the 64bit overflow at least for wide clocksources which
> all virtualizaton infected architectures provide in a sane way.

Another approach would be to push back on the virtualization
environments to step in and virtualize a solution if they've idled a
host for too long. They could do like the old tick-based
virtualization environments used to and trigger a few timer interrupts
while slowly removing a fake negative clocksource offset to allow time
to catch up more normally after a long stall.

Or they could require clocksources that have smaller shift values to
allow longer idle periods.

> For bare metal systems with narrow clocksources the whole issue is non
> existant we can make the 128bit math depend on both a config switch and a
> static key, so bare metal will not have to take the burden.

Bare metal machines also sometimes run virtualization. I'm not sure
the two are usefully exclusive.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, John Stultz wrote:
>> I would also suggest:
>> 3) If the systems are halted for longer then the timekeeping core
>> expects, the system will "miss" or "lose" some portion of that halted
>> time, but otherwise the system will function properly.  Which is the
>> result with this patch.
>
> Wrong. This is not the result with this patch.
>
> If the time advances enough to overflow the unsigned mult, which is
> entirely possible as it takes just twice the time of the negative overflow,
> then time will go backwards again and that's not 'miss' or 'lose', that's
> just broken.

Eh? If you overflow the 64bits on the mult, the shift (which is likely
large if you're actually hitting the overflow) brings the value back
down to a smaller value. Time doesn't go backwards, its just smaller
then it ought to be (since the high bits were lost).

> If we want to prevent that, then we either have to clamp the delta value,
> which is the worst choice or use 128bit math to avoid the overflow.

I'm not convinced yet either of these approaches are really needed.

>> I'm not sure if its really worth trying to recover that time or be
>> perfect in those situations. Especially since on narrow clocksources
>> you'll have the same result.
>
> We can deal with the 64bit overflow at least for wide clocksources which
> all virtualizaton infected architectures provide in a sane way.

Another approach would be to push back on the virtualization
environments to step in and virtualize a solution if they've idled a
host for too long. They could do like the old tick-based
virtualization environments used to and trigger a few timer interrupts
while slowly removing a fake negative clocksource offset to allow time
to catch up more normally after a long stall.

Or they could require clocksources that have smaller shift values to
allow longer idle periods.

> For bare metal systems with narrow clocksources the whole issue is non
> existant we can make the 128bit math depend on both a config switch and a
> static key, so bare metal will not have to take the burden.

Bare metal machines also sometimes run virtualization. I'm not sure
the two are usefully exclusive.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:
> On Thu, Dec 1, 2016 at 3:59 AM, Thomas Gleixner  wrote:
> > So yes, we can make all this unsigned and all worries about negative deltas
> > are moot.
> 
> Sorry for the slow response, and thanks David, for stepping in here.
> 
> So apologies for not rewriting the commit message, but this is the
> reason I came around on this patch. I didn't see negative deltas as
> valid and so moving to u64 seemed proper.

I understand that, but the changelog is just a fairy tale and not really
helpful in explaining WHY this is the right thing to do.

Aside of that just making that single point u64 is just a sloppy
hack. Either we clean up the whole chain or leave it as is.

And that's something I was asking for from the very beginning, but all I
got so far was handwaving and changelogs which are worse than no changelog
at all.

> > But we really should get rid of that cycle_t typedef and simply use u64 and
> > be done with it. All this typedeffery for no value is just causing
> > confusion. I'm very well able to confuse myself, so I don't need extra
> > stimulus.
> 
> Yea, it can be obscuring. However I worry if we just have a bunch of
> u64s around folks (or maybe just me :) will mix up which are cycles
> and which are nanoseconds more easily.

Well, I really prefer non obscure data types and having:

  u64 nsec, cycles;

makes it pretty clear.

> > We have two options to deal with the issue for wide enough clocksources:
> >
> > 1) Checking that delta is less than clocksource->max_cycles.
> >
> >I really hate this because we invested a lot of thoughts to squeeze
> >everything we need for the time getters hotpath into a single cache
> >line. Accessing clocksource->max_cycles forces us to touch another
> >one. Bah!
> >
> >Aside of that what guarantees that we never run into a situation where
> >something doing timekeeping updates (NTP, PTP, PPS ...) uses such a
> >clamped value and comes to completely bogus conclusions? Are we going to
> >analyze and fixup all of that in order to prevent such wreckage?
> >
> >I seriously doubt that given the fact, that nobody sat down and analyzed
> >the signed/unsigned issue proper, which is way less complex.
> >
> > 2) Use mul_u64_u32_shr()
> >
> >That works without an extra cache line, but it's more expensive in terms
> >of text size especially on architectures which do not support the mul64
> >expansion to 128bit natively.
> >
> >But that seems like the most robust solution. We can be clever and make
> >this conditional on both a configuration switch and a static key which
> >can be turned on by guests. I'll send out a RFC series later today.
> >
> > Yet another proof that virtualization is creating more problems than it
> > solves.
> 
> I would also suggest:
> 3) If the systems are halted for longer then the timekeeping core
> expects, the system will "miss" or "lose" some portion of that halted
> time, but otherwise the system will function properly.  Which is the
> result with this patch.

Wrong. This is not the result with this patch.

If the time advances enough to overflow the unsigned mult, which is
entirely possible as it takes just twice the time of the negative overflow,
then time will go backwards again and that's not 'miss' or 'lose', that's
just broken.

If we want to prevent that, then we either have to clamp the delta value,
which is the worst choice or use 128bit math to avoid the overflow.

> I'm not sure if its really worth trying to recover that time or be
> perfect in those situations. Especially since on narrow clocksources
> you'll have the same result.

We can deal with the 64bit overflow at least for wide clocksources which
all virtualizaton infected architectures provide in a sane way.

For bare metal systems with narrow clocksources the whole issue is non
existant we can make the 128bit math depend on both a config switch and a
static key, so bare metal will not have to take the burden.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, John Stultz wrote:
> On Thu, Dec 1, 2016 at 3:59 AM, Thomas Gleixner  wrote:
> > So yes, we can make all this unsigned and all worries about negative deltas
> > are moot.
> 
> Sorry for the slow response, and thanks David, for stepping in here.
> 
> So apologies for not rewriting the commit message, but this is the
> reason I came around on this patch. I didn't see negative deltas as
> valid and so moving to u64 seemed proper.

I understand that, but the changelog is just a fairy tale and not really
helpful in explaining WHY this is the right thing to do.

Aside of that just making that single point u64 is just a sloppy
hack. Either we clean up the whole chain or leave it as is.

And that's something I was asking for from the very beginning, but all I
got so far was handwaving and changelogs which are worse than no changelog
at all.

> > But we really should get rid of that cycle_t typedef and simply use u64 and
> > be done with it. All this typedeffery for no value is just causing
> > confusion. I'm very well able to confuse myself, so I don't need extra
> > stimulus.
> 
> Yea, it can be obscuring. However I worry if we just have a bunch of
> u64s around folks (or maybe just me :) will mix up which are cycles
> and which are nanoseconds more easily.

Well, I really prefer non obscure data types and having:

  u64 nsec, cycles;

makes it pretty clear.

> > We have two options to deal with the issue for wide enough clocksources:
> >
> > 1) Checking that delta is less than clocksource->max_cycles.
> >
> >I really hate this because we invested a lot of thoughts to squeeze
> >everything we need for the time getters hotpath into a single cache
> >line. Accessing clocksource->max_cycles forces us to touch another
> >one. Bah!
> >
> >Aside of that what guarantees that we never run into a situation where
> >something doing timekeeping updates (NTP, PTP, PPS ...) uses such a
> >clamped value and comes to completely bogus conclusions? Are we going to
> >analyze and fixup all of that in order to prevent such wreckage?
> >
> >I seriously doubt that given the fact, that nobody sat down and analyzed
> >the signed/unsigned issue proper, which is way less complex.
> >
> > 2) Use mul_u64_u32_shr()
> >
> >That works without an extra cache line, but it's more expensive in terms
> >of text size especially on architectures which do not support the mul64
> >expansion to 128bit natively.
> >
> >But that seems like the most robust solution. We can be clever and make
> >this conditional on both a configuration switch and a static key which
> >can be turned on by guests. I'll send out a RFC series later today.
> >
> > Yet another proof that virtualization is creating more problems than it
> > solves.
> 
> I would also suggest:
> 3) If the systems are halted for longer then the timekeeping core
> expects, the system will "miss" or "lose" some portion of that halted
> time, but otherwise the system will function properly.  Which is the
> result with this patch.

Wrong. This is not the result with this patch.

If the time advances enough to overflow the unsigned mult, which is
entirely possible as it takes just twice the time of the negative overflow,
then time will go backwards again and that's not 'miss' or 'lose', that's
just broken.

If we want to prevent that, then we either have to clamp the delta value,
which is the worst choice or use 128bit math to avoid the overflow.

> I'm not sure if its really worth trying to recover that time or be
> perfect in those situations. Especially since on narrow clocksources
> you'll have the same result.

We can deal with the 64bit overflow at least for wide clocksources which
all virtualizaton infected architectures provide in a sane way.

For bare metal systems with narrow clocksources the whole issue is non
existant we can make the 128bit math depend on both a config switch and a
static key, so bare metal will not have to take the burden.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 3:59 AM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, David Gibson wrote:
>> But.. delta is a cycle_t, which is typedef'd to u64, so how could it
>> be negative?
>
> Indeed. To be honest I did not bother to look that up and for some reason I
> was assuming that it's a s64. :(
>
> So yes, we can make all this unsigned and all worries about negative deltas
> are moot.

Sorry for the slow response, and thanks David, for stepping in here.

So apologies for not rewriting the commit message, but this is the
reason I came around on this patch. I didn't see negative deltas as
valid and so moving to u64 seemed proper.


> But we really should get rid of that cycle_t typedef and simply use u64 and
> be done with it. All this typedeffery for no value is just causing
> confusion. I'm very well able to confuse myself, so I don't need extra
> stimulus.

Yea, it can be obscuring. However I worry if we just have a bunch of
u64s around folks (or maybe just me :) will mix up which are cycles
and which are nanoseconds more easily.


>> > The proper solution is to figure out WHY we are running into that situation
>> > at all. So far all I have seen are symptom reports and fairy tales about
>> > ftp connections, but no real root cause analysis.
>>
>> In the case I hit, it was due to running in a VM that had been stopped
>> for a substantial amount of time, so nothing that's actually under the
>> guest kernel's control.  The bug-as-reported was that if the VM was
>> suspended for too long it would blow up immediately upon resume.
>
> Suspended as in suspend/resume? The timekeeping_resume() code path does not
> use the conversion function, it has already it's own algorithm to protect
> against the potential overflow.
>
> So I assume that you are talking about a VM which was not scheduled by the
> host due to overcommitment (who ever thought that this is a good idea) or
> whatever other reason (yes, people were complaining about wreckage caused
> by stopping kernels with debuggers) for a long enough time to trigger that
> overflow situation. If that's the case then the unsigned conversion will
> just make it more unlikely but it still will happen.
>
> I agree that clamping the result would prevent the time going backwards
> issue for clocksources which have a wide enough counter (x86 TSC, powerpc
> incrementer, ...), but it won't prevent problems with clocksources which
> wrap around due to a small bit width of the counter.
>
> We have two options to deal with the issue for wide enough clocksources:
>
> 1) Checking that delta is less than clocksource->max_cycles.
>
>I really hate this because we invested a lot of thoughts to squeeze
>everything we need for the time getters hotpath into a single cache
>line. Accessing clocksource->max_cycles forces us to touch another
>one. Bah!
>
>Aside of that what guarantees that we never run into a situation where
>something doing timekeeping updates (NTP, PTP, PPS ...) uses such a
>clamped value and comes to completely bogus conclusions? Are we going to
>analyze and fixup all of that in order to prevent such wreckage?
>
>I seriously doubt that given the fact, that nobody sat down and analyzed
>the signed/unsigned issue proper, which is way less complex.
>
> 2) Use mul_u64_u32_shr()
>
>That works without an extra cache line, but it's more expensive in terms
>of text size especially on architectures which do not support the mul64
>expansion to 128bit natively.
>
>But that seems like the most robust solution. We can be clever and make
>this conditional on both a configuration switch and a static key which
>can be turned on by guests. I'll send out a RFC series later today.
>
> Yet another proof that virtualization is creating more problems than it
> solves.

I would also suggest:
3) If the systems are halted for longer then the timekeeping core
expects, the system will "miss" or "lose" some portion of that halted
time, but otherwise the system will function properly.  Which is the
result with this patch.

I'm not sure if its really worth trying to recover that time or be
perfect in those situations. Especially since on narrow clocksources
you'll have the same result.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread John Stultz
On Thu, Dec 1, 2016 at 3:59 AM, Thomas Gleixner  wrote:
> On Thu, 1 Dec 2016, David Gibson wrote:
>> But.. delta is a cycle_t, which is typedef'd to u64, so how could it
>> be negative?
>
> Indeed. To be honest I did not bother to look that up and for some reason I
> was assuming that it's a s64. :(
>
> So yes, we can make all this unsigned and all worries about negative deltas
> are moot.

Sorry for the slow response, and thanks David, for stepping in here.

So apologies for not rewriting the commit message, but this is the
reason I came around on this patch. I didn't see negative deltas as
valid and so moving to u64 seemed proper.


> But we really should get rid of that cycle_t typedef and simply use u64 and
> be done with it. All this typedeffery for no value is just causing
> confusion. I'm very well able to confuse myself, so I don't need extra
> stimulus.

Yea, it can be obscuring. However I worry if we just have a bunch of
u64s around folks (or maybe just me :) will mix up which are cycles
and which are nanoseconds more easily.


>> > The proper solution is to figure out WHY we are running into that situation
>> > at all. So far all I have seen are symptom reports and fairy tales about
>> > ftp connections, but no real root cause analysis.
>>
>> In the case I hit, it was due to running in a VM that had been stopped
>> for a substantial amount of time, so nothing that's actually under the
>> guest kernel's control.  The bug-as-reported was that if the VM was
>> suspended for too long it would blow up immediately upon resume.
>
> Suspended as in suspend/resume? The timekeeping_resume() code path does not
> use the conversion function, it has already it's own algorithm to protect
> against the potential overflow.
>
> So I assume that you are talking about a VM which was not scheduled by the
> host due to overcommitment (who ever thought that this is a good idea) or
> whatever other reason (yes, people were complaining about wreckage caused
> by stopping kernels with debuggers) for a long enough time to trigger that
> overflow situation. If that's the case then the unsigned conversion will
> just make it more unlikely but it still will happen.
>
> I agree that clamping the result would prevent the time going backwards
> issue for clocksources which have a wide enough counter (x86 TSC, powerpc
> incrementer, ...), but it won't prevent problems with clocksources which
> wrap around due to a small bit width of the counter.
>
> We have two options to deal with the issue for wide enough clocksources:
>
> 1) Checking that delta is less than clocksource->max_cycles.
>
>I really hate this because we invested a lot of thoughts to squeeze
>everything we need for the time getters hotpath into a single cache
>line. Accessing clocksource->max_cycles forces us to touch another
>one. Bah!
>
>Aside of that what guarantees that we never run into a situation where
>something doing timekeeping updates (NTP, PTP, PPS ...) uses such a
>clamped value and comes to completely bogus conclusions? Are we going to
>analyze and fixup all of that in order to prevent such wreckage?
>
>I seriously doubt that given the fact, that nobody sat down and analyzed
>the signed/unsigned issue proper, which is way less complex.
>
> 2) Use mul_u64_u32_shr()
>
>That works without an extra cache line, but it's more expensive in terms
>of text size especially on architectures which do not support the mul64
>expansion to 128bit natively.
>
>But that seems like the most robust solution. We can be clever and make
>this conditional on both a configuration switch and a static key which
>can be turned on by guests. I'll send out a RFC series later today.
>
> Yet another proof that virtualization is creating more problems than it
> solves.

I would also suggest:
3) If the systems are halted for longer then the timekeeping core
expects, the system will "miss" or "lose" some portion of that halted
time, but otherwise the system will function properly.  Which is the
result with this patch.

I'm not sure if its really worth trying to recover that time or be
perfect in those situations. Especially since on narrow clocksources
you'll have the same result.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, David Gibson wrote:
> On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> > On Wed, 30 Nov 2016, David Gibson wrote:
> > > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > > If we have legitimate use cases with a negative delta, then this patch
> > > > breaks them no matter what. See the basic C course section in the second
> > > > link.
> > > 
> > > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > > every case - just to make the consequences less bad if something goes
> > > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > > time going backwards tends to cause things to go horribly, horribly
> > > wrong - which was why I spotted this in the first place.
> > 
> > I completely understand the intention.
> > 
> > We _cannot_ make that whole thing unsigned when it is not 100% clear
> > that there is no legitimate caller which hands in a negative delta and
> > rightfully expects to get a negative nanoseconds value handed back.
> 
> But.. delta is a cycle_t, which is typedef'd to u64, so how could it
> be negative?

Indeed. To be honest I did not bother to look that up and for some reason I
was assuming that it's a s64. :(

So yes, we can make all this unsigned and all worries about negative deltas
are moot.

But we really should get rid of that cycle_t typedef and simply use u64 and
be done with it. All this typedeffery for no value is just causing
confusion. I'm very well able to confuse myself, so I don't need extra
stimulus.

> This is why I believed my original version (35a4933) to be safe - it
> was merely removing a signed intermediate from what was essentially an
> unsigned calculation (technically the output was signed, but the right
> shift means that's not relevant).
> 
> > If someone sits down and proves that this cannot happen there is no reason
> > to hold that off.
> > 
> > But that still does not solve the underlying root cause. Assume the
> > following:
> > 
> >   T1 = base + to_nsec(delta1)
> > 
> >where delta1 is big, but the multiplication does not overflow 64bit
> > 
> >   Now wait a bit and do:
> >   
> >   T2 = base + to_nsec(delta2)
> > 
> >now delta2 is big enough, so the  multiplication does overflow 64bit
> >now delta2 is big enough to overflow 64bit with the multiplication.
> > 
> >   The result is T2 < T1, i.e. time goes backwards.
> 
> Hm, I see.  Do we ever actually update time that way (at least core
> system time), rather than using the last result as a base?

It's:

delta = read(clocksoure) - last_update;
T = base + to_nsec(delta)

and in the update function we do:

now = read(clocksoure);
delta = now - last_update;
last_update = now;
base += to_ns(delta);

Usually the update happens once per tick and if all cpus are idle we have a
safe guard which makes sure that the update happens _before_ we run into
the overflow situation or into a wraparound of the clocksource, which ever
comes first.

> It does seem like the safer approach might be to clamp the result in
> case of overflow, though.

Right, but clamping has its own issues. See below.

> > All what the unsigned conversion does is to procrastinate the problem by a
> > factor of 2. So instead of failing after 10 seconds we fail after 20
> > seconds. And just because you never observed the 20 seconds problem it does
> > not go away magically.
> 
> At least in the case I was observing I'm pretty sure we weren't
> updating time that way - we always used a delta from the last value,
> so to_nsec() returning always positive was enough to make time not go
> backwards.

See above. But in case of failure the delta to the last update value was
big enough to overflow into negative space. Making it unsigned just
happened to 'work' because the stop time of the VM was not large enough to
trigger the unsigned mult overflow.

> > The proper solution is to figure out WHY we are running into that situation
> > at all. So far all I have seen are symptom reports and fairy tales about
> > ftp connections, but no real root cause analysis.
> 
> In the case I hit, it was due to running in a VM that had been stopped
> for a substantial amount of time, so nothing that's actually under the
> guest kernel's control.  The bug-as-reported was that if the VM was
> suspended for too long it would blow up immediately upon resume.

Suspended as in suspend/resume? The timekeeping_resume() code path does not
use the conversion function, it has already it's own algorithm to protect
against the potential overflow.

So I assume that you are talking about a VM which was not scheduled by the
host due to overcommitment (who ever thought that this is a good idea) or
whatever other reason (yes, people were complaining about wreckage caused
by stopping kernels with debuggers) for a long enough time to trigger that
overflow situation. If that's the case then the unsigned conversion will
just make it more unlikely 

Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-12-01 Thread Thomas Gleixner
On Thu, 1 Dec 2016, David Gibson wrote:
> On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> > On Wed, 30 Nov 2016, David Gibson wrote:
> > > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > > If we have legitimate use cases with a negative delta, then this patch
> > > > breaks them no matter what. See the basic C course section in the second
> > > > link.
> > > 
> > > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > > every case - just to make the consequences less bad if something goes
> > > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > > time going backwards tends to cause things to go horribly, horribly
> > > wrong - which was why I spotted this in the first place.
> > 
> > I completely understand the intention.
> > 
> > We _cannot_ make that whole thing unsigned when it is not 100% clear
> > that there is no legitimate caller which hands in a negative delta and
> > rightfully expects to get a negative nanoseconds value handed back.
> 
> But.. delta is a cycle_t, which is typedef'd to u64, so how could it
> be negative?

Indeed. To be honest I did not bother to look that up and for some reason I
was assuming that it's a s64. :(

So yes, we can make all this unsigned and all worries about negative deltas
are moot.

But we really should get rid of that cycle_t typedef and simply use u64 and
be done with it. All this typedeffery for no value is just causing
confusion. I'm very well able to confuse myself, so I don't need extra
stimulus.

> This is why I believed my original version (35a4933) to be safe - it
> was merely removing a signed intermediate from what was essentially an
> unsigned calculation (technically the output was signed, but the right
> shift means that's not relevant).
> 
> > If someone sits down and proves that this cannot happen there is no reason
> > to hold that off.
> > 
> > But that still does not solve the underlying root cause. Assume the
> > following:
> > 
> >   T1 = base + to_nsec(delta1)
> > 
> >where delta1 is big, but the multiplication does not overflow 64bit
> > 
> >   Now wait a bit and do:
> >   
> >   T2 = base + to_nsec(delta2)
> > 
> >now delta2 is big enough, so the  multiplication does overflow 64bit
> >now delta2 is big enough to overflow 64bit with the multiplication.
> > 
> >   The result is T2 < T1, i.e. time goes backwards.
> 
> Hm, I see.  Do we ever actually update time that way (at least core
> system time), rather than using the last result as a base?

It's:

delta = read(clocksoure) - last_update;
T = base + to_nsec(delta)

and in the update function we do:

now = read(clocksoure);
delta = now - last_update;
last_update = now;
base += to_ns(delta);

Usually the update happens once per tick and if all cpus are idle we have a
safe guard which makes sure that the update happens _before_ we run into
the overflow situation or into a wraparound of the clocksource, which ever
comes first.

> It does seem like the safer approach might be to clamp the result in
> case of overflow, though.

Right, but clamping has its own issues. See below.

> > All what the unsigned conversion does is to procrastinate the problem by a
> > factor of 2. So instead of failing after 10 seconds we fail after 20
> > seconds. And just because you never observed the 20 seconds problem it does
> > not go away magically.
> 
> At least in the case I was observing I'm pretty sure we weren't
> updating time that way - we always used a delta from the last value,
> so to_nsec() returning always positive was enough to make time not go
> backwards.

See above. But in case of failure the delta to the last update value was
big enough to overflow into negative space. Making it unsigned just
happened to 'work' because the stop time of the VM was not large enough to
trigger the unsigned mult overflow.

> > The proper solution is to figure out WHY we are running into that situation
> > at all. So far all I have seen are symptom reports and fairy tales about
> > ftp connections, but no real root cause analysis.
> 
> In the case I hit, it was due to running in a VM that had been stopped
> for a substantial amount of time, so nothing that's actually under the
> guest kernel's control.  The bug-as-reported was that if the VM was
> suspended for too long it would blow up immediately upon resume.

Suspended as in suspend/resume? The timekeeping_resume() code path does not
use the conversion function, it has already it's own algorithm to protect
against the potential overflow.

So I assume that you are talking about a VM which was not scheduled by the
host due to overcommitment (who ever thought that this is a good idea) or
whatever other reason (yes, people were complaining about wreckage caused
by stopping kernels with debuggers) for a long enough time to trigger that
overflow situation. If that's the case then the unsigned conversion will
just make it more unlikely 

Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-30 Thread David Gibson
On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Nov 2016, David Gibson wrote:
> > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > If we have legitimate use cases with a negative delta, then this patch
> > > breaks them no matter what. See the basic C course section in the second
> > > link.
> > 
> > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > every case - just to make the consequences less bad if something goes
> > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > time going backwards tends to cause things to go horribly, horribly
> > wrong - which was why I spotted this in the first place.
> 
> I completely understand the intention.
> 
> We _cannot_ make that whole thing unsigned when it is not 100% clear
> that there is no legitimate caller which hands in a negative delta and
> rightfully expects to get a negative nanoseconds value handed back.

But.. delta is a cycle_t, which is typedef'd to u64, so how could it
be negative?

This is why I believed my original version (35a4933) to be safe - it
was merely removing a signed intermediate from what was essentially an
unsigned calculation (technically the output was signed, but the right
shift means that's not relevant).

> If someone sits down and proves that this cannot happen there is no reason
> to hold that off.
> 
> But that still does not solve the underlying root cause. Assume the
> following:
> 
>   T1 = base + to_nsec(delta1)
> 
>where delta1 is big, but the multiplication does not overflow 64bit
> 
>   Now wait a bit and do:
>   
>   T2 = base + to_nsec(delta2)
> 
>now delta2 is big enough, so the  multiplication does overflow 64bit
>now delta2 is big enough to overflow 64bit with the multiplication.
> 
>   The result is T2 < T1, i.e. time goes backwards.

Hm, I see.  Do we ever actually update time that way (at least core
system time), rather than using the last result as a base?

It does seem like the safer approach might be to clamp the result in
case of overflow, though.

> All what the unsigned conversion does is to procrastinate the problem by a
> factor of 2. So instead of failing after 10 seconds we fail after 20
> seconds. And just because you never observed the 20 seconds problem it does
> not go away magically.

At least in the case I was observing I'm pretty sure we weren't
updating time that way - we always used a delta from the last value,
so to_nsec() returning always positive was enough to make time not go
backwards.

> The proper solution is to figure out WHY we are running into that situation
> at all. So far all I have seen are symptom reports and fairy tales about
> ftp connections, but no real root cause analysis.

In the case I hit, it was due to running in a VM that had been stopped
for a substantial amount of time, so nothing that's actually under the
guest kernel's control.  The bug-as-reported was that if the VM was
suspended for too long it would blow up immediately upon resume.

> The only reason for this to happen is that 'base' does not get updated for
> a too long time, so the delta grows into the overflow range.
> 
> We already have protection against idle sleeping too long for this to
> happen. If the idle protection is not working then it needs to be fixed.
> 
> if some other situation can cause the base not to be updated for a long
> time, then this needs to be fixed.
> 
> Curing the symptom is a guarantee that the root cause will show another
> symptom sooner than later.
> 
> Thanks,
> 
>   tglx
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-30 Thread David Gibson
On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Nov 2016, David Gibson wrote:
> > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > If we have legitimate use cases with a negative delta, then this patch
> > > breaks them no matter what. See the basic C course section in the second
> > > link.
> > 
> > So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> > every case - just to make the consequences less bad if something goes
> > wrong.  An overflow here can still mess up timekeeping, it's true, but
> > time going backwards tends to cause things to go horribly, horribly
> > wrong - which was why I spotted this in the first place.
> 
> I completely understand the intention.
> 
> We _cannot_ make that whole thing unsigned when it is not 100% clear
> that there is no legitimate caller which hands in a negative delta and
> rightfully expects to get a negative nanoseconds value handed back.

But.. delta is a cycle_t, which is typedef'd to u64, so how could it
be negative?

This is why I believed my original version (35a4933) to be safe - it
was merely removing a signed intermediate from what was essentially an
unsigned calculation (technically the output was signed, but the right
shift means that's not relevant).

> If someone sits down and proves that this cannot happen there is no reason
> to hold that off.
> 
> But that still does not solve the underlying root cause. Assume the
> following:
> 
>   T1 = base + to_nsec(delta1)
> 
>where delta1 is big, but the multiplication does not overflow 64bit
> 
>   Now wait a bit and do:
>   
>   T2 = base + to_nsec(delta2)
> 
>now delta2 is big enough, so the  multiplication does overflow 64bit
>now delta2 is big enough to overflow 64bit with the multiplication.
> 
>   The result is T2 < T1, i.e. time goes backwards.

Hm, I see.  Do we ever actually update time that way (at least core
system time), rather than using the last result as a base?

It does seem like the safer approach might be to clamp the result in
case of overflow, though.

> All what the unsigned conversion does is to procrastinate the problem by a
> factor of 2. So instead of failing after 10 seconds we fail after 20
> seconds. And just because you never observed the 20 seconds problem it does
> not go away magically.

At least in the case I was observing I'm pretty sure we weren't
updating time that way - we always used a delta from the last value,
so to_nsec() returning always positive was enough to make time not go
backwards.

> The proper solution is to figure out WHY we are running into that situation
> at all. So far all I have seen are symptom reports and fairy tales about
> ftp connections, but no real root cause analysis.

In the case I hit, it was due to running in a VM that had been stopped
for a substantial amount of time, so nothing that's actually under the
guest kernel's control.  The bug-as-reported was that if the VM was
suspended for too long it would blow up immediately upon resume.

> The only reason for this to happen is that 'base' does not get updated for
> a too long time, so the delta grows into the overflow range.
> 
> We already have protection against idle sleeping too long for this to
> happen. If the idle protection is not working then it needs to be fixed.
> 
> if some other situation can cause the base not to be updated for a long
> time, then this needs to be fixed.
> 
> Curing the symptom is a guarantee that the root cause will show another
> symptom sooner than later.
> 
> Thanks,
> 
>   tglx
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-30 Thread Thomas Gleixner
On Wed, 30 Nov 2016, David Gibson wrote:
> On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > If we have legitimate use cases with a negative delta, then this patch
> > breaks them no matter what. See the basic C course section in the second
> > link.
> 
> So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> every case - just to make the consequences less bad if something goes
> wrong.  An overflow here can still mess up timekeeping, it's true, but
> time going backwards tends to cause things to go horribly, horribly
> wrong - which was why I spotted this in the first place.

I completely understand the intention.

We _cannot_ make that whole thing unsigned when it is not 100% clear
that there is no legitimate caller which hands in a negative delta and
rightfully expects to get a negative nanoseconds value handed back.

If someone sits down and proves that this cannot happen there is no reason
to hold that off.

But that still does not solve the underlying root cause. Assume the
following:

  T1 = base + to_nsec(delta1)

   where delta1 is big, but the multiplication does not overflow 64bit

  Now wait a bit and do:
  
  T2 = base + to_nsec(delta2)

   now delta2 is big enough, so the  multiplication does overflow 64bit
   now delta2 is big enough to overflow 64bit with the multiplication.

  The result is T2 < T1, i.e. time goes backwards.

All what the unsigned conversion does is to procrastinate the problem by a
factor of 2. So instead of failing after 10 seconds we fail after 20
seconds. And just because you never observed the 20 seconds problem it does
not go away magically.

The proper solution is to figure out WHY we are running into that situation
at all. So far all I have seen are symptom reports and fairy tales about
ftp connections, but no real root cause analysis.

The only reason for this to happen is that 'base' does not get updated for
a too long time, so the delta grows into the overflow range.

We already have protection against idle sleeping too long for this to
happen. If the idle protection is not working then it needs to be fixed.

if some other situation can cause the base not to be updated for a long
time, then this needs to be fixed.

Curing the symptom is a guarantee that the root cause will show another
symptom sooner than later.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-30 Thread Thomas Gleixner
On Wed, 30 Nov 2016, David Gibson wrote:
> On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > If we have legitimate use cases with a negative delta, then this patch
> > breaks them no matter what. See the basic C course section in the second
> > link.
> 
> So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> every case - just to make the consequences less bad if something goes
> wrong.  An overflow here can still mess up timekeeping, it's true, but
> time going backwards tends to cause things to go horribly, horribly
> wrong - which was why I spotted this in the first place.

I completely understand the intention.

We _cannot_ make that whole thing unsigned when it is not 100% clear
that there is no legitimate caller which hands in a negative delta and
rightfully expects to get a negative nanoseconds value handed back.

If someone sits down and proves that this cannot happen there is no reason
to hold that off.

But that still does not solve the underlying root cause. Assume the
following:

  T1 = base + to_nsec(delta1)

   where delta1 is big, but the multiplication does not overflow 64bit

  Now wait a bit and do:
  
  T2 = base + to_nsec(delta2)

   now delta2 is big enough, so the  multiplication does overflow 64bit
   now delta2 is big enough to overflow 64bit with the multiplication.

  The result is T2 < T1, i.e. time goes backwards.

All what the unsigned conversion does is to procrastinate the problem by a
factor of 2. So instead of failing after 10 seconds we fail after 20
seconds. And just because you never observed the 20 seconds problem it does
not go away magically.

The proper solution is to figure out WHY we are running into that situation
at all. So far all I have seen are symptom reports and fairy tales about
ftp connections, but no real root cause analysis.

The only reason for this to happen is that 'base' does not get updated for
a too long time, so the delta grows into the overflow range.

We already have protection against idle sleeping too long for this to
happen. If the idle protection is not working then it needs to be fixed.

if some other situation can cause the base not to be updated for a long
time, then this needs to be fixed.

Curing the symptom is a guarantee that the root cause will show another
symptom sooner than later.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-29 Thread David Gibson
On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> On Fri, 18 Nov 2016, John Stultz wrote:
> > From: Liav Rehana 
> > 
> > During the calculation of the nsec variable in the inline function
> > timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> > is set just before the shift. The sign extension may, in some cases,
> > gain it a value near the maximum value of the 64-bit range. This is
> > bad when it is later used in a division function, such as
> > __iter_div_u64_rem, where the amount of loops it will go through to
> > calculate the division will be too large. One can encounter such a
> > problem, for example, when trying to connect through ftp from an
> > outside host to the operation system. When the OS is too overloaded,
> > delta will get a high enough value for the msb of the sum
> > delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> > shift the nsec variable will gain a value similar to
> > 0xff00. Using a variable with such a value in the
> > inline function __iter_div_u64_rem will take too long, making the
> > ftp connection attempt seem to get stuck.
> > The following commit fixes that chance of sign extension, while
> > maintaining the type of the nsec variable as signed for other
> > functions that use this variable, for possible legit negative
> > time intervals.
> >
> > Thomas/Ingo: This is for tip:timers/urgent.
> 
> Certainly not! My objections against this still stand. See:
> 
> http://lkml.kernel.org/r/alpine.DEB.2.20.1609261956160.4915@nanos
> 
> http://lkml.kernel.org/r/alpine.DEB.2.20.1609270929170.4891@nanos
> 
> If we have legitimate use cases with a negative delta, then this patch
> breaks them no matter what. See the basic C course section in the second
> link.

So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
every case - just to make the consequences less bad if something goes
wrong.  An overflow here can still mess up timekeeping, it's true, but
time going backwards tends to cause things to go horribly, horribly
wrong - which was why I spotted this in the first place.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-29 Thread David Gibson
On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> On Fri, 18 Nov 2016, John Stultz wrote:
> > From: Liav Rehana 
> > 
> > During the calculation of the nsec variable in the inline function
> > timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> > is set just before the shift. The sign extension may, in some cases,
> > gain it a value near the maximum value of the 64-bit range. This is
> > bad when it is later used in a division function, such as
> > __iter_div_u64_rem, where the amount of loops it will go through to
> > calculate the division will be too large. One can encounter such a
> > problem, for example, when trying to connect through ftp from an
> > outside host to the operation system. When the OS is too overloaded,
> > delta will get a high enough value for the msb of the sum
> > delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> > shift the nsec variable will gain a value similar to
> > 0xff00. Using a variable with such a value in the
> > inline function __iter_div_u64_rem will take too long, making the
> > ftp connection attempt seem to get stuck.
> > The following commit fixes that chance of sign extension, while
> > maintaining the type of the nsec variable as signed for other
> > functions that use this variable, for possible legit negative
> > time intervals.
> >
> > Thomas/Ingo: This is for tip:timers/urgent.
> 
> Certainly not! My objections against this still stand. See:
> 
> http://lkml.kernel.org/r/alpine.DEB.2.20.1609261956160.4915@nanos
> 
> http://lkml.kernel.org/r/alpine.DEB.2.20.1609270929170.4891@nanos
> 
> If we have legitimate use cases with a negative delta, then this patch
> breaks them no matter what. See the basic C course section in the second
> link.

So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
every case - just to make the consequences less bad if something goes
wrong.  An overflow here can still mess up timekeeping, it's true, but
time going backwards tends to cause things to go horribly, horribly
wrong - which was why I spotted this in the first place.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-29 Thread Thomas Gleixner
On Fri, 18 Nov 2016, John Stultz wrote:
> From: Liav Rehana 
> 
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large. One can encounter such a
> problem, for example, when trying to connect through ftp from an
> outside host to the operation system. When the OS is too overloaded,
> delta will get a high enough value for the msb of the sum
> delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> shift the nsec variable will gain a value similar to
> 0xff00. Using a variable with such a value in the
> inline function __iter_div_u64_rem will take too long, making the
> ftp connection attempt seem to get stuck.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.
>
> Thomas/Ingo: This is for tip:timers/urgent.

Certainly not! My objections against this still stand. See:

http://lkml.kernel.org/r/alpine.DEB.2.20.1609261956160.4915@nanos

http://lkml.kernel.org/r/alpine.DEB.2.20.1609270929170.4891@nanos

If we have legitimate use cases with a negative delta, then this patch
breaks them no matter what. See the basic C course section in the second
link.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-29 Thread Thomas Gleixner
On Fri, 18 Nov 2016, John Stultz wrote:
> From: Liav Rehana 
> 
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large. One can encounter such a
> problem, for example, when trying to connect through ftp from an
> outside host to the operation system. When the OS is too overloaded,
> delta will get a high enough value for the msb of the sum
> delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> shift the nsec variable will gain a value similar to
> 0xff00. Using a variable with such a value in the
> inline function __iter_div_u64_rem will take too long, making the
> ftp connection attempt seem to get stuck.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.
>
> Thomas/Ingo: This is for tip:timers/urgent.

Certainly not! My objections against this still stand. See:

http://lkml.kernel.org/r/alpine.DEB.2.20.1609261956160.4915@nanos

http://lkml.kernel.org/r/alpine.DEB.2.20.1609270929170.4891@nanos

If we have legitimate use cases with a negative delta, then this patch
breaks them no matter what. See the basic C course section in the second
link.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-28 Thread John Stultz
On Fri, Nov 18, 2016 at 8:53 PM, John Stultz  wrote:
> From: Liav Rehana 
>
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large. One can encounter such a
> problem, for example, when trying to connect through ftp from an
> outside host to the operation system. When the OS is too overloaded,
> delta will get a high enough value for the msb of the sum
> delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> shift the nsec variable will gain a value similar to
> 0xff00. Using a variable with such a value in the
> inline function __iter_div_u64_rem will take too long, making the
> ftp connection attempt seem to get stuck.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.
>
> Cc: Chris Metcalf 
> Cc: Thomas Gleixner 
> Cc: Richard Cochran 
> Cc: Ingo Molnar 
> Cc: Prarit Bhargava 
> Cc: Laurent Vivier 
> Cc: David Gibson 
> Cc: "Christopher S . Hall" 
> Cc: sta...@vger.kernel.org  (4.6+)
> Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
> Also-Reported-by: Chris Metcalf 
> Signed-off-by: Liav Rehana 
> Signed-off-by: John Stultz 
> ---
> Thomas/Ingo: This is for tip:timers/urgent.

Hey Thomas, Ingo,
  I just wanted to follow up to make sure this wasn't missed last
time. Should be applied against tip/timers/urgent.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-11-28 Thread John Stultz
On Fri, Nov 18, 2016 at 8:53 PM, John Stultz  wrote:
> From: Liav Rehana 
>
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large. One can encounter such a
> problem, for example, when trying to connect through ftp from an
> outside host to the operation system. When the OS is too overloaded,
> delta will get a high enough value for the msb of the sum
> delta * tkr->mult + tkr->xtime_nsec to be set, and so after the
> shift the nsec variable will gain a value similar to
> 0xff00. Using a variable with such a value in the
> inline function __iter_div_u64_rem will take too long, making the
> ftp connection attempt seem to get stuck.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.
>
> Cc: Chris Metcalf 
> Cc: Thomas Gleixner 
> Cc: Richard Cochran 
> Cc: Ingo Molnar 
> Cc: Prarit Bhargava 
> Cc: Laurent Vivier 
> Cc: David Gibson 
> Cc: "Christopher S . Hall" 
> Cc: sta...@vger.kernel.org  (4.6+)
> Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
> Also-Reported-by: Chris Metcalf 
> Signed-off-by: Liav Rehana 
> Signed-off-by: John Stultz 
> ---
> Thomas/Ingo: This is for tip:timers/urgent.

Hey Thomas, Ingo,
  I just wanted to follow up to make sure this wasn't missed last
time. Should be applied against tip/timers/urgent.

thanks
-john


RE: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-27 Thread Thomas Gleixner
On Tue, 27 Sep 2016, Liav Rehana wrote:
> -Original Message-
.
> 

Can you pretty please get rid of this --Original Message-- crap. There is
absolutely no value in copying the mail header into every reply.

> On Mon, 26 Sep 2016, Liav Rehana wrote:
> >> During the calculation of the nsec variable in the inline function 
> >> timekeeping_delta_to_ns, it may undergo a sign extension if its msb is 
> >> set just before the shift. The sign extension may, in some cases, gain 
> >> it a value near the maximum value of the 64-bit range. This is bad 
> >> when it is later used in a division function, such as 
> >> __iter_div_u64_rem, where the amount of loops it will go through to 
> >> calculate the division will be too large.
> >> The following commit fixes that chance of sign extension,
> >
> > Again. This does not fix anything, it papers over the underlying problem
> > that the calling code hands in a delta which is big enough to overflow
> > the multiplication into the negative space. >You just extend the range of
> > deltas which are handled gracefully, but that does not fix the underlying
> > problem as we still can run into the multiplication overflow. It won't
> > cause the result to be negative, but it will be crap nevertheless.
> >>
> >> while maintaining the type of the nsec variable as signed for other 
> >> functions that use this variable, for possible legit negative time 
> >> intervals.
> 
> > What is this maintaining? The type of this variable changes to u64 and
> > other functions cannot use this variable at all because it's local to
> > that function. This sentence makes no sense at >all.
>
> About your first note,

Please answer inline right below my answer and not at some remote place
which makes the conversation hard to follow.

> I understand that it doesn't fix the overflow chance, but I'm not quite
> sure what can be done about that. If there was a code in the calling
> function that detects such event, what do you think can be done about it
> ? Would you just print a warning and lower delta as to not overflow after
> the multiplication ? If not, how do you think the problem I ran into can
> be fixed, if not by eliminating the possibility for sign extension the
> way I did ?

The problem you ran into is part of the overall problem. You got lucky that
it just expands into the negative space which can be "fixed" by making the
computation unsigned. I think I explained what needs to be done in some of
the previous mails and it's really not rocket science to deal with it.

> About the other note, I understood from you and John that there are some
> cases where negative time intervals are valid. What I meant by
> 'maintaining the type of the nsec variable as signed' was, that for the
> other functions that call the function I've changed, they do define a
> variable named nsec, and they define it as signed. In their
> implementation they assign him a value that is returned from the function
> I've changed. While the nsec variable is unsigned now in the function
> that calculates it, it can still return a value with an MSB that is set,
> which will be handled as negative in the caller function, where it was
> defined as signed. In that case, the change I added just removes the
> possibility of a sign extension, but the nsec variable will still be
> viewed as negative on the caller functions where it was supposed to
> return a negative value in the function I've changed.

This is complete and utter bullshit. How on earth would you end up with a
sensible result when delta is negative?



With nsec being unsigend the result after the shift is guaranteed to have
the N topmost bits to be 0, where N = tkr->shift. And that does not get
magically get the MSB set just because you assign the u64 to a s64 at the
call site.



Thanks,

tglx


RE: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-27 Thread Thomas Gleixner
On Tue, 27 Sep 2016, Liav Rehana wrote:
> -Original Message-
.
> 

Can you pretty please get rid of this --Original Message-- crap. There is
absolutely no value in copying the mail header into every reply.

> On Mon, 26 Sep 2016, Liav Rehana wrote:
> >> During the calculation of the nsec variable in the inline function 
> >> timekeeping_delta_to_ns, it may undergo a sign extension if its msb is 
> >> set just before the shift. The sign extension may, in some cases, gain 
> >> it a value near the maximum value of the 64-bit range. This is bad 
> >> when it is later used in a division function, such as 
> >> __iter_div_u64_rem, where the amount of loops it will go through to 
> >> calculate the division will be too large.
> >> The following commit fixes that chance of sign extension,
> >
> > Again. This does not fix anything, it papers over the underlying problem
> > that the calling code hands in a delta which is big enough to overflow
> > the multiplication into the negative space. >You just extend the range of
> > deltas which are handled gracefully, but that does not fix the underlying
> > problem as we still can run into the multiplication overflow. It won't
> > cause the result to be negative, but it will be crap nevertheless.
> >>
> >> while maintaining the type of the nsec variable as signed for other 
> >> functions that use this variable, for possible legit negative time 
> >> intervals.
> 
> > What is this maintaining? The type of this variable changes to u64 and
> > other functions cannot use this variable at all because it's local to
> > that function. This sentence makes no sense at >all.
>
> About your first note,

Please answer inline right below my answer and not at some remote place
which makes the conversation hard to follow.

> I understand that it doesn't fix the overflow chance, but I'm not quite
> sure what can be done about that. If there was a code in the calling
> function that detects such event, what do you think can be done about it
> ? Would you just print a warning and lower delta as to not overflow after
> the multiplication ? If not, how do you think the problem I ran into can
> be fixed, if not by eliminating the possibility for sign extension the
> way I did ?

The problem you ran into is part of the overall problem. You got lucky that
it just expands into the negative space which can be "fixed" by making the
computation unsigned. I think I explained what needs to be done in some of
the previous mails and it's really not rocket science to deal with it.

> About the other note, I understood from you and John that there are some
> cases where negative time intervals are valid. What I meant by
> 'maintaining the type of the nsec variable as signed' was, that for the
> other functions that call the function I've changed, they do define a
> variable named nsec, and they define it as signed. In their
> implementation they assign him a value that is returned from the function
> I've changed. While the nsec variable is unsigned now in the function
> that calculates it, it can still return a value with an MSB that is set,
> which will be handled as negative in the caller function, where it was
> defined as signed. In that case, the change I added just removes the
> possibility of a sign extension, but the nsec variable will still be
> viewed as negative on the caller functions where it was supposed to
> return a negative value in the function I've changed.

This is complete and utter bullshit. How on earth would you end up with a
sensible result when delta is negative?



With nsec being unsigend the result after the shift is guaranteed to have
the N topmost bits to be 0, where N = tkr->shift. And that does not get
magically get the MSB set just because you assign the u64 to a s64 at the
call site.



Thanks,

tglx


RE: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread Liav Rehana


-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de] 
Sent: Tuesday, September 27, 2016 3:01 AM
To: Liav Rehana 
Cc: john.stu...@linaro.org; linux-kernel@vger.kernel.org; Elad Kanfi 
; Noam Camus 
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in 
its calculation.

On Mon, 26 Sep 2016, Liav Rehana wrote:
>> During the calculation of the nsec variable in the inline function 
>> timekeeping_delta_to_ns, it may undergo a sign extension if its msb is 
>> set just before the shift. The sign extension may, in some cases, gain 
>> it a value near the maximum value of the 64-bit range. This is bad 
>> when it is later used in a division function, such as 
>> __iter_div_u64_rem, where the amount of loops it will go through to 
>> calculate the division will be too large.
>> The following commit fixes that chance of sign extension,

>Again. This does not fix anything, it papers over the underlying problem that 
>the calling code hands in a delta which is big enough to overflow the 
>multiplication into the negative space. >You just extend the range of deltas 
>which are handled gracefully, but that does not fix the underlying problem as 
>we still can run into the multiplication overflow. It won't cause the >result 
>to be negative, but it will be crap nevertheless.

>> while maintaining the type of the nsec variable as signed for other 
>> functions that use this variable, for possible legit negative time 
>> intervals.

>What is this maintaining? The type of this variable changes to u64 and other 
>functions cannot use this variable at all because it's local to that function. 
>This sentence makes no sense at >all.

About your first note, I understand that it doesn't fix the overflow chance, 
but I'm not quite sure what can be done about that. If there was a code in the 
calling function that detects
such event, what do you think can be done about it ? Would you just print a 
warning and lower delta as to not overflow after the multiplication ? If not, 
how do you think the problem I ran into can be fixed, if not by eliminating the 
possibility for sign extension the way I did ?

About the other note, I understood from you and John that there are some cases 
where negative time intervals are valid. What I meant by 'maintaining the type 
of the nsec variable as signed' was, that for the other functions that call the 
function I've changed, they do define a variable named nsec, and they define it 
as signed. In their implementation they assign him a value that is returned 
from the function I've changed. While the nsec variable is unsigned now in the 
function that calculates it, it can still return a value with an MSB that is 
set, which will be handled as negative in the caller function, where it was 
defined as signed. In that case, the change I added just removes the 
possibility of a sign extension, but the nsec variable will still be viewed as 
negative on the caller functions where it was supposed to return a negative 
value in the function I've changed.


RE: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread Liav Rehana


-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de] 
Sent: Tuesday, September 27, 2016 3:01 AM
To: Liav Rehana 
Cc: john.stu...@linaro.org; linux-kernel@vger.kernel.org; Elad Kanfi 
; Noam Camus 
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in 
its calculation.

On Mon, 26 Sep 2016, Liav Rehana wrote:
>> During the calculation of the nsec variable in the inline function 
>> timekeeping_delta_to_ns, it may undergo a sign extension if its msb is 
>> set just before the shift. The sign extension may, in some cases, gain 
>> it a value near the maximum value of the 64-bit range. This is bad 
>> when it is later used in a division function, such as 
>> __iter_div_u64_rem, where the amount of loops it will go through to 
>> calculate the division will be too large.
>> The following commit fixes that chance of sign extension,

>Again. This does not fix anything, it papers over the underlying problem that 
>the calling code hands in a delta which is big enough to overflow the 
>multiplication into the negative space. >You just extend the range of deltas 
>which are handled gracefully, but that does not fix the underlying problem as 
>we still can run into the multiplication overflow. It won't cause the >result 
>to be negative, but it will be crap nevertheless.

>> while maintaining the type of the nsec variable as signed for other 
>> functions that use this variable, for possible legit negative time 
>> intervals.

>What is this maintaining? The type of this variable changes to u64 and other 
>functions cannot use this variable at all because it's local to that function. 
>This sentence makes no sense at >all.

About your first note, I understand that it doesn't fix the overflow chance, 
but I'm not quite sure what can be done about that. If there was a code in the 
calling function that detects
such event, what do you think can be done about it ? Would you just print a 
warning and lower delta as to not overflow after the multiplication ? If not, 
how do you think the problem I ran into can be fixed, if not by eliminating the 
possibility for sign extension the way I did ?

About the other note, I understood from you and John that there are some cases 
where negative time intervals are valid. What I meant by 'maintaining the type 
of the nsec variable as signed' was, that for the other functions that call the 
function I've changed, they do define a variable named nsec, and they define it 
as signed. In their implementation they assign him a value that is returned 
from the function I've changed. While the nsec variable is unsigned now in the 
function that calculates it, it can still return a value with an MSB that is 
set, which will be handled as negative in the caller function, where it was 
defined as signed. In that case, the change I added just removes the 
possibility of a sign extension, but the nsec variable will still be viewed as 
negative on the caller functions where it was supposed to return a negative 
value in the function I've changed.


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread Thomas Gleixner
On Mon, 26 Sep 2016, Liav Rehana wrote:
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large.
> The following commit fixes that chance of sign extension,

Again. This does not fix anything, it papers over the underlying problem
that the calling code hands in a delta which is big enough to overflow the
multiplication into the negative space. You just extend the range of deltas
which are handled gracefully, but that does not fix the underlying problem
as we still can run into the multiplication overflow. It won't cause the
result to be negative, but it will be crap nevertheless.

> while maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative time
> intervals.

What is this maintaining? The type of this variable changes to u64 and
other functions cannot use this variable at all because it's local to that
function. This sentence makes no sense at all.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread Thomas Gleixner
On Mon, 26 Sep 2016, Liav Rehana wrote:
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large.
> The following commit fixes that chance of sign extension,

Again. This does not fix anything, it papers over the underlying problem
that the calling code hands in a delta which is big enough to overflow the
multiplication into the negative space. You just extend the range of deltas
which are handled gracefully, but that does not fix the underlying problem
as we still can run into the multiplication overflow. It won't cause the
result to be negative, but it will be crap nevertheless.

> while maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative time
> intervals.

What is this maintaining? The type of this variable changes to u64 and
other functions cannot use this variable at all because it's local to that
function. This sentence makes no sense at all.

Thanks,

tglx


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread John Stultz
On Sun, Sep 25, 2016 at 10:45 PM, Liav Rehana  wrote:
> From: Liav Rehana 
>
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.

Apologies for my foggy memory here (just got off a plane).  But remind
me, is this something that you actually ran into, or is it more of
just a theoretical concern?

If it is something that you did trip on, please describe your case in
the changelog so that the proper urgency is applied when weighing if
such a commit should go into -stable.

thanks
-john


Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

2016-09-26 Thread John Stultz
On Sun, Sep 25, 2016 at 10:45 PM, Liav Rehana  wrote:
> From: Liav Rehana 
>
> During the calculation of the nsec variable in the inline function
> timekeeping_delta_to_ns, it may undergo a sign extension if its msb
> is set just before the shift. The sign extension may, in some cases,
> gain it a value near the maximum value of the 64-bit range. This is
> bad when it is later used in a division function, such as
> __iter_div_u64_rem, where the amount of loops it will go through to
> calculate the division will be too large.
> The following commit fixes that chance of sign extension, while
> maintaining the type of the nsec variable as signed for other
> functions that use this variable, for possible legit negative
> time intervals.

Apologies for my foggy memory here (just got off a plane).  But remind
me, is this something that you actually ran into, or is it more of
just a theoretical concern?

If it is something that you did trip on, please describe your case in
the changelog so that the proper urgency is applied when weighing if
such a commit should go into -stable.

thanks
-john