Re: PSA: If you run -current, beware!

2015-02-05 Thread Ryan Stone
On Wed, Feb 4, 2015 at 6:15 PM, Peter Wemm  wrote:
> --- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
> +++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
> @@ -410,6 +415,11 @@
>  #ifdef SW_WATCHDOG
>         EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
>  #endif
> +       /*
> +        * Arrange for ticks to go negative just 5 minutes after boot
> +        * to help catch sign problems sooner.
> +        */
> +       ticks = INT_MAX - (hz * 5 * 60);
>  }

Should we just commit this under #ifdef INVARIANTS?
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: PSA: If you run -current, beware!

2015-02-05 Thread Alfred Perlstein



On 2/5/15 11:00 AM, Peter Wemm wrote:
> On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:
> > On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
> > > On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> > > > On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> > > ...
> > >
> > > > > > > It is fixed (in the proper meaning of the word, not like worked
> > > > > > > around, covered by paper) by the patch at the end of the mail.
> > > > > > >
> > > > > > > We already have a story trying to enable much less ambitious
> > > > > > > option -fno-strict-overflow, see r259045 and the revert in
> > > > > > > r259422.  I do not see other way than try one more time.  Too
> > > > > > > many places in kernel depend on the correctly wrapping
> > > > > > > 2-complement arithmetic, among others are callweel and scheduler.
> > > > >
> > > > > Rather than depending on a compiler option, wouldn't it be
> > > > > better/more robust to change ticks to unsigned, which has specified
> > > > > wrapping behavior?
> > > >
> > > > Yes, but non-trivial.  It's also not limited to ticks.  Since the
> > > > compiler knows when it would apply these optimizations, it would be
> > > > nice if it could warn instead (GCC apparently has a warning, but
> > > > clang does not).  Having people do a manual audit of every signed
> > > > integer expression in the tree will take a long time.
> > >
> > > I think I misunderstood the problem as being limited to ticks,
> > > which is probably only one symptom of a fundamental change in behaviour
> > > of the compiler.
> > > Still, it might be worthwhile start looking at ints that ought to be
> > > implemented as u_int
> >
> > I actually agree, I just think we are stuck with -fwrapv in the interval,
> > but it's probably not a short interval.  I think converting ticks to
> > unsigned would be a good first start.
>
> For the record, I agree.  However, I suspect that attempts to do so will have
> a non trivial number of bugs introduced.  We have a track record of recurring
> problems with tcp sequence number space arithmetic and tcp timing, partly
> because the wraparounds happens infrequently.
>
> In the mean time, I feel that telling the compiler that it's OK to let it
> behave the way we expect (vs actively sabotaging it) is a viable stopgap.



Seems like it would make sense to move these functions into files that 
can be easily compiled outside of the kernel and then add unit tests.


I've done this before, to prove that larger pcb hashes help performance 
on large workloads.


-Alfred




Re: PSA: If you run -current, beware!

2015-02-05 Thread Peter Wemm
On Thursday, February 05, 2015 11:00:46 AM Peter Wemm wrote:
> On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:
> > On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
> > > On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> > > > On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> > > ...
> > > 
> > > > > > > It is fixed (in the proper meaning of the word, not like worked
> > > > > > > around,
> > > > > > > covered by paper) by the patch at the end of the mail.
> > > > > > > 
> > > > > > > We already have a story trying to enable much less ambitious
> > > > > > > option
> > > > > > > -fno-strict-overflow, see r259045 and the revert in r259422.  I
> > > > > > > do
> > > > > > > not
> > > > > > > see other way than try one more time.  Too many places in kernel
> > > > > > > depend on the correctly wrapping 2-complement arithmetic, among
> > > > > > > others
> > > > > > > are callweel and scheduler.
> > > > > 
> > > > > Rather than depending on a compiler option, wouldn't it be
> > > > > better/more
> > > > > robust to change ticks to unsigned, which has specified wrapping
> > > > > behavior?
> > > > 
> > > > Yes, but non-trivial.  It's also not limited to ticks.  Since the
> > > > compiler
> > > > knows when it would apply these optimizations, it would be nice if it
> > > > could
> > > > warn instead (GCC apparently has a warning, but clang does not). 
> > > > Having
> > > > people do a manual audit of every signed integer expression in the
> > > > tree
> > > > will take a long time.
> > > 
> > > I think I misunderstood the problem as being limited to ticks,
> > > which is probably only one symptom of a fundamental change in behaviour
> > > of the compiler.
> > > Still, it might be worthwhile start looking at ints that ought to be
> > > implemented as u_int
> > 
> > I actually agree, I just think we are stuck with -fwrapv in the interval,
> > but it's probably not a short interval.  I think converting ticks to
> > unsigned would be a good first start.
> 
> For the record, I agree.  However, I suspect that attempts to do so will
> have a non trivial number of bugs introduced.  We have a track record of
> recurring problems with tcp sequence number space arithmetic and tcp
> timing, partly because the wraparounds happens infrequently.

BTW: anybody working on this will want to run with  kern.hz="100000"  in 
loader.conf (or higher).  Having the clock tick 100 times faster speeds the 
rollover up from every ~25 days to every ~6 hours.  I don't know what the 
practical limit is but at some point it will cause sufficient pain due to 
contention that it won't be useful.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…



Re: PSA: If you run -current, beware!

2015-02-05 Thread Peter Wemm
On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:
> On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
> > On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> > > On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> > ...
> > 
> > > > > > It is fixed (in the proper meaning of the word, not like worked
> > > > > > around,
> > > > > > covered by paper) by the patch at the end of the mail.
> > > > > > 
> > > > > > We already have a story trying to enable much less ambitious
> > > > > > option
> > > > > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do
> > > > > > not
> > > > > > see other way than try one more time.  Too many places in kernel
> > > > > > depend on the correctly wrapping 2-complement arithmetic, among
> > > > > > others
> > > > > > are callweel and scheduler.
> > > > 
> > > > Rather than depending on a compiler option, wouldn't it be better/more
> > > > robust to change ticks to unsigned, which has specified wrapping
> > > > behavior?
> > > 
> > > Yes, but non-trivial.  It's also not limited to ticks.  Since the
> > > compiler
> > > knows when it would apply these optimizations, it would be nice if it
> > > could
> > > warn instead (GCC apparently has a warning, but clang does not).  Having
> > > people do a manual audit of every signed integer expression in the tree
> > > will take a long time.
> > 
> > I think I misunderstood the problem as being limited to ticks,
> > which is probably only one symptom of a fundamental change in behaviour
> > of the compiler.
> > Still, it might be worthwhile start looking at ints that ought to be
> > implemented as u_int
> 
> I actually agree, I just think we are stuck with -fwrapv in the interval,
> but it's probably not a short interval.  I think converting ticks to
> unsigned would be a good first start.

For the record, I agree.  However, I suspect that attempts to do so will 
introduce a non-trivial number of bugs.  We have a track record of recurring 
problems with TCP sequence number space arithmetic and TCP timing, partly 
because the wraparounds happen infrequently.

In the meantime, I feel that telling the compiler that it's OK to let it 
behave the way we expect (vs actively sabotaging it) is a viable stopgap.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…



Re: PSA: If you run -current, beware!

2015-02-05 Thread Brooks Davis
On Thu, Feb 05, 2015 at 10:48:54AM -0500, John Baldwin wrote:
> On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
> > On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> > > On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> > ...
> > 
> > > > > > It is fixed (in the proper meaning of the word, not like worked
> > > > > > around,
> > > > > > covered by paper) by the patch at the end of the mail.
> > > > > > 
> > > > > > We already have a story trying to enable much less ambitious option
> > > > > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do
> > > > > > not
> > > > > > see other way than try one more time.  Too many places in kernel
> > > > > > depend on the correctly wrapping 2-complement arithmetic, among
> > > > > > others
> > > > > > are callweel and scheduler.
> > > > 
> > > > Rather than depending on a compiler option, wouldn't it be better/more
> > > > robust to change ticks to unsigned, which has specified wrapping
> > > > behavior?
> > > 
> > > Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler
> > > knows when it would apply these optimizations, it would be nice if it
> > > could
> > > warn instead (GCC apparently has a warning, but clang does not).  Having
> > > people do a manual audit of every signed integer expression in the tree
> > > will take a long time.
> > 
> > I think I misunderstood the problem as being limited to ticks,
> > which is probably only one symptom of a fundamental change in behaviour
> > of the compiler.
> > Still, it might be worthwhile start looking at ints that ought to be
> > implemented as u_int
> 
> I actually agree, I just think we are stuck with -fwrapv in the interval, but 
> it's probably not a short interval.  I think converting ticks to unsigned 
> would be a good first start.

In principle MIT's KINT tool should help here.  Unfortunately, it's based
on LLVM 3.1 and appears to be unmaintained.

-- Brooks




Re: PSA: If you run -current, beware!

2015-02-05 Thread John Baldwin
On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
> On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> > On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> ...
> 
> > > > > It is fixed (in the proper meaning of the word, not like worked
> > > > > around,
> > > > > covered by paper) by the patch at the end of the mail.
> > > > > 
> > > > > We already have a story trying to enable much less ambitious option
> > > > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do
> > > > > not
> > > > > see other way than try one more time.  Too many places in kernel
> > > > > depend on the correctly wrapping 2-complement arithmetic, among
> > > > > others
> > > > > are callweel and scheduler.
> > > 
> > > Rather than depending on a compiler option, wouldn't it be better/more
> > > robust to change ticks to unsigned, which has specified wrapping
> > > behavior?
> > 
> > Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler
> > knows when it would apply these optimizations, it would be nice if it
> > could
> > warn instead (GCC apparently has a warning, but clang does not).  Having
> > people do a manual audit of every signed integer expression in the tree
> > will take a long time.
> 
> I think I misunderstood the problem as being limited to ticks,
> which is probably only one symptom of a fundamental change in behaviour
> of the compiler.
> Still, it might be worthwhile start looking at ints that ought to be
> implemented as u_int

I actually agree, I just think we are stuck with -fwrapv in the interim, and 
that interim is probably not a short one.  I think converting ticks to 
unsigned would be a good first step.

-- 
John Baldwin


Re: PSA: If you run -current, beware!

2015-02-05 Thread Luigi Rizzo
On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
> On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
...
> > > > It is fixed (in the proper meaning of the word, not like worked around,
> > > > covered by paper) by the patch at the end of the mail.
> > > > 
> > > > We already have a story trying to enable much less ambitious option
> > > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> > > > see other way than try one more time.  Too many places in kernel
> > > > depend on the correctly wrapping 2-complement arithmetic, among others
> > > > are callweel and scheduler.
> > 
> > Rather than depending on a compiler option, wouldn't it be better/more
> > robust to change ticks to unsigned, which has specified wrapping behavior?
> 
> Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler 
> knows when it would apply these optimizations, it would be nice if it could 
> warn instead (GCC apparently has a warning, but clang does not).  Having 
> people do a manual audit of every signed integer expression in the tree will 
> take a long time.


I think I misunderstood the problem as being limited to ticks,
which is probably only one symptom of a fundamental change in behaviour
of the compiler.
Still, it might be worthwhile to start looking at ints that ought to be
implemented as u_int.

cheers
luigi


Re: PSA: If you run -current, beware!

2015-02-05 Thread Ed Maste
On 5 February 2015 at 02:48, Luigi Rizzo  wrote:
>
> Rather than depending on a compiler option, wouldn't it be better/more
> robust to change ticks to unsigned, which has specified wrapping behavior?

I believe there are cases other than ticks that rely on two's-complement
signed wrap. We'd want to make sure we find such cases.  Newer GCC can
help with that.  The -Wstrict-overflow flag causes the compiler to
warn when implementing an optimization based on undefined behaviour
from signed overflow.

Correct C code should work with or without -fwrapv, so we can do both:
enable -fwrapv, and make changes to stop relying on undefined
behaviour.  For ticks specifically we have many examples over time of
incorrect calculations so we'll benefit from some work here,
independent of signed overflow.
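
As an illustration of tick arithmetic that does not depend on -fwrapv, here is a minimal sketch using unsigned counters. The function names are made up for this example and are not existing kernel interfaces; it only shows the general technique of letting unsigned subtraction do the wrapping.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Ticks elapsed from 'then' to 'now'.  Unsigned subtraction is defined
 * to wrap modulo UINT_MAX + 1, so the result stays correct across one
 * counter rollover, with or without -fwrapv.
 * Hypothetical helper, for illustration only.
 */
unsigned int
ticks_elapsed(unsigned int now, unsigned int then)
{
	return (now - then);
}

/*
 * True once 'now' has reached or passed 'deadline', provided the two
 * values are less than half the counter range apart.
 * Hypothetical helper, for illustration only.
 */
bool
ticks_deadline_passed(unsigned int now, unsigned int deadline)
{
	return (now - deadline <= UINT_MAX / 2);
}
```

Neither function converts back to a signed type, so there is no signed overflow and no implementation-defined conversion for the optimizer to reason about.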


Re: PSA: If you run -current, beware!

2015-02-05 Thread John Baldwin
On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
> On Thursday, February 5, 2015, Peter Wemm  wrote:
> > On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> > > On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > > > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has
> > 
> > been
> > 
> > > > introduced to 11.x/head/-current.With HZ=1000 (the default for
> > > > bare
> > > > metal, not for a vm); the clocks stop just after 24 days of uptime.
> > 
> > This
> > 
> > > > means things like cron, sleep, timeouts etc stop working.  TCP/IP
> > > > won't
> > > > time out or retransmit, etc etc.  It can get ugly.
> > > > 
> > > > The problem is NOT in 10.x/-stable.
> > > > 
> > > > We hit this in the freebsd.org cluster, the builds that we used are:
> > > > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > > > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> > > > 
> > > > If you are running -current in a situation where it'll accumulate
> > 
> > uptime,
> > 
> > > > you may want to take precautions.  A reboot prior to 24 days uptime
> > > > (as
> > > > horrible a workaround as that is) will avoid it.
> > > > 
> > > > Yes, this is being worked on.
> > > 
> > > So the issue is reproducable in 3 minutes after boot with the following
> > > change in kern_clock.c:
> > > volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> > > 
> > > It is fixed (in the proper meaning of the word, not like worked around,
> > > covered by paper) by the patch at the end of the mail.
> > > 
> > > We already have a story trying to enable much less ambitious option
> > > -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> > > see other way than try one more time.  Too many places in kernel
> > > depend on the correctly wrapping 2-complement arithmetic, among others
> > > are callweel and scheduler.
> 
> Rather than depending on a compiler option, wouldn't it be better/more
> robust to change ticks to unsigned, which has specified wrapping behavior?

Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler 
knows when it would apply these optimizations, it would be nice if it could 
warn instead (GCC apparently has a warning, but clang does not).  Having 
people do a manual audit of every signed integer expression in the tree will 
take a long time.

-- 
John Baldwin


Re: PSA: If you run -current, beware!

2015-02-05 Thread David Chisnall
On 5 Feb 2015, at 07:48, Luigi Rizzo  wrote:
> 
> Rather than depending on a compiler option, wouldn't it be better/more
> robust to change ticks to unsigned, which has specified wrapping behavior?

Especially if we want to extend support for external toolchains.  gcc and clang 
support -fwrapv (though occasionally versions of both will not fully support 
it), but other compilers may well not have an equivalent.

Translating the code into C is a far more robust solution than the band-aid of 
telling the compiler to accept a language that is a bit like C and hoping that 
this will keep working across compiler implementations and versions.

Adding -fwrapv also defeats a number of compiler optimisations, so we are going 
to generate worse code for places where people used signed types correctly to 
work around places where they were used incorrectly.

David



Re: PSA: If you run -current, beware!

2015-02-04 Thread Luigi Rizzo
On Thursday, February 5, 2015, Peter Wemm  wrote:

> On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> > On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has
> been
> > > introduced to 11.x/head/-current.With HZ=1000 (the default for bare
> > > metal, not for a vm); the clocks stop just after 24 days of uptime.
> This
> > > means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
> > > time out or retransmit, etc etc.  It can get ugly.
> > >
> > > The problem is NOT in 10.x/-stable.
> > >
> > > We hit this in the freebsd.org cluster, the builds that we used are:
> > > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> > >
> > > If you are running -current in a situation where it'll accumulate
> uptime,
> > > you may want to take precautions.  A reboot prior to 24 days uptime (as
> > > horrible a workaround as that is) will avoid it.
> > >
> > > Yes, this is being worked on.
> >
> > So the issue is reproducable in 3 minutes after boot with the following
> > change in kern_clock.c:
> > volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> >
> > It is fixed (in the proper meaning of the word, not like worked around,
> > covered by paper) by the patch at the end of the mail.
> >
> > We already have a story trying to enable much less ambitious option
> > -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> > see other way than try one more time.  Too many places in kernel
> > depend on the correctly wrapping 2-complement arithmetic, among others
> > are callweel and scheduler.
>
>
Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping behavior?

Cheers
Luigi

> Ugh.
>
> I believe I have a smoking gun that suggests that the clock-stop problem is
> caused by the clang-3.5 import on Dec 31st.
>
> Backstory:
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
> http://www.airs.com/blog/archives/120
>
> I suspect that what has happened is that clang's optimizer got better at
> seeing the direct or indirect effects of integer overflow and clang (and
> gcc)
> take advantage of that.
>
> I have used a slightly different change for about 10 years:
>
> --- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
> +++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
> @@ -410,6 +415,11 @@
>  #ifdef SW_WATCHDOG
>         EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
>  #endif
> +       /*
> +        * Arrange for ticks to go negative just 5 minutes after boot
> +        * to help catch sign problems sooner.
> +        */
> +       ticks = INT_MAX - (hz * 5 * 60);
>  }
>
>  /*
>
> This came about from when we had problems with integer overflow arithmetic
> in
> the tcp stack.
>
> In any case, I'm in the process of adding -fwrapv and the early wraparound
> to
> the freebsd.org cluster builds to give it some wider exercise.
>
> --
> Peter Wemm - pe...@wemm.org ; pe...@freebsd.org;
> pe...@yahoo-inc.com ; KI6FJV
> UTF-8: for when a ' or ... just won’t do…



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: PSA: If you run -current, beware!

2015-02-04 Thread Peter Wemm
On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
> On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> > Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
> > introduced to 11.x/head/-current.With HZ=1000 (the default for bare
> > metal, not for a vm); the clocks stop just after 24 days of uptime.  This
> > means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
> > time out or retransmit, etc etc.  It can get ugly.
> > 
> > The problem is NOT in 10.x/-stable.
> > 
> > We hit this in the freebsd.org cluster, the builds that we used are:
> > FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> > FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> > 
> > If you are running -current in a situation where it'll accumulate uptime,
> > you may want to take precautions.  A reboot prior to 24 days uptime (as
> > horrible a workaround as that is) will avoid it.
> > 
> > Yes, this is being worked on.
> 
> So the issue is reproducable in 3 minutes after boot with the following
> change in kern_clock.c:
> volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
> 
> It is fixed (in the proper meaning of the word, not like worked around,
> covered by paper) by the patch at the end of the mail.
> 
> We already have a story trying to enable much less ambitious option
> -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
> see other way than try one more time.  Too many places in kernel
> depend on the correctly wrapping 2-complement arithmetic, among others
> are callweel and scheduler.

Ugh.

I believe I have a smoking gun that suggests that the clock-stop problem is 
caused by the clang-3.5 import on Dec 31st.

Backstory:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://www.airs.com/blog/archives/120

I suspect that what has happened is that clang's optimizer got better at 
seeing the direct or indirect effects of integer overflow and clang (and gcc) 
take advantage of that.
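
To make the point concrete: a test such as "a + 1 < a" on signed ints relies on overflow, which is undefined in C, so a compiler may fold it to false unless -fwrapv is in effect. Below is a minimal sketch of an overflow check that stays within defined behaviour. The function name is hypothetical and this is not code from the tree.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Returns true if computing a + b as int would overflow.
 * Every operation here is well defined, so the optimizer cannot
 * discard the test the way it may discard "a + b < a".
 * Hypothetical helper, for illustration only.
 */
bool
add_would_overflow(int a, int b)
{
	if (b > 0)
		return (a > INT_MAX - b);	/* INT_MAX - b cannot overflow */
	return (a < INT_MIN - b);		/* INT_MIN - b cannot overflow */
}
```

The check is done before the addition instead of inspecting its result afterwards, which is the general shape such code has to take when the build does not use -fwrapv.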

I have used a slightly different change for about 10 years:

--- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
+++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
@@ -410,6 +415,11 @@
 #ifdef SW_WATCHDOG
        EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
 #endif
+       /*
+        * Arrange for ticks to go negative just 5 minutes after boot
+        * to help catch sign problems sooner.
+        */
+       ticks = INT_MAX - (hz * 5 * 60);
 }
 
 /*

This came about from when we had problems with integer overflow arithmetic in 
the tcp stack.

In any case, I'm in the process of adding -fwrapv and the early wraparound to 
the freebsd.org cluster builds to give it some wider exercise.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…



Re: PSA: If you run -current, beware!

2015-02-04 Thread Ed Maste
On 4 February 2015 at 09:29, Konstantin Belousov  wrote:
>
> So the issue is reproducable in 3 minutes after boot with the following
> change in kern_clock.c:
> volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
>
> It is fixed (in the proper meaning of the word, not like worked around,
> covered by paper) by the patch at the end of the mail.
>
> We already have a story trying to enable much less ambitious option
> -fno-strict-overflow, see r259045 and the revert in r259422.

Note that -fno-strict-overflow and -fwrapv are equivalent as far as
Clang is concerned:

|  // -fno-strict-overflow implies -fwrapv if it isn't disabled, but
|  // -fstrict-overflow won't turn off an explicitly enabled -fwrapv.
|  if (Arg *A = Args.getLastArg(options::OPT_fwrapv,
|   options::OPT_fno_wrapv)) {
|if (A->getOption().matches(options::OPT_fwrapv))
|  CmdArgs.push_back("-fwrapv");
|  } else if (Arg *A = Args.getLastArg(options::OPT_fstrict_overflow,
|  options::OPT_fno_strict_overflow)) {
|if (A->getOption().matches(options::OPT_fno_strict_overflow))
|  CmdArgs.push_back("-fwrapv");
|  }

> I do not see other way than try one more time.

Agreed.

As you noted elsewhere the original issue that triggered the revert
was fixed by r259609, so we should be able to just re-apply r259045.


Re: PSA: If you run -current, beware!

2015-02-04 Thread Konstantin Belousov
On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
> Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been 
> introduced to 11.x/head/-current.With HZ=1000 (the default for bare 
> metal, 
> not for a vm); the clocks stop just after 24 days of uptime.  This means 
> things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
> retransmit, etc etc.  It can get ugly.
> 
> The problem is NOT in 10.x/-stable.
> 
> We hit this in the freebsd.org cluster, the builds that we used are:
> FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> 
> If you are running -current in a situation where it'll accumulate uptime, you 
> may want to take precautions.  A reboot prior to 24 days uptime (as horrible 
> a 
> workaround as that is) will avoid it.
> 
> Yes, this is being worked on.

So the issue is reproducible 3 minutes after boot with the following
change in kern_clock.c:
volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);

It is fixed (in the proper sense of the word, not worked around or
papered over) by the patch at the end of the mail.

We already have a history of trying to enable the much less ambitious
option -fno-strict-overflow; see r259045 and the revert in r259422.  I do
not see any other way than to try one more time.  Too many places in the
kernel depend on correctly wrapping two's-complement arithmetic, among
them the callwheel and the scheduler.

diff --git a/sys/conf/kern.mk b/sys/conf/kern.mk
index c031b3a..eb7ce2f 100644
--- a/sys/conf/kern.mk
+++ b/sys/conf/kern.mk
@@ -158,6 +158,11 @@ INLINE_LIMIT?= 8000
 CFLAGS+=   -ffreestanding
 
 #
+# Make signed arithmetic wrap.
+#
+CFLAGS+=   -fwrapv
+
+#
 # GCC SSP support
 #
 .if ${MK_SSP} != "no" && \


Re: PSA: If you run -current, beware!

2015-02-03 Thread Ian Lepore
On Tue, 2015-02-03 at 13:33 -0800, Peter Wemm wrote:
> Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been 
> introduced to 11.x/head/-current.With HZ=1000 (the default for bare 
> metal, 
> not for a vm); the clocks stop just after 24 days of uptime.  This means 
> things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
> retransmit, etc etc.  It can get ugly.
> 
> The problem is NOT in 10.x/-stable.
> 
> We hit this in the freebsd.org cluster, the builds that we used are:
> FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
> FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
> 
> If you are running -current in a situation where it'll accumulate uptime, you 
> may want to take precautions.  A reboot prior to 24 days uptime (as horrible 
> a 
> workaround as that is) will avoid it.
> 
> Yes, this is being worked on.

FWIW, 24.8 days is the point at which an int32_t variable counting ticks
at 1 kHz rolls over from positive to negative numbers.
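
That figure can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are invented for the example and are not kernel interfaces, and it assumes the counter starts at 0 at boot.

```c
#include <stdint.h>

/*
 * Seconds of uptime before a signed 32-bit tick counter that starts at 0
 * and is incremented hz times per second goes past INT32_MAX.
 * Hypothetical helper, for illustration only.
 */
double
ticks_rollover_seconds(unsigned int hz)
{
	/* INT32_MAX + 1 increments take the counter from 0 to the wrap. */
	return (((double)INT32_MAX + 1.0) / (double)hz);
}

/* The same quantity expressed in days (86400 seconds per day). */
double
ticks_rollover_days(unsigned int hz)
{
	return (ticks_rollover_seconds(hz) / 86400.0);
}
```

With hz = 1000 this gives 2147483.648 seconds, or roughly 24.86 days; a tick rate 100 times higher brings it down to just under 6 hours.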

-- Ian




Re: PSA: If you run -current, beware!

2015-02-03 Thread Luigi Rizzo
On Tuesday, February 3, 2015, Peter Wemm  wrote:

> Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
> introduced to 11.x/head/-current.With HZ=1000 (the default for bare
> metal,
> not for a vm); the clocks stop just after 24 days of uptime.  This means


Signed 32-bit overflow, it seems from the numbers?  Wasn't that a Windows
feature in the old days? :)

Cheers
Luigi



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


PSA: If you run -current, beware!

2015-02-03 Thread Peter Wemm
Sometime in the Dec 10th through Jan 7th timeframe a timing bug was 
introduced to 11.x/head/-current.  With HZ=1000 (the default for bare metal, 
not for a vm); the clocks stop just after 24 days of uptime.  This means 
things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
retransmit, etc etc.  It can get ugly.

The problem is NOT in 10.x/-stable.

We hit this in the freebsd.org cluster, the builds that we used are:
FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken

If you are running -current in a situation where it'll accumulate uptime, you 
may want to take precautions.  A reboot prior to 24 days uptime (as horrible a 
workaround as that is) will avoid it.

Yes, this is being worked on.
-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…
