Re: PSA: If you run -current, beware!

2015-02-05 Thread John Baldwin
On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
 On Thursday, February 5, 2015, Peter Wemm pe...@wemm.org wrote:
  On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
   On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
    Sometime in the Dec 10th through Jan 7th timeframe a timing bug has
    been introduced to 11.x/head/-current.  With HZ=1000 (the default for
    bare metal, not for a vm), the clocks stop just after 24 days of
    uptime.  This means things like cron, sleep, timeouts etc stop
    working.  TCP/IP won't time out or retransmit, etc etc.  It can get
    ugly.
    
    The problem is NOT in 10.x/-stable.
    
    We hit this in the freebsd.org cluster, the builds that we used are:
    FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
    FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
    
    If you are running -current in a situation where it'll accumulate
    uptime, you may want to take precautions.  A reboot prior to 24 days
    uptime (as horrible a workaround as that is) will avoid it.
    
    Yes, this is being worked on.
   
   So the issue is reproducible in 3 minutes after boot with the following
   change in kern_clock.c:
   volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
   
   It is fixed (in the proper meaning of the word, not like worked around,
   covered by paper) by the patch at the end of the mail.
   
   We already have a story trying to enable much less ambitious option
   -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
   see any other way than to try one more time.  Too many places in the
   kernel depend on correctly wrapping two's-complement arithmetic, among
   others the callwheel and the scheduler.
 
 Rather than depending on a compiler option, wouldn't it be better/more
 robust to change ticks to unsigned, which has specified wrapping behavior?

Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler 
knows when it would apply these optimizations, it would be nice if it could 
warn instead (GCC apparently has a warning, but clang does not).  Having 
people do a manual audit of every signed integer expression in the tree will 
take a long time.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: PSA: If you run -current, beware!

2015-02-05 Thread Ed Maste
On 5 February 2015 at 02:48, Luigi Rizzo ri...@iet.unipi.it wrote:

 Rather than depending on a compiler option, wouldn't it be better/more
 robust to change ticks to unsigned, which has specified wrapping behavior?

I believe there are cases other than ticks that rely on two's-complement
signed wrap.  We'd want to make sure we find such cases.  Newer GCC can
help with that.  The -Wstrict-overflow flag causes the compiler to
warn when implementing an optimization based on undefined behaviour
from signed overflow.

Correct C code should work with or without -fwrapv, so we can do both:
enable -fwrapv, and make changes to stop relying on undefined
behaviour.  For ticks specifically we have many examples over time of
incorrect calculations so we'll benefit from some work here,
independent of signed overflow.
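
As a minimal sketch of code that behaves the same with or without -fwrapv, a
tick comparison can be written with unsigned arithmetic, whose wraparound is
defined by the C standard.  The helper name tick_before is hypothetical and
not taken from the FreeBSD tree, and the sketch assumes
UINT_MAX == 2 * INT_MAX + 1, which holds on common platforms.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if tick value a comes "before" b,
 * treating the counter as a circular space.  Unsigned subtraction wraps
 * by definition, so no signed overflow is involved.
 */
static bool
tick_before(unsigned int a, unsigned int b)
{
	/*
	 * The difference lands in the upper half of the unsigned range
	 * exactly when a is behind b by less than half the range.
	 */
	return ((a - b) > (unsigned int)INT_MAX);
}
```

The comparison only gives a meaningful answer when the two values are less
than half the counter range apart, which is the same restriction the signed
idiom has always had.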


Re: PSA: If you run -current, beware!

2015-02-05 Thread David Chisnall
On 5 Feb 2015, at 07:48, Luigi Rizzo ri...@iet.unipi.it wrote:
 
 Rather than depending on a compiler option, wouldn't it be better/more
 robust to change ticks to unsigned, which has specified wrapping behavior?

Especially if we want to extend support for external toolchains.  gcc and clang 
support -fwrapv (though occasionally versions of both will not fully support 
it), but other compilers may well not have an equivalent.

Translating the code into C is a far more robust solution than the band-aid of 
telling the compiler to accept a language that is a bit like C and hoping that 
this will keep working across compiler implementations and versions.

Adding -fwrapv also defeats a number of compiler optimisations, so we are going 
to generate worse code for places where people used signed types correctly to 
work around places where they were used incorrectly.

David
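
One commonly cited example of the optimisations at stake (it appears in the
LLVM blog post linked elsewhere in this thread, not in the FreeBSD tree) is a
loop with a signed induction variable and a "<=" bound.  The function name
sum_upto below is hypothetical.  Without -fwrapv the compiler may assume that
i never overflows, so the loop runs exactly n + 1 times; with -fwrapv it must
also allow for n == INT_MAX, where i would wrap and the loop would never
terminate.

```c
/*
 * Hypothetical illustration: sums 0..n inclusive.  The caller must not
 * pass INT_MAX, since i++ would then overflow a signed int.
 */
static long
sum_upto(int n)
{
	long total = 0;

	/* The trip count is n + 1 only if i is assumed not to wrap. */
	for (int i = 0; i <= n; i++)
		total += i;
	return (total);
}
```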



Re: PSA: If you run -current, beware!

2015-02-05 Thread Alfred Perlstein



On 2/5/15 11:00 AM, Peter Wemm wrote:

On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:

On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:

On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:

On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:

...


It is fixed (in the proper meaning of the word, not like worked
around,
covered by paper) by the patch at the end of the mail.

We already have a story trying to enable much less ambitious
option
-fno-strict-overflow, see r259045 and the revert in r259422.  I do
not
see other way than try one more time.  Too many places in kernel
depend on the correctly wrapping 2-complement arithmetic, among
others
are callwheel and scheduler.


Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping
behavior?


Yes, but non-trivial.  It's also not limited to ticks.  Since the
compiler
knows when it would apply these optimizations, it would be nice if it
could
warn instead (GCC apparently has a warning, but clang does not).  Having
people do a manual audit of every signed integer expression in the tree
will take a long time.


I think I misunderstood the problem as being limited to ticks,
which is probably only one symptom of a fundamental change in behaviour
of the compiler.
Still, it might be worthwhile start looking at ints that ought to be
implemented as u_int


I actually agree, I just think we are stuck with -fwrapv in the interval,
but it's probably not a short interval.  I think converting ticks to
unsigned would be a good first start.


For the record, I agree.  However, I suspect that attempts to do so will have
a non trivial number of bugs introduced.  We have a track record of recurring
problems with tcp sequence number space arithmetic and tcp timing, partly
because the wraparounds happen infrequently.

In the mean time, I feel that telling the compiler that it's OK to let it
behave the way we expect (vs actively sabotaging it) is a viable stopgap.



Seems like it would make sense to move these functions into files that
can be easily compiled outside of the kernel and then add unit tests.


I've done this before, to prove that larger pcb hashes help performance 
on large workloads.


-Alfred




Re: PSA: If you run -current, beware!

2015-02-05 Thread Peter Wemm
On Thursday, February 05, 2015 11:00:46 AM Peter Wemm wrote:
 On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:
  On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
   On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
   ...
   
   It is fixed (in the proper meaning of the word, not like worked
   around,
   covered by paper) by the patch at the end of the mail.
   
   We already have a story trying to enable much less ambitious
   option
   -fno-strict-overflow, see r259045 and the revert in r259422.  I
   do
   not
   see other way than try one more time.  Too many places in kernel
   depend on the correctly wrapping 2-complement arithmetic, among
   others
   are callwheel and scheduler.
 
 Rather than depending on a compiler option, wouldn't it be
 better/more
 robust to change ticks to unsigned, which has specified wrapping
 behavior?

Yes, but non-trivial.  It's also not limited to ticks.  Since the
compiler
knows when it would apply these optimizations, it would be nice if it
could
warn instead (GCC apparently has a warning, but clang does not). 
Having
people do a manual audit of every signed integer expression in the
tree
will take a long time.
   
   I think I misunderstood the problem as being limited to ticks,
   which is probably only one symptom of a fundamental change in behaviour
   of the compiler.
   Still, it might be worthwhile start looking at ints that ought to be
   implemented as u_int
  
  I actually agree, I just think we are stuck with -fwrapv in the interval,
  but it's probably not a short interval.  I think converting ticks to
  unsigned would be a good first start.
 
 For the record, I agree.  However, I suspect that attempts to do so will
 have a non trivial number of bugs introduced.  We have a track record of
 recurring problems with tcp sequence number space arithmetic and tcp
 timing, partly because the wraparounds happen infrequently.

BTW; anybody working on this will want to run with  kern.hz=100000  in 
loader.conf (or higher).  Having the clock tick 100 times faster speeds the 
rollover up from every ~25 days to every ~6 hours.  I don't know what the 
practical limit is but at some point it will cause sufficient pain due to 
contention that it won't be useful.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won\342\200\231t do\342\200\246
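
The rollover figures quoted above can be checked with a few lines of
arithmetic.  This is only a sketch: the function name seconds_to_rollover is
hypothetical, and it assumes a 32-bit int tick counter that starts at zero.

```c
#include <limits.h>

/*
 * Seconds of uptime before a signed int tick counter that started at 0
 * would step past INT_MAX, for a tick rate of hz ticks per second.
 */
static double
seconds_to_rollover(int hz)
{
	return (((double)INT_MAX + 1.0) / (double)hz);
}
```

With hz = 1000 this gives about 2147483.6 seconds, or roughly 24.9 days; with
hz = 100000 it gives about 21474.8 seconds, or just under 6 hours.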

signature.asc
Description: This is a digitally signed message part.


Re: PSA: If you run -current, beware!

2015-02-05 Thread Luigi Rizzo
On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
 On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
...
It is fixed (in the proper meaning of the word, not like worked around,
covered by paper) by the patch at the end of the mail.

We already have a story trying to enable much less ambitious option
-fno-strict-overflow, see r259045 and the revert in r259422.  I do not
see other way than try one more time.  Too many places in kernel
depend on the correctly wrapping 2-complement arithmetic, among others
are callwheel and scheduler.
  
  Rather than depending on a compiler option, wouldn't it be better/more
  robust to change ticks to unsigned, which has specified wrapping behavior?
 
 Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler 
 knows when it would apply these optimizations, it would be nice if it could 
 warn instead (GCC apparently has a warning, but clang does not).  Having 
 people do a manual audit of every signed integer expression in the tree will 
 take a long time.


I think I misunderstood the problem as being limited to ticks,
which is probably only one symptom of a fundamental change in behaviour
of the compiler.
Still, it might be worthwhile to start looking at ints that ought to be
implemented as u_int.

cheers
luigi


Re: PSA: If you run -current, beware!

2015-02-05 Thread Ryan Stone
On Wed, Feb 4, 2015 at 6:15 PM, Peter Wemm pe...@wemm.org wrote:
 --- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
 +++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
 @@ -410,6 +415,11 @@
  #ifdef SW_WATCHDOG
         EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
  #endif
 +       /*
 +        * Arrange for ticks to go negative just 5 minutes after boot
 +        * to help catch sign problems sooner.
 +        */
 +       ticks = INT_MAX - (hz * 5 * 60);
  }

Should we just commit this under #ifdef INVARIANTS?
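
A rough sketch of what that could look like, written as a standalone function
so it can be compiled outside the kernel.  The name initial_ticks is
hypothetical; the actual change would assign to the kernel's ticks variable,
and INVARIANTS is the existing kernel debugging option named above.

```c
#include <limits.h>

/*
 * Hypothetical helper: the starting value for the tick counter.
 * Assumes hz is small enough that hz * 5 * 60 does not itself overflow.
 */
static int
initial_ticks(int hz)
{
#ifdef INVARIANTS
	/* Debug kernels: wrap five minutes after boot to expose sign bugs. */
	return (INT_MAX - (hz * 5 * 60));
#else
	/* Other kernels: start counting from zero as usual. */
	(void)hz;
	return (0);
#endif
}
```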


Re: PSA: If you run -current, beware!

2015-02-05 Thread Brooks Davis
On Thu, Feb 05, 2015 at 10:48:54AM -0500, John Baldwin wrote:
 On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
  On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
   On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
  ...
  
  It is fixed (in the proper meaning of the word, not like worked
  around,
  covered by paper) by the patch at the end of the mail.
  
  We already have a story trying to enable much less ambitious option
  -fno-strict-overflow, see r259045 and the revert in r259422.  I do
  not
  see other way than try one more time.  Too many places in kernel
  depend on the correctly wrapping 2-complement arithmetic, among
  others
  are callwheel and scheduler.

Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping
behavior?
   
   Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler
   knows when it would apply these optimizations, it would be nice if it
   could
   warn instead (GCC apparently has a warning, but clang does not).  Having
   people do a manual audit of every signed integer expression in the tree
   will take a long time.
  
  I think I misunderstood the problem as being limited to ticks,
  which is probably only one symptom of a fundamental change in behaviour
  of the compiler.
  Still, it might be worthwhile start looking at ints that ought to be
  implemented as u_int
 
 I actually agree, I just think we are stuck with -fwrapv in the interval, but 
 it's probably not a short interval.  I think converting ticks to unsigned 
 would be a good first start.

In principle MIT's KINT tool should help here.  Unfortunately, it's based
on LLVM 3.1 and appears to be unmaintained.

-- Brooks


pgp5kXYYo2QhR.pgp
Description: PGP signature


Re: PSA: If you run -current, beware!

2015-02-05 Thread John Baldwin
On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
 On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
  On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
 ...
 
 It is fixed (in the proper meaning of the word, not like worked
 around,
 covered by paper) by the patch at the end of the mail.
 
 We already have a story trying to enable much less ambitious option
 -fno-strict-overflow, see r259045 and the revert in r259422.  I do
 not
 see other way than try one more time.  Too many places in kernel
 depend on the correctly wrapping 2-complement arithmetic, among
 others
 are callwheel and scheduler.
   
   Rather than depending on a compiler option, wouldn't it be better/more
   robust to change ticks to unsigned, which has specified wrapping
   behavior?
  
  Yes, but non-trivial.  It's also not limited to ticks.  Since the compiler
  knows when it would apply these optimizations, it would be nice if it
  could
  warn instead (GCC apparently has a warning, but clang does not).  Having
  people do a manual audit of every signed integer expression in the tree
  will take a long time.
 
 I think I misunderstood the problem as being limited to ticks,
 which is probably only one symptom of a fundamental change in behaviour
 of the compiler.
 Still, it might be worthwhile start looking at ints that ought to be
 implemented as u_int

I actually agree, I just think we are stuck with -fwrapv in the interval, but 
it's probably not a short interval.  I think converting ticks to unsigned 
would be a good first start.

-- 
John Baldwin


Re: PSA: If you run -current, beware!

2015-02-05 Thread Peter Wemm
On Thursday, February 05, 2015 10:48:54 AM John Baldwin wrote:
 On Thursday, February 05, 2015 04:22:23 PM Luigi Rizzo wrote:
  On Thu, Feb 05, 2015 at 08:21:45AM -0500, John Baldwin wrote:
   On Thursday, February 05, 2015 08:48:33 AM Luigi Rizzo wrote:
  ...
  
  It is fixed (in the proper meaning of the word, not like worked
  around,
  covered by paper) by the patch at the end of the mail.
  
  We already have a story trying to enable much less ambitious
  option
  -fno-strict-overflow, see r259045 and the revert in r259422.  I do
  not
  see other way than try one more time.  Too many places in kernel
  depend on the correctly wrapping 2-complement arithmetic, among
  others
  are callwheel and scheduler.

Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping
behavior?
   
   Yes, but non-trivial.  It's also not limited to ticks.  Since the
   compiler
   knows when it would apply these optimizations, it would be nice if it
   could
   warn instead (GCC apparently has a warning, but clang does not).  Having
   people do a manual audit of every signed integer expression in the tree
   will take a long time.
  
  I think I misunderstood the problem as being limited to ticks,
  which is probably only one symptom of a fundamental change in behaviour
  of the compiler.
  Still, it might be worthwhile start looking at ints that ought to be
  implemented as u_int
 
 I actually agree, I just think we are stuck with -fwrapv in the interval,
 but it's probably not a short interval.  I think converting ticks to
 unsigned would be a good first start.

For the record, I agree.  However, I suspect that attempts to do so will
introduce a non-trivial number of bugs.  We have a track record of recurring
problems with tcp sequence number space arithmetic and tcp timing, partly
because the wraparounds happen infrequently.

In the meantime, I feel that telling the compiler that it's OK to let it
behave the way we expect (vs actively sabotaging it) is a viable stopgap.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won\342\200\231t do\342\200\246



Re: PSA: If you run -current, beware!

2015-02-04 Thread Ed Maste
On 4 February 2015 at 09:29, Konstantin Belousov kostik...@gmail.com wrote:

 So the issue is reproducible in 3 minutes after boot with the following
 change in kern_clock.c:
 volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);

 It is fixed (in the proper meaning of the word, not like worked around,
 covered by paper) by the patch at the end of the mail.

 We already have a story trying to enable much less ambitious option
 -fno-strict-overflow, see r259045 and the revert in r259422.

Note that -fno-strict-overflow and -fwrapv are equivalent as far as
Clang is concerned:

|  // -fno-strict-overflow implies -fwrapv if it isn't disabled, but
|  // -fstrict-overflow won't turn off an explicitly enabled -fwrapv.
|  if (Arg *A = Args.getLastArg(options::OPT_fwrapv,
|                               options::OPT_fno_wrapv)) {
|    if (A->getOption().matches(options::OPT_fwrapv))
|      CmdArgs.push_back("-fwrapv");
|  } else if (Arg *A = Args.getLastArg(options::OPT_fstrict_overflow,
|                                      options::OPT_fno_strict_overflow)) {
|    if (A->getOption().matches(options::OPT_fno_strict_overflow))
|      CmdArgs.push_back("-fwrapv");
|  }

 I do not see other way than try one more time.

Agreed.

As you noted elsewhere the original issue that triggered the revert
was fixed by r259609, so we should be able to just re-apply r259045.


Re: PSA: If you run -current, beware!

2015-02-04 Thread Peter Wemm
On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
 On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
  Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
  introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
  metal, not for a vm); the clocks stop just after 24 days of uptime.  This
  means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
  time out or retransmit, etc etc.  It can get ugly.
  
  The problem is NOT in 10.x/-stable.
  
  We hit this in the freebsd.org cluster, the builds that we used are:
  FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
  FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
  
  If you are running -current in a situation where it'll accumulate uptime,
  you may want to take precautions.  A reboot prior to 24 days uptime (as
  horrible a workaround as that is) will avoid it.
  
  Yes, this is being worked on.
 
 So the issue is reproducible in 3 minutes after boot with the following
 change in kern_clock.c:
 volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
 
 It is fixed (in the proper meaning of the word, not like worked around,
 covered by paper) by the patch at the end of the mail.
 
 We already have a story trying to enable much less ambitious option
 -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
 see other way than try one more time.  Too many places in kernel
 depend on the correctly wrapping 2-complement arithmetic, among others
 are callwheel and scheduler.

Ugh.

I believe I have a smoking gun that suggests that the clock-stop problem is 
caused by the clang-3.5 import on Dec 31st.

Backstory:
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://www.airs.com/blog/archives/120

I suspect that what has happened is that clang's optimizer got better at 
seeing the direct or indirect effects of integer overflow and clang (and gcc) 
take advantage of that.
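
To illustrate the kind of code affected (this is a sketch, not code from the
FreeBSD tree): a check written as "a + b < a" for positive b relies on signed
overflow wrapping, so a compiler that assumes overflow cannot happen is free
to treat it as always false.  A check that compares against the limits before
adding stays well defined either way.  The name add_would_overflow is
hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reports whether a + b would overflow an int,
 * without ever performing an overflowing addition.
 */
static bool
add_would_overflow(int a, int b)
{
	if (b > 0)
		return (a > INT_MAX - b);	/* INT_MAX - b cannot overflow */
	return (a < INT_MIN - b);		/* INT_MIN - b cannot overflow */
}
```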

I have used a slightly different change for about 10 years:

--- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
+++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
@@ -410,6 +415,11 @@
 #ifdef SW_WATCHDOG
        EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
 #endif
+       /*
+        * Arrange for ticks to go negative just 5 minutes after boot
+        * to help catch sign problems sooner.
+        */
+       ticks = INT_MAX - (hz * 5 * 60);
 }
 
 /*

This came about from when we had problems with integer overflow arithmetic in 
the tcp stack.

In any case, I'm in the process of adding -fwrapv and the early wraparound to 
the freebsd.org cluster builds to give it some wider exercise.

-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won\342\200\231t do\342\200\246



Re: PSA: If you run -current, beware!

2015-02-04 Thread Luigi Rizzo
On Thursday, February 5, 2015, Peter Wemm pe...@wemm.org wrote:

 On Wednesday, February 04, 2015 04:29:41 PM Konstantin Belousov wrote:
  On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
   Sometime in the Dec 10th through Jan 7th timeframe a timing bug has
 been
   introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
   metal, not for a vm); the clocks stop just after 24 days of uptime.
 This
   means things like cron, sleep, timeouts etc stop working.  TCP/IP won't
   time out or retransmit, etc etc.  It can get ugly.
  
   The problem is NOT in 10.x/-stable.
  
   We hit this in the freebsd.org cluster, the builds that we used are:
   FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
   FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
  
   If you are running -current in a situation where it'll accumulate
 uptime,
   you may want to take precautions.  A reboot prior to 24 days uptime (as
   horrible a workaround as that is) will avoid it.
  
   Yes, this is being worked on.
 
  So the issue is reproducible in 3 minutes after boot with the following
  change in kern_clock.c:
  volatile int  ticks = INT_MAX - (/*hz*/1000 * 3 * 60);
 
  It is fixed (in the proper meaning of the word, not like worked around,
  covered by paper) by the patch at the end of the mail.
 
  We already have a story trying to enable much less ambitious option
  -fno-strict-overflow, see r259045 and the revert in r259422.  I do not
  see other way than try one more time.  Too many places in kernel
  depend on the correctly wrapping 2-complement arithmetic, among others
  are callwheel and scheduler.


Rather than depending on a compiler option, wouldn't it be better/more
robust to change ticks to unsigned, which has specified wrapping behavior?

Cheers
Luigi

 Ugh.

 I believe I have a smoking gun that suggests that the clock-stop problem is
 caused by the clang-3.5 import on Dec 31st.

 Backstory:
 http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
 http://www.airs.com/blog/archives/120

 I suspect that what has happened is that clang's optimizer got better at
 seeing the direct or indirect effects of integer overflow and clang (and
 gcc)
 take advantage of that.

 I have used a slightly different change for about 10 years:

 --- kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
 +++ kern/kern_clock.c   2014-12-01 15:42:21.707911656 -0800
 @@ -410,6 +415,11 @@
  #ifdef SW_WATCHDOG
         EVENTHANDLER_REGISTER(watchdog_list, watchdog_config, NULL, 0);
  #endif
 +       /*
 +        * Arrange for ticks to go negative just 5 minutes after boot
 +        * to help catch sign problems sooner.
 +        */
 +       ticks = INT_MAX - (hz * 5 * 60);
  }

  /*

 This came about from when we had problems with integer overflow arithmetic
 in
 the tcp stack.

 In any case, I'm in the process of adding -fwrapv and the early wraparound
 to
 the freebsd.org cluster builds to give it some wider exercise.

 --
 Peter Wemm - pe...@wemm.org; pe...@freebsd.org;
 pe...@yahoo-inc.com; KI6FJV
 UTF-8: for when a ' or ... just won\342\200\231t do\342\200\246



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: PSA: If you run -current, beware!

2015-02-04 Thread Konstantin Belousov
On Tue, Feb 03, 2015 at 01:33:15PM -0800, Peter Wemm wrote:
 Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been 
 introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
 metal, 
 not for a vm); the clocks stop just after 24 days of uptime.  This means 
 things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
 retransmit, etc etc.  It can get ugly.
 
 The problem is NOT in 10.x/-stable.
 
 We hit this in the freebsd.org cluster, the builds that we used are:
 FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
 FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
 
 If you are running -current in a situation where it'll accumulate uptime, you 
 may want to take precautions.  A reboot prior to 24 days uptime (as horrible 
 a 
 workaround as that is) will avoid it.
 
 Yes, this is being worked on.

So the issue is reproducible in 3 minutes after boot with the following
change in kern_clock.c:
volatile int ticks = INT_MAX - (/*hz*/1000 * 3 * 60);

It is fixed (in the proper meaning of the word, not like worked around,
covered by paper) by the patch at the end of the mail.

We already have a story trying to enable much less ambitious option
-fno-strict-overflow, see r259045 and the revert in r259422.  I do not
see any other way than to try one more time.  Too many places in the
kernel depend on correctly wrapping two's-complement arithmetic, among
others the callwheel and the scheduler.

diff --git a/sys/conf/kern.mk b/sys/conf/kern.mk
index c031b3a..eb7ce2f 100644
--- a/sys/conf/kern.mk
+++ b/sys/conf/kern.mk
@@ -158,6 +158,11 @@ INLINE_LIMIT?= 8000
 CFLAGS+=   -ffreestanding
 
 #
+# Make signed arithmetic wrap.
+#
+CFLAGS+=   -fwrapv
+
+#
 # GCC SSP support
 #
 .if ${MK_SSP} != no  \


Re: PSA: If you run -current, beware!

2015-02-03 Thread Luigi Rizzo
On Tuesday, February 3, 2015, Peter Wemm pe...@wemm.org wrote:

 Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been
 introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
 metal,
 not for a vm); the clocks stop just after 24 days of uptime.

  This means



Signed 32-bit overflow, it seems from the numbers?  Wasn't that a Windows
feature in the old days? :)

Cheers
Luigi



-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---


Re: PSA: If you run -current, beware!

2015-02-03 Thread Ian Lepore
On Tue, 2015-02-03 at 13:33 -0800, Peter Wemm wrote:
 Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been 
 introduced to 11.x/head/-current.  With HZ=1000 (the default for bare
 metal, 
 not for a vm); the clocks stop just after 24 days of uptime.  This means 
 things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
 retransmit, etc etc.  It can get ugly.
 
 The problem is NOT in 10.x/-stable.
 
 We hit this in the freebsd.org cluster, the builds that we used are:
 FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
 FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken
 
 If you are running -current in a situation where it'll accumulate uptime, you 
 may want to take precautions.  A reboot prior to 24 days uptime (as horrible 
 a 
 workaround as that is) will avoid it.
 
 Yes, this is being worked on.

FWIW, 24.8 days is the point at which an int32_t variable counting ticks
at 1 kHz rolls over from positive to negative numbers.

-- Ian




PSA: If you run -current, beware!

2015-02-03 Thread Peter Wemm
Sometime in the Dec 10th through Jan 7th timeframe a timing bug has been 
introduced to 11.x/head/-current.  With HZ=1000 (the default for bare metal,
not for a vm), the clocks stop just after 24 days of uptime.  This means
things like cron, sleep, timeouts etc stop working.  TCP/IP won't time out or 
retransmit, etc etc.  It can get ugly.

The problem is NOT in 10.x/-stable.

We hit this in the freebsd.org cluster, the builds that we used are:
FreeBSD 11.0-CURRENT #0 r275684: Wed Dec 10 20:38:43 UTC 2014 - fine
FreeBSD 11.0-CURRENT #0 r276779: Wed Jan  7 18:47:09 UTC 2015 - broken

If you are running -current in a situation where it'll accumulate uptime, you 
may want to take precautions.  A reboot prior to 24 days uptime (as horrible a 
workaround as that is) will avoid it.

Yes, this is being worked on.
-- 
Peter Wemm - pe...@wemm.org; pe...@freebsd.org; pe...@yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won\342\200\231t do\342\200\246
