Re: panic(9): set panicstr atomically

2021-05-25 Thread Theo de Raadt
gwes  wrote:

> On 5/25/21 10:26 AM, Theo de Raadt wrote:
> > Alexander Bluhm  wrote:
> >
> >> On Tue, May 25, 2021 at 04:15:26PM +0200, Mark Kettenis wrote:
> >>> Wouldn't be too hard.  But unless you're on a serial console, that
> >>> will probably be more than a screenful of information, so not terribly
> >>> useful.
> >> The most important things must fit on the first VGA screen.  Then
> >> user can make a photo.  We ship products to customers who only
> >> understand this simple step.  And when the machine panics, the admin
> >> might also panic.
> > Matches my experience.
> >
> > It will be very difficult to show state of multiple cpus on a small
> > screen.  On the other hand, only showing the first-panic can hide so
> > much.  These garbled reports are actually highlighting a not-simple
> > case.  There is big information in the garble.
> >
> Jumping in, sorry,
> 
> Some sort of pseudo-ordering where each CPU -mostly- gets to put out
> more than one character at a time - 2, 3, ... one line max?
> No guaranteed separation/ordering but might make decipherment easier.
> 
> No communication or dependence between CPUs needed.
> 
> Each CPU backs off for a -very- short time after N characters.
> Nanoseconds or microseconds should work. Initial delay M * CPU ?
> Spin after N? If clock is ticking, spin for x ticks?
> Formatting/output time per burst could (maybe) give a little fairness.
> 
> Apologies if this is nonsense or has been covered already.

That is not the kind of code you want anywhere near the "heading to ddb"
situation.

We are heading to ddb because something in the 'software machine state'
is busted.  We need to rely upon FEWER COMPONENTS still operating
normally, not more!



Re: panic(9): set panicstr atomically

2021-05-25 Thread gwes

On 5/25/21 10:26 AM, Theo de Raadt wrote:

Alexander Bluhm  wrote:


On Tue, May 25, 2021 at 04:15:26PM +0200, Mark Kettenis wrote:

Wouldn't be too hard.  But unless you're on a serial console, that
will probably be more than a screenful of information, so not terribly
useful.

The most important things must fit on the first VGA screen.  Then
user can make a photo.  We ship products to customers who only
understand this simple step.  And when the machine panics, the admin
might also panic.

Matches my experience.

It will be very difficult to show state of multiple cpus on a small
screen.  On the other hand, only showing the first-panic can hide so
much.  These garbled reports are actually highlighting a not-simple
case.  There is big information in the garble.


Jumping in, sorry,

Some sort of pseudo-ordering where each CPU -mostly- gets to put out
more than one character at a time - 2, 3, ... one line max?
No guaranteed separation/ordering but might make decipherment easier.

No communication or dependence between CPUs needed.

Each CPU backs off for a -very- short time after N characters.
Nanoseconds or microseconds should work. Initial delay M * CPU ?
Spin after N? If clock is ticking, spin for x ticks?
Formatting/output time per burst could (maybe) give a little fairness.

Apologies if this is nonsense or has been covered already.

Geoff Steckel



Re: panic(9): set panicstr atomically

2021-05-25 Thread Theo de Raadt
Alexander Bluhm  wrote:

> On Tue, May 25, 2021 at 04:15:26PM +0200, Mark Kettenis wrote:
> > Wouldn't be too hard.  But unless you're on a serial console, that
> > will probably be more than a screenful of information, so not terribly
> > useful.
> 
> The most important things must fit on the first VGA screen.  Then
> user can make a photo.  We ship products to customers who only
> understand this simple step.  And when the machine panics, the admin
> might also panic.

Matches my experience.

It will be very difficult to show state of multiple cpus on a small
screen.  On the other hand, only showing the first-panic can hide so
much.  These garbled reports are actually highlighting a not-simple
case.  There is big information in the garble.



Re: panic(9): set panicstr atomically

2021-05-25 Thread Alexander Bluhm
On Tue, May 25, 2021 at 04:15:26PM +0200, Mark Kettenis wrote:
> Wouldn't be too hard.  But unless you're on a serial console, that
> will probably be more than a screenful of information, so not terribly
> useful.

The most important things must fit on the first VGA screen.  Then
user can make a photo.  We ship products to customers who only
understand this simple step.  And when the machine panics, the admin
might also panic.

bluhm



Re: panic(9): set panicstr atomically

2021-05-25 Thread Mark Kettenis
> From: Scott Cheloha 
> Date: Tue, 25 May 2021 08:42:05 -0500
> 
> > On May 25, 2021, at 08:20, Theo de Raadt  wrote:
> > 
> > Scott Cheloha  wrote:
> > 
> >>> On Mon, May 24, 2021 at 10:12:53PM -0500, Scott Cheloha wrote:
> >>> On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
>  On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> > Given all of this, would it be better if secondary CPUs spin in
> > panic(9) instead of trying to print anything?
>  
>  The panic code should be as primitive as possible.  The garbled
>  output also tells me something.  Two CPUs are failing simultaneosly.
>  Please don't suppress that information.
>  
>  The crash is the problem, not the ugly printing.
> >>> 
> >>> I get where you're coming from in principle (simpler is better) but I
> >>> think you're prioritizing a minor concern over the bigger picture.
> >> 
> >> To be perfectly clear, I'm not talking about the garbled printing
> >> anymore.  I'm talking about *all* the code that can run from panic().
> >> There is a lot of code.  I think it would be better if we prevented
> >> multiple CPUs from running that code simultaneously by having
> >> secondary CPUs spin in panic() as visa@ suggested.
> > 
> > I think that is incorrect.  There are no "secondary CPUs", there are
> > only "not the first cpu to enter panic".
> 
> Right, exactly. Not sure what else to call
> them.  We are talking about the same thing.
> 
> > If the 2nd cpu to enter panic actually has a more relevant panic, now
> > it will be missed.
> 
> I said we could keep the printing.  I even
> kept the printing in my latest patch. So I
> don't know what you're arguing against
> here.

So the big risk with your diff is that you may end up spinning with
interrupts disabled at which point that CPU is dead.  If you try to
switch to it at that point, you'll lock up the machine.

> > Most of our users don't do the ddbcpu trace dance.  So I expect the
> > reports we get as a result of your change will be less enlightening.
> 
> This sounds like an argument for making the
> "ddbcpu trace dance" easier.  Or automating
> it entirely.
> 
> Personally I find it extremely annoying to flip
> from CPU to CPU to print all the traces.  It
> only gets more annoying as the core count
> grows.
> 
> How hard would it be to implement
> "show all trace" or something equivalent
> to simplify crash reporting for the end user?

Wouldn't be too hard.  But unless you're on a serial console, that
will probably be more than a screenful of information, so not terribly
useful.



Re: panic(9): set panicstr atomically

2021-05-25 Thread Theo de Raadt
Scott Cheloha  wrote:

> > If the 2nd cpu to enter panic actually has a more relevant panic, now
> > it will be missed.
> 
> I said we could keep the printing.  I even
> kept the printing in my latest patch. So I
> don't know what you're arguing against
> here.

You start spinning cpus inside panic.  I don't know if that breaks
the ddbcpu experience.

> Personally I find it extremely annoying to flip
> from CPU to CPU to print all the traces.  It
> only gets more annoying as the core count
> grows.
> 
> How hard would it be to implement
> "show all trace" or something equivalent
> to simplify crash reporting for the end user?

If there are too many lines if output, things may scroll out of view.
Excessively verbose context can be difficult to capture for reporting
purposes.

Here is a completely untested draft that would create a per-cpu panic
buffer, tries to make the top of of panic() operate on less global state,
and adjusts ddb "show panic" to show all the known panics.  This isn't
tested, and the db_show_panic_cmd change is half baked.

Index: arch/amd64/include/cpu.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.137
diff -u -p -u -r1.137 cpu.h
--- arch/amd64/include/cpu.h3 Jun 2020 06:54:04 -   1.137
+++ arch/amd64/include/cpu.h25 May 2021 13:25:13 -
@@ -208,6 +208,8 @@ struct cpu_info {
struct vmxon_region *ci_vmxon_region;
 
int64_t ci_tsc_skew;/* counter skew vs cpu0 */
+
+   charci_panicbuf[512];
 };
 
 #define CPUF_BSP   0x0001  /* CPU is the original BSP */
Index: kern/subr_prf.c
===
RCS file: /cvs/src/sys/kern/subr_prf.c,v
retrieving revision 1.103
diff -u -p -u -r1.103 subr_prf.c
--- kern/subr_prf.c 16 May 2021 15:10:20 -  1.103
+++ kern/subr_prf.c 25 May 2021 13:31:30 -
@@ -194,20 +194,19 @@ panic(const char *fmt, ...)
printf_flags |= TOCONS;
 
bootopt = RB_AUTOBOOT | RB_DUMP;
+
va_start(ap, fmt);
+   vsnprintf(curcpu()->ci_panicbuf, sizeof(curcpu()->ci_panicbuf), fmt, 
ap);
+   va_end(ap);
+
if (panicstr)
bootopt |= RB_NOSYNC;
else {
-   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
panicstr = panicbuf;
+   strlcpy(panicbuf, curcpu()->ci_panicbuf, sizeof panicbuf);
}
-   va_end(ap);
 
-   printf("panic: ");
-   va_start(ap, fmt);
-   vprintf(fmt, ap);
-   printf("\n");
-   va_end(ap);
+   printf("panic: %s\n", curcpu()->ci_panicbuf);
 
 #ifdef DDB
if (db_panic)
Index: ddb/db_command.c
===
RCS file: /cvs/src/sys/ddb/db_command.c,v
retrieving revision 1.90
diff -u -p -u -r1.90 db_command.c
--- ddb/db_command.c26 Oct 2020 18:53:20 -  1.90
+++ ddb/db_command.c25 May 2021 13:39:11 -
@@ -468,14 +468,23 @@ db_nfsnode_print_cmd(db_expr_t addr, int
 void
 db_show_panic_cmd(db_expr_t addr, int have_addr, db_expr_t count, char *modif)
 {
-   if (panicstr)
+   if (panicstr) {
+#ifdef MULTIPROCESSOR
+   int i;
+
+   for (i = 0; i < MAXCPUS; i++) {
+   if (cpu_info[i] != NULL && curcpu()->ci_panicbuf[0])
+   db_printf("%d: %s\n", CPU_INFO_UNIT(curcpu()),
+   curcpu()->ci_panicbuf);
+   }
+#else
db_printf("%s\n", panicstr);
-   else if (faultstr) {
+#endif
+   } else if (faultstr) {
db_printf("kernel page fault\n");
db_printf("%s\n", faultstr);
db_stack_trace_print(addr, have_addr, 1, modif, db_printf);
-   }
-   else
+   } else
db_printf("the kernel did not panic\n");/* yet */
 }
 



Re: panic(9): set panicstr atomically

2021-05-25 Thread Scott Cheloha
> On May 25, 2021, at 08:20, Theo de Raadt  wrote:
> 
> Scott Cheloha  wrote:
> 
>>> On Mon, May 24, 2021 at 10:12:53PM -0500, Scott Cheloha wrote:
>>> On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
 On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> Given all of this, would it be better if secondary CPUs spin in
> panic(9) instead of trying to print anything?
 
 The panic code should be as primitive as possible.  The garbled
 output also tells me something.  Two CPUs are failing simultaneosly.
 Please don't suppress that information.
 
 The crash is the problem, not the ugly printing.
>>> 
>>> I get where you're coming from in principle (simpler is better) but I
>>> think you're prioritizing a minor concern over the bigger picture.
>> 
>> To be perfectly clear, I'm not talking about the garbled printing
>> anymore.  I'm talking about *all* the code that can run from panic().
>> There is a lot of code.  I think it would be better if we prevented
>> multiple CPUs from running that code simultaneously by having
>> secondary CPUs spin in panic() as visa@ suggested.
> 
> I think that is incorrect.  There are no "secondary CPUs", there are
> only "not the first cpu to enter panic".

Right, exactly. Not sure what else to call
them.  We are talking about the same thing.

> If the 2nd cpu to enter panic actually has a more relevant panic, now
> it will be missed.

I said we could keep the printing.  I even
kept the printing in my latest patch. So I
don't know what you're arguing against
here.

> Most of our users don't do the ddbcpu trace dance.  So I expect the
> reports we get as a result of your change will be less enlightening.

This sounds like an argument for making the
"ddbcpu trace dance" easier.  Or automating
it entirely.

Personally I find it extremely annoying to flip
from CPU to CPU to print all the traces.  It
only gets more annoying as the core count
grows.

How hard would it be to implement
"show all trace" or something equivalent
to simplify crash reporting for the end user?



Re: panic(9): set panicstr atomically

2021-05-25 Thread Theo de Raadt
Scott Cheloha  wrote:

> On Mon, May 24, 2021 at 10:12:53PM -0500, Scott Cheloha wrote:
> > On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
> > > On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> > > > Given all of this, would it be better if secondary CPUs spin in
> > > > panic(9) instead of trying to print anything?
> > > 
> > > The panic code should be as primitive as possible.  The garbled
> > > output also tells me something.  Two CPUs are failing simultaneosly.
> > > Please don't suppress that information.
> > > 
> > > The crash is the problem, not the ugly printing.
> > 
> > I get where you're coming from in principle (simpler is better) but I
> > think you're prioritizing a minor concern over the bigger picture.
> 
> To be perfectly clear, I'm not talking about the garbled printing
> anymore.  I'm talking about *all* the code that can run from panic().
> There is a lot of code.  I think it would be better if we prevented
> multiple CPUs from running that code simultaneously by having
> secondary CPUs spin in panic() as visa@ suggested.

I think that is incorrect.  There are no "secondary CPUs", there are
only "not the first cpu to enter panic".

If the 2nd cpu to enter panic actually has a more relevant panic, now
it will be missed.

Most of our users don't do the ddbcpu trace dance.  So I expect the
reports we get as a result of your change will be less enlightening.



Re: panic(9): set panicstr atomically

2021-05-24 Thread Theo de Raadt
Scott Cheloha  wrote:

> On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
> > On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> > > Given all of this, would it be better if secondary CPUs spin in
> > > panic(9) instead of trying to print anything?
> > 
> > The panic code should be as primitive as possible.  The garbled
> > output also tells me something.  Two CPUs are failing simultaneosly.
> > Please don't suppress that information.
> > 
> > The crash is the problem, not the ugly printing.
> 
> I get where you're coming from in principle (simpler is better) but I
> think you're prioritizing a minor concern over the bigger picture.

Actually, I think you've got it wrong, and risk capturing the least
important panic.

I suspect we can check for not now, by looking at what the first character
in the panics looks like, to decode which panic happened first.  I think
I did this before, and the first one through the gate wasn't the important one.

Doesn't it break parts of ddb with (;;) CPU_BUSY_CYCLE() ?

> I think it is strictly better for one CPU to run the panic code than
> to have two CPUs doing the same.  If the panic code is primitive then
> it probably isn't MP-safe.

But it is primitive. I suspect parallel panic's are not corrupting a
large amount of state.

> Further, what about kernels without ddb(4)?  What happens if two or
> more CPUs all simultaneously call reboot() at the end of panic(9)?  It
> would vary by platform at a minimum.  Wouldn't deterministic behavior
> be better in that case?

Which kernels are those?  Just ramdisks, which are MP.  This situation
does not happen.

> We can keep the garbled printing if you like, I can see how it works
> as a diagnostic feature, but I think the losing CPUs should spin in
> panic.  Fewer things can go wrong.

How many times must it be repeated:  Then you only see the first cpu
which manages to reach panic().  Any other cpu no longer provides a
diagnostic.  They'll be spinning and noone knows why, the message is
entirely lost.


I would like to reiterate that I would be very happy if every cpuinfo
has an independet panic buffer, and *THAT* buffer can be updated async
by every cpu first, before deciding to print.  Heck, in that case the
printing code can show only "one" panic message, and then say
"more panics: use show panic".

At which point the garble is gone, we can hopefully gravitate
towards people knowing to run "show panic" to see multiple reasons.

> What about something like this?
> 
> Index: subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 subr_prf.c
> --- subr_prf.c16 May 2021 15:10:20 -  1.103
> +++ subr_prf.c25 May 2021 03:09:19 -
> @@ -184,31 +184,50 @@ void
>  panic(const char *fmt, ...)
>  {
>   static char panicbuf[512];
> - int bootopt;
> + static struct cpu_info *paniccpu;
> + struct cpu_info *ci;
> + int bootopt, recursive, spin;
>   va_list ap;
>  
> - /* do not trigger assertions, we know that we are inconsistent */
> - splassert_ctl = 0;
> + ci = curcpu();
> + recursive = 0;
> + spin = 0;
>  
> - /* make sure we see kernel printf output */
> - printf_flags |= TOCONS;
> + if (atomic_cas_ptr(, NULL, ci) != NULL) {
> + if (paniccpu != ci)
> + spin = 1;
> + else
> + recursive = 1;
> + } else {
> + panicstr = panicbuf;
> +
> + /* do not trigger assertions, we know that we are inconsistent 
> */
> + splassert_ctl = 0;
> +
> + /* make sure we see kernel printf output */
> + printf_flags |= TOCONS;
> + }
>  
>   bootopt = RB_AUTOBOOT | RB_DUMP;
> - va_start(ap, fmt);
> - if (panicstr)
> +
> + if (recursive || spin) {
>   bootopt |= RB_NOSYNC;
> - else {
> + printf("panic: ");
> + va_start(ap, fmt);
> + vprintf(fmt, ap);
> + va_end(ap);
> + printf("\n");
> + } else {
> + va_start(ap, fmt);
>   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> - panicstr = panicbuf;
> + va_end(ap);
> + printf("panic: %s\n", panicbuf);
>   }
> - va_end(ap);
> -
> - printf("panic: ");
> - va_start(ap, fmt);
> - vprintf(fmt, ap);
> - printf("\n");
> - va_end(ap);
>  
> + if (spin) {
> + for (;;)
> + CPU_BUSY_CYCLE();
> + }
>  #ifdef DDB
>   if (db_panic)
>   db_enter();
> 



Re: panic(9): set panicstr atomically

2021-05-24 Thread Scott Cheloha
On Mon, May 24, 2021 at 10:12:53PM -0500, Scott Cheloha wrote:
> On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
> > On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> > > Given all of this, would it be better if secondary CPUs spin in
> > > panic(9) instead of trying to print anything?
> > 
> > The panic code should be as primitive as possible.  The garbled
> > output also tells me something.  Two CPUs are failing simultaneosly.
> > Please don't suppress that information.
> > 
> > The crash is the problem, not the ugly printing.
> 
> I get where you're coming from in principle (simpler is better) but I
> think you're prioritizing a minor concern over the bigger picture.

To be perfectly clear, I'm not talking about the garbled printing
anymore.  I'm talking about *all* the code that can run from panic().
There is a lot of code.  I think it would be better if we prevented
multiple CPUs from running that code simultaneously by having
secondary CPUs spin in panic() as visa@ suggested.



Re: panic(9): set panicstr atomically

2021-05-24 Thread Scott Cheloha
On Sat, May 22, 2021 at 01:35:53AM +0200, Alexander Bluhm wrote:
> On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> > Given all of this, would it be better if secondary CPUs spin in
> > panic(9) instead of trying to print anything?
> 
> The panic code should be as primitive as possible.  The garbled
> output also tells me something.  Two CPUs are failing simultaneosly.
> Please don't suppress that information.
> 
> The crash is the problem, not the ugly printing.

I get where you're coming from in principle (simpler is better) but I
think you're prioritizing a minor concern over the bigger picture.

I think it is strictly better for one CPU to run the panic code than
to have two CPUs doing the same.  If the panic code is primitive then
it probably isn't MP-safe.

Further, what about kernels without ddb(4)?  What happens if two or
more CPUs all simultaneously call reboot() at the end of panic(9)?  It
would vary by platform at a minimum.  Wouldn't deterministic behavior
be better in that case?

We can keep the garbled printing if you like, I can see how it works
as a diagnostic feature, but I think the losing CPUs should spin in
panic.  Fewer things can go wrong.

What about something like this?

Index: subr_prf.c
===
RCS file: /cvs/src/sys/kern/subr_prf.c,v
retrieving revision 1.103
diff -u -p -r1.103 subr_prf.c
--- subr_prf.c  16 May 2021 15:10:20 -  1.103
+++ subr_prf.c  25 May 2021 03:09:19 -
@@ -184,31 +184,50 @@ void
 panic(const char *fmt, ...)
 {
static char panicbuf[512];
-   int bootopt;
+   static struct cpu_info *paniccpu;
+   struct cpu_info *ci;
+   int bootopt, recursive, spin;
va_list ap;
 
-   /* do not trigger assertions, we know that we are inconsistent */
-   splassert_ctl = 0;
+   ci = curcpu();
+   recursive = 0;
+   spin = 0;
 
-   /* make sure we see kernel printf output */
-   printf_flags |= TOCONS;
+   if (atomic_cas_ptr(, NULL, ci) != NULL) {
+   if (paniccpu != ci)
+   spin = 1;
+   else
+   recursive = 1;
+   } else {
+   panicstr = panicbuf;
+
+   /* do not trigger assertions, we know that we are inconsistent 
*/
+   splassert_ctl = 0;
+
+   /* make sure we see kernel printf output */
+   printf_flags |= TOCONS;
+   }
 
bootopt = RB_AUTOBOOT | RB_DUMP;
-   va_start(ap, fmt);
-   if (panicstr)
+
+   if (recursive || spin) {
bootopt |= RB_NOSYNC;
-   else {
+   printf("panic: ");
+   va_start(ap, fmt);
+   vprintf(fmt, ap);
+   va_end(ap);
+   printf("\n");
+   } else {
+   va_start(ap, fmt);
vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
-   panicstr = panicbuf;
+   va_end(ap);
+   printf("panic: %s\n", panicbuf);
}
-   va_end(ap);
-
-   printf("panic: ");
-   va_start(ap, fmt);
-   vprintf(fmt, ap);
-   printf("\n");
-   va_end(ap);
 
+   if (spin) {
+   for (;;)
+   CPU_BUSY_CYCLE();
+   }
 #ifdef DDB
if (db_panic)
db_enter();



Re: panic(9): set panicstr atomically

2021-05-21 Thread Alexander Bluhm
On Fri, May 21, 2021 at 02:00:54PM -0500, Scott Cheloha wrote:
> Given all of this, would it be better if secondary CPUs spin in
> panic(9) instead of trying to print anything?

The panic code should be as primitive as possible.  The garbled
output also tells me something.  Two CPUs are failing simultaneosly.
Please don't suppress that information.

The crash is the problem, not the ugly printing.

bluhm



Re: panic(9): set panicstr atomically

2021-05-21 Thread Scott Cheloha
On Sat, May 15, 2021 at 01:15:28PM +0200, Mark Kettenis wrote:
> > Date: Sat, 15 May 2021 11:06:39 +
> > From: Visa Hankala 
> > 
> > On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> > > In a separate mail thread, bluhm@ mentioned that panic(9) does not
> > > cleanly handle multiple CPUs entering it simultaneously:
> > > 
> > > https://marc.info/?l=openbsd-tech=161908805925325=2
> > > 
> > > I'm unsure which part of panic(9) is causing the problem he mentions,
> > > but one obvious issue I see is that panicstr is not set atomically,
> > > so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> > > 
> > > If we set panicstr atomically only one CPU will write panicbuf.
> > 
> > I think most of the clobbering is explained by more than one CPU writing
> > to the console at the same time. The vsnprintf() and setting of panicstr
> > usually happen quickly, so the kind of garbling occasionally seen with
> > nearly simultaneous panicking is not likely to arise there. Console I/O,
> > on the other hand, can be orders of magnitude slower. That, and the fact
> > that mutexes become no-ops once panicstr is set, create a slow phase
> > where multiple CPUs can easily be concurrently even if the initial
> > timings were not so close after all.
> > 
> > I feel that panic() should let only the first panicker run the panic
> > code and stop any other CPUs, like NetBSD does. Another option is to
> > serialize panic() in a more proper way. Or maybe secondary panickers
> > should just delay a little at the start of panic()...
> 
> The problem with serializing panics is that you increase the risk that
> you deadlock and don't see any messages at all...

Given all of this, would it be better if secondary CPUs spin in
panic(9) instead of trying to print anything?

Serializing passage through panic(9) is probably impossible and I'm
not even sure we'd want to do so.

We _could_ add a delay for secondary panickers, but that seems to work
around the problem instead of solving it.

However, if we just designate a "winner" in panic(9) and have the
losers spin then we solve the garbled output problem completely.  The
winning CPU can still check if they have recursively panicked and act
accordingly.

Then if you make it into the debugger you can still check the stack
traces on other CPUs and discover simulpanics, so we don't really lose
much by having the secondary CPUs not print anything.

Index: subr_prf.c
===
RCS file: /cvs/src/sys/kern/subr_prf.c,v
retrieving revision 1.103
diff -u -p -r1.103 subr_prf.c
--- subr_prf.c  16 May 2021 15:10:20 -  1.103
+++ subr_prf.c  21 May 2021 18:53:54 -
@@ -184,9 +184,22 @@ void
 panic(const char *fmt, ...)
 {
static char panicbuf[512];
-   int bootopt;
+   static struct cpu_info *paniccpu;
+   struct cpu_info *ci;
+   int bootopt, recursive;
va_list ap;
 
+   ci = curcpu();
+   recursive = 0;
+
+   if (atomic_cas_ptr(, NULL, ci) != NULL) {
+   if (paniccpu != ci) {
+   for (;;)
+   CPU_BUSY_CYCLE();
+   }
+   recursive = 1;
+   }
+
/* do not trigger assertions, we know that we are inconsistent */
splassert_ctl = 0;
 
@@ -194,20 +207,21 @@ panic(const char *fmt, ...)
printf_flags |= TOCONS;
 
bootopt = RB_AUTOBOOT | RB_DUMP;
-   va_start(ap, fmt);
-   if (panicstr)
+
+   if (recursive) {
bootopt |= RB_NOSYNC;
-   else {
+   printf("panic: ");
+   va_start(ap, fmt);
+   vprintf(fmt, ap);
+   va_end(ap);
+   printf("\n");
+   } else {
+   va_start(ap, fmt);
vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
+   va_end(ap);
panicstr = panicbuf;
+   printf("panic: %s\n", panicbuf);
}
-   va_end(ap);
-
-   printf("panic: ");
-   va_start(ap, fmt);
-   vprintf(fmt, ap);
-   printf("\n");
-   va_end(ap);
 
 #ifdef DDB
if (db_panic)



Re: panic(9): set panicstr atomically

2021-05-15 Thread Mark Kettenis
> Date: Sat, 15 May 2021 11:06:39 +
> From: Visa Hankala 
> 
> On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> > In a separate mail thread, bluhm@ mentioned that panic(9) does not
> > cleanly handle multiple CPUs entering it simultaneously:
> > 
> > https://marc.info/?l=openbsd-tech=161908805925325=2
> > 
> > I'm unsure which part of panic(9) is causing the problem he mentions,
> > but one obvious issue I see is that panicstr is not set atomically,
> > so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> > 
> > If we set panicstr atomically only one CPU will write panicbuf.
> 
> I think most of the clobbering is explained by more than one CPU writing
> to the console at the same time. The vsnprintf() and setting of panicstr
> usually happen quickly, so the kind of garbling occasionally seen with
> nearly simultaneous panicking is not likely to arise there. Console I/O,
> on the other hand, can be orders of magnitude slower. That, and the fact
> that mutexes become no-ops once panicstr is set, create a slow phase
> where multiple CPUs can easily be concurrently even if the initial
> timings were not so close after all.
> 
> I feel that panic() should let only the first panicker run the panic
> code and stop any other CPUs, like NetBSD does. Another option is to
> serialize panic() in a more proper way. Or maybe secondary panickers
> should just delay a little at the start of panic()...

The problem with serializing panics is that you increase the risk that
you deadlock and don't see any messages at all...



Re: panic(9): set panicstr atomically

2021-05-15 Thread Visa Hankala
On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
> 
> https://marc.info/?l=openbsd-tech=161908805925325=2
> 
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> 
> If we set panicstr atomically only one CPU will write panicbuf.

I think most of the clobbering is explained by more than one CPU writing
to the console at the same time. The vsnprintf() and setting of panicstr
usually happen quickly, so the kind of garbling occasionally seen with
nearly simultaneous panicking is not likely to arise there. Console I/O,
on the other hand, can be orders of magnitude slower. That, and the fact
that mutexes become no-ops once panicstr is set, create a slow phase
where multiple CPUs can easily be concurrently even if the initial
timings were not so close after all.

I feel that panic() should let only the first panicker run the panic
code and stop any other CPUs, like NetBSD does. Another option is to
serialize panic() in a more proper way. Or maybe secondary panickers
should just delay a little at the start of panic()...



Re: panic(9): set panicstr atomically

2021-05-13 Thread Jeremie Courreges-Anglas
On Thu, May 13 2021, Jeremie Courreges-Anglas  wrote:
> On Wed, May 12 2021, Scott Cheloha  wrote:
>> Hi,
>>
>> In a separate mail thread, bluhm@ mentioned that panic(9) does not
>> cleanly handle multiple CPUs entering it simultaneously:
>>
>> https://marc.info/?l=openbsd-tech=161908805925325=2
>>
>> I'm unsure which part of panic(9) is causing the problem he mentions,
>> but one obvious issue I see is that panicstr is not set atomically,
>> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
>>
>> If we set panicstr atomically only one CPU will write panicbuf.
>>
>> Thoughts?
>
> My understanding is that to set panicstr atomically and avoid clobbering,
> you only need atomic_cas_ptr.
>
> Your diff makes panicstr point to volatile data.  If intended, what's
> the rationale?  I don't see how it helps avoid races when one cpu
> "acquires" panicstr and then writes to panicbuf, and another CPU also
> enters panic but just tries to print panicbuf.

Maybe I should stress more that your diff looks like an improvement
overall and it solves a real world problem.  Even if I don't understand
the volatile addition, it probably can't hurt as is, so feel free to go
with other folks' oks.  :)

> If the goal was to mark panicstr itself as volatile, then you're
> probably trying to solve another problem (making writes to panicstr
> more instantly/widely visible?) and I *do not know* if that's the right
> approach (or even needed), but volatile comes with a cost.
>
> Take all of this with a grain of salt, I'm not in familiar territory
> here.
>
>> Index: kern/subr_prf.c
>> ===
>> RCS file: /cvs/src/sys/kern/subr_prf.c,v
>> retrieving revision 1.102
>> diff -u -p -r1.102 subr_prf.c
>> --- kern/subr_prf.c  28 Nov 2020 17:53:05 -  1.102
>> +++ kern/subr_prf.c  13 May 2021 00:04:28 -
>> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>>   */
>>  
>>  extern  int log_open;   /* subr_log: is /dev/klog open? */
>> -const   char *panicstr; /* arg to first call to panic (used as a flag
>> +volatile const char *panicstr; /* arg to first call to panic (used as a flag
>> to indicate that panic has already been called). */
>>  const   char *faultstr; /* page fault string */
>>  #ifdef DDB
>> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>>  
>>  bootopt = RB_AUTOBOOT | RB_DUMP;
>>  va_start(ap, fmt);
>> -if (panicstr)
>> +if (atomic_cas_ptr(, NULL, panicbuf) != NULL)
>>  bootopt |= RB_NOSYNC;
>> -else {
>> +else
>>  vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
>> -panicstr = panicbuf;
>> -}
>>  va_end(ap);
>>  
>>  printf("panic: ");
>> Index: sys/systm.h
>> ===
>> RCS file: /cvs/src/sys/sys/systm.h,v
>> retrieving revision 1.153
>> diff -u -p -r1.153 systm.h
>> --- sys/systm.h  28 Apr 2021 09:42:04 -  1.153
>> +++ sys/systm.h  13 May 2021 00:04:28 -
>> @@ -71,7 +71,7 @@
>>   * patched by a stalking hacker.
>>   */
>>  extern int securelevel; /* system security level */
>> -extern const char *panicstr;/* panic message */
>> +extern volatile const char *panicstr;   /* panic message */
>>  extern const char *faultstr;/* fault message */
>>  extern const char version[];/* system version */
>>  extern const char copyright[];  /* system copyright */
>>

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: panic(9): set panicstr atomically

2021-05-13 Thread Jeremie Courreges-Anglas
On Wed, May 12 2021, Scott Cheloha  wrote:
> Hi,
>
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
>
> https://marc.info/?l=openbsd-tech=161908805925325=2
>
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
>
> If we set panicstr atomically only one CPU will write panicbuf.
>
> Thoughts?

My understanding is that to set panicstr atomically and avoid clobbering,
you only need atomic_cas_ptr.

Your diff makes panicstr point to volatile data.  If intended, what's
the rationale?  I don't see how it helps avoid races when one cpu
"acquires" panicstr and then writes to panicbuf, and another CPU also
enters panic but just tries to print panicbuf.

If the goal was to mark panicstr itself as volatile, then you're
probably trying to solve another problem (making writes to panicstr
more instantly/widely visible?) and I *do not know* if that's the right
approach (or even needed), but volatile comes with a cost.

Take all of this with a grain of salt, I'm not in familiar territory
here.

> Index: kern/subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.102
> diff -u -p -r1.102 subr_prf.c
> --- kern/subr_prf.c   28 Nov 2020 17:53:05 -  1.102
> +++ kern/subr_prf.c   13 May 2021 00:04:28 -
> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>   */
>  
>  extern   int log_open;   /* subr_log: is /dev/klog open? */
> -constchar *panicstr; /* arg to first call to panic (used as a flag
> +volatile const char *panicstr; /* arg to first call to panic (used as a flag
>  to indicate that panic has already been called). */
>  constchar *faultstr; /* page fault string */
>  #ifdef DDB
> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>  
>   bootopt = RB_AUTOBOOT | RB_DUMP;
>   va_start(ap, fmt);
> - if (panicstr)
> + if (atomic_cas_ptr(, NULL, panicbuf) != NULL)
>   bootopt |= RB_NOSYNC;
> - else {
> + else
>   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> - panicstr = panicbuf;
> - }
>   va_end(ap);
>  
>   printf("panic: ");
> Index: sys/systm.h
> ===
> RCS file: /cvs/src/sys/sys/systm.h,v
> retrieving revision 1.153
> diff -u -p -r1.153 systm.h
> --- sys/systm.h   28 Apr 2021 09:42:04 -  1.153
> +++ sys/systm.h   13 May 2021 00:04:28 -
> @@ -71,7 +71,7 @@
>   * patched by a stalking hacker.
>   */
>  extern int securelevel;  /* system security level */
> -extern const char *panicstr; /* panic message */
> +extern volatile const char *panicstr;/* panic message */
>  extern const char *faultstr; /* fault message */
>  extern const char version[]; /* system version */
>  extern const char copyright[];   /* system copyright */
>

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE



Re: panic(9): set panicstr atomically

2021-05-13 Thread Alexander Bluhm
On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> I'm unsure which part of panic(9) is causing the problem he mentions,

I was talking about this:

r620-1# papnpaiancini:cc :p :op
opooolo_llc_ac_caccahhceh_ei_eti_tieetmme_mm__amgamigacigci__cc_hccehhcekcekc::
k :m  bmubmfubfuppflp llc pc pcuup  uf rfferree eel el iilsitss tm tom
omddoidfiiifeifeidde:d ::i ti etietmme m
a  daddardd rd0 r0
xx0fxfddf88d08c0cc0c6c76afc9b3f04500400++01+61 610 6x0
fx0fxffdffdf88d08
00020720d72a8c0049703eb!ef!e==!0=x009x59x95995b9ebbaee3ae3ae344ef54f5a4bff7db07990a9

It is serial console output caused by the vprintf(fmt, ap) a few
lines below.  Calling printf() from mutliple CPUs.

> If we set panicstr atomically only one CPU will write panicbuf.

We were lucky and the panic string contains only one of them.
Probably the final one.

ddb{4}> show panic
pool_cache_item_magic_check: mbufpl cpu free list modified: item addr
0xfd80ccca9300+16 0xfd80027dc47e!=0x959bea3e4ffab79a

You diff guarantees that it constains the first panic.  This is
likely the root of the problem.  Setting RB_NOSYNC reliably for all
other CPU is also good.

OK bluhm@

> Index: kern/subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.102
> diff -u -p -r1.102 subr_prf.c
> --- kern/subr_prf.c   28 Nov 2020 17:53:05 -  1.102
> +++ kern/subr_prf.c   13 May 2021 00:04:28 -
> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>   */
>  
>  extern   int log_open;   /* subr_log: is /dev/klog open? */
> -constchar *panicstr; /* arg to first call to panic (used as a flag
> +volatile const char *panicstr; /* arg to first call to panic (used as a flag
>  to indicate that panic has already been called). */
>  constchar *faultstr; /* page fault string */
>  #ifdef DDB
> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>  
>   bootopt = RB_AUTOBOOT | RB_DUMP;
>   va_start(ap, fmt);
> - if (panicstr)
> + if (atomic_cas_ptr(, NULL, panicbuf) != NULL)
>   bootopt |= RB_NOSYNC;
> - else {
> + else
>   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> - panicstr = panicbuf;
> - }
>   va_end(ap);
>  
>   printf("panic: ");
> Index: sys/systm.h
> ===
> RCS file: /cvs/src/sys/sys/systm.h,v
> retrieving revision 1.153
> diff -u -p -r1.153 systm.h
> --- sys/systm.h   28 Apr 2021 09:42:04 -  1.153
> +++ sys/systm.h   13 May 2021 00:04:28 -
> @@ -71,7 +71,7 @@
>   * patched by a stalking hacker.
>   */
>  extern int securelevel;  /* system security level */
> -extern const char *panicstr; /* panic message */
> +extern volatile const char *panicstr;/* panic message */
>  extern const char *faultstr; /* fault message */
>  extern const char version[]; /* system version */
>  extern const char copyright[];   /* system copyright */



Re: panic(9): set panicstr atomically

2021-05-13 Thread Anton Lindqvist
On Wed, May 12, 2021 at 07:08:39PM -0500, Scott Cheloha wrote:
> Hi,
> 
> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
> 
> https://marc.info/?l=openbsd-tech=161908805925325=2
> 
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> 
> If we set panicstr atomically only one CPU will write panicbuf.
> 
> Thoughts?

I've seen panics caused by syzkaller where panicbuf looks scrambled by
more than one thread writing to the same static buffer. Assigning
panicstr before the vsnprintf() call therefore makes sense. This is also
what NetBSD does, although not as an atomic operation.

ok anton@



Re: panic(9): set panicstr atomically

2021-05-12 Thread Theo de Raadt
Nicer than the garble...

It would be nice if we could see all the panics.  Could we also have a
per-cpu panic buffer, and then adapt ddb to show them all?

Scott Cheloha  wrote:

> In a separate mail thread, bluhm@ mentioned that panic(9) does not
> cleanly handle multiple CPUs entering it simultaneously:
> 
> https://marc.info/?l=openbsd-tech=161908805925325=2
> 
> I'm unsure which part of panic(9) is causing the problem he mentions,
> but one obvious issue I see is that panicstr is not set atomically,
> so two CPUs entering panic(9) simultaneously may clobber panicbuf.
> 
> If we set panicstr atomically only one CPU will write panicbuf.
> 
> Thoughts?
> 
> Index: kern/subr_prf.c
> ===
> RCS file: /cvs/src/sys/kern/subr_prf.c,v
> retrieving revision 1.102
> diff -u -p -r1.102 subr_prf.c
> --- kern/subr_prf.c   28 Nov 2020 17:53:05 -  1.102
> +++ kern/subr_prf.c   13 May 2021 00:04:28 -
> @@ -97,7 +97,7 @@ struct mutex kprintf_mutex =
>   */
>  
>  extern   int log_open;   /* subr_log: is /dev/klog open? */
> -constchar *panicstr; /* arg to first call to panic (used as a flag
> +volatile const char *panicstr; /* arg to first call to panic (used as a flag
>  to indicate that panic has already been called). */
>  constchar *faultstr; /* page fault string */
>  #ifdef DDB
> @@ -195,12 +195,10 @@ panic(const char *fmt, ...)
>  
>   bootopt = RB_AUTOBOOT | RB_DUMP;
>   va_start(ap, fmt);
> - if (panicstr)
> + if (atomic_cas_ptr(, NULL, panicbuf) != NULL)
>   bootopt |= RB_NOSYNC;
> - else {
> + else
>   vsnprintf(panicbuf, sizeof panicbuf, fmt, ap);
> - panicstr = panicbuf;
> - }
>   va_end(ap);
>  
>   printf("panic: ");
> Index: sys/systm.h
> ===
> RCS file: /cvs/src/sys/sys/systm.h,v
> retrieving revision 1.153
> diff -u -p -r1.153 systm.h
> --- sys/systm.h   28 Apr 2021 09:42:04 -  1.153
> +++ sys/systm.h   13 May 2021 00:04:28 -
> @@ -71,7 +71,7 @@
>   * patched by a stalking hacker.
>   */
>  extern int securelevel;  /* system security level */
> -extern const char *panicstr; /* panic message */
> +extern volatile const char *panicstr;/* panic message */
>  extern const char *faultstr; /* fault message */
>  extern const char version[]; /* system version */
>  extern const char copyright[];   /* system copyright */
>