Re: panic at reboot - tsc_test_sync_ap

2023-01-09 Thread Pedro Caetano
Hi Scott, all,

Thank you for the follow up.
I haven't tried this diff yet. (This routers support an application that
doesn't allow much downtime)

I'm planning on testing this whenever a new patch forces me to reboot.
Worst case on next version upgrade.

Best,
Pedro

A segunda, 9/01/2023, 16:26, Scott Cheloha 
escreveu:

> Pedro,
>
> On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote:
> > On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote:
> > > Hi bugs@
> > >
> > > In the process of upgrading a pair of servers to release 7.2, the
> following
> > > panic was triggered after sysupgrade reboot. (dell poweredge R740)
> > >
> > > One of the reboots happened before syspatch, the other happened after
> > > applying the release patches.
> > >
> > > After powercycling, both servers managed to boot successfully.
> > >
> > > Please keep me copied as I'm not subscribed to bugs@
> > >
> > >
> > > Screenshot of the panic uploaded attached to this email.
> >
> > For reference:
> >
> > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB
> 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache
> > cpu2: smt 0, core 5, package 0
> > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
> > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at
> mainbus0: apid 26 (application process
> >
> > Somehow your machine is violating one of the TSC sync test sanity
> > checks.  The idea behind this one is that there should only be one AP
> > in the sync test at a time.
> >
> > At the start of each test, in tsc_test_sync_ap(), the AP sets
> > tsc_ap_name to its dv_xname.  It does this with an atomic CAS
> > expecting NULL to ensure no other AP is still running the sync test.
> > You're hitting this panic:
> >
> >449  void
> >450  tsc_test_sync_ap(struct cpu_info *ci)
> >451  {
> >452  if (!tsc_is_invariant)
> >453  return;
> >454  #ifndef TSC_DEBUG
> >455  if (!tsc_is_synchronized)
> >456  return;
> >457  #endif
> >458  /* The BP needs our name in order to report any
> problems. */
> >459  if (atomic_cas_ptr(_ap_name, NULL,
> ci->ci_dev->dv_xname) != NULL) {
> >460  panic("%s: %s: tsc_ap_name is not NULL: %s",
> >461  __func__, ci->ci_dev->dv_xname, tsc_ap_name);
> >462  }
> >
> > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
> > every sync test, from tsc_test_sync_bp():
> >
> >415  /*
> >416   * Report what happened.  Adjust the TSC's
> quality
> >417   * if this is the first time we've failed the
> test.
> >418   */
> >419  tsc_report_test_results();
> >420  if (tsc_ap_status.lag_count ||
> tsc_bp_status.lag_count) {
> >421  if (tsc_is_synchronized) {
> >422  tsc_is_synchronized = 0;
> >423
> tc_reset_quality(_timecounter, -1000);
> >424  }
> >425  tsc_test_rounds = 0;
> >426  } else
> >427  tsc_test_rounds--;
> >428
> >429  /*
> >430   * Clean up for the next round.  It is safe to
> reset the
> >431   * ingress barrier because at this point we know
> the AP
> >432   * has reached the egress barrier.
> >433   */
> >434  memset(_ap_status, 0, sizeof tsc_ap_status);
> >435  memset(_bp_status, 0, sizeof tsc_bp_status);
> >436  tsc_ingress_barrier = 0;
> >437  if (tsc_test_rounds == 0)
> >438  tsc_ap_name = NULL;
> >
> > It's possible the BP's store:
> >
> >   tsc_ap_name = NULL;
> >
> > is not *always* globally visible by the time the next AP reaches the
> > tsc_ap_name CAS, triggering the panic.  If so, we could force the
> > store to complete with membar_producer().  tsc_ap_name should be
> > volatile, too.
> >
> > OTOH, it's possible this particular check is not the right thing here.
> > My intention is correct... we definitely don't want more than one AP
> > in the sync test at any given moment.  But this tsc_ap_name handshake
> > thing may be the wrong way to assert that.
>
> Just following up on this.  Have you seen the panic you reported again?
>


Re: panic at reboot - tsc_test_sync_ap

2023-01-09 Thread Scott Cheloha
Pedro,

On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote:
> On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote:
> > Hi bugs@
> > 
> > In the process of upgrading a pair of servers to release 7.2, the following
> > panic was triggered after sysupgrade reboot. (dell poweredge R740)
> > 
> > One of the reboots happened before syspatch, the other happened after
> > applying the release patches.
> > 
> > After powercycling, both servers managed to boot successfully.
> > 
> > Please keep me copied as I'm not subscribed to bugs@
> > 
> > 
> > Screenshot of the panic uploaded attached to this email.
> 
> For reference:
> 
> cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
> 16-way L2 cache, 8MB 64b/line 11-way L3 cache
> cpu2: smt 0, core 5, package 0
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
> panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: 
> apid 26 (application process
> 
> Somehow your machine is violating one of the TSC sync test sanity
> checks.  The idea behind this one is that there should only be one AP
> in the sync test at a time.
> 
> At the start of each test, in tsc_test_sync_ap(), the AP sets
> tsc_ap_name to its dv_xname.  It does this with an atomic CAS
> expecting NULL to ensure no other AP is still running the sync test.
> You're hitting this panic:
> 
>449  void
>450  tsc_test_sync_ap(struct cpu_info *ci)
>451  {
>452  if (!tsc_is_invariant)
>453  return;
>454  #ifndef TSC_DEBUG
>455  if (!tsc_is_synchronized)
>456  return;
>457  #endif
>458  /* The BP needs our name in order to report any problems. */
>459  if (atomic_cas_ptr(_ap_name, NULL, ci->ci_dev->dv_xname) 
> != NULL) {
>460  panic("%s: %s: tsc_ap_name is not NULL: %s",
>461  __func__, ci->ci_dev->dv_xname, tsc_ap_name);
>462  }
> 
> The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
> every sync test, from tsc_test_sync_bp():
> 
>415  /*
>416   * Report what happened.  Adjust the TSC's quality
>417   * if this is the first time we've failed the test.
>418   */
>419  tsc_report_test_results();
>420  if (tsc_ap_status.lag_count || 
> tsc_bp_status.lag_count) {
>421  if (tsc_is_synchronized) {
>422  tsc_is_synchronized = 0;
>423  tc_reset_quality(_timecounter, 
> -1000);
>424  }
>425  tsc_test_rounds = 0;
>426  } else
>427  tsc_test_rounds--;
>428
>429  /*
>430   * Clean up for the next round.  It is safe to reset 
> the
>431   * ingress barrier because at this point we know the 
> AP
>432   * has reached the egress barrier.
>433   */
>434  memset(_ap_status, 0, sizeof tsc_ap_status);
>435  memset(_bp_status, 0, sizeof tsc_bp_status);
>436  tsc_ingress_barrier = 0;
>437  if (tsc_test_rounds == 0)
>438  tsc_ap_name = NULL;
> 
> It's possible the BP's store:
> 
>   tsc_ap_name = NULL;
> 
> is not *always* globally visible by the time the next AP reaches the
> tsc_ap_name CAS, triggering the panic.  If so, we could force the
> store to complete with membar_producer().  tsc_ap_name should be
> volatile, too.
> 
> OTOH, it's possible this particular check is not the right thing here.
> My intention is correct... we definitely don't want more than one AP
> in the sync test at any given moment.  But this tsc_ap_name handshake
> thing may be the wrong way to assert that.

Just following up on this.  Have you seen the panic you reported again?



Re: panic at reboot - tsc_test_sync_ap

2022-12-14 Thread Scott Cheloha
On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote:
> Hi bugs@
> 
> In the process of upgrading a pair of servers to release 7.2, the following
> panic was triggered after sysupgrade reboot. (dell poweredge R740)
> 
> One of the reboots happened before syspatch, the other happened after
> applying the release patches.
> 
> After powercycling, both servers managed to boot successfully.
> 
> Please keep me copied as I'm not subscribed to bugs@
> 
> 
> Screenshot of the panic uploaded attached to this email.

For reference:

cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 
16-way L2 cache, 8MB 64b/line 11-way L3 cache
cpu2: smt 0, core 5, package 0
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1
panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: 
apid 26 (application process

Somehow your machine is violating one of the TSC sync test sanity
checks.  The idea behind this one is that there should only be one AP
in the sync test at a time.

At the start of each test, in tsc_test_sync_ap(), the AP sets
tsc_ap_name to its dv_xname.  It does this with an atomic CAS
expecting NULL to ensure no other AP is still running the sync test.
You're hitting this panic:

   449  void
   450  tsc_test_sync_ap(struct cpu_info *ci)
   451  {
   452  if (!tsc_is_invariant)
   453  return;
   454  #ifndef TSC_DEBUG
   455  if (!tsc_is_synchronized)
   456  return;
   457  #endif
   458  /* The BP needs our name in order to report any problems. */
   459  if (atomic_cas_ptr(_ap_name, NULL, ci->ci_dev->dv_xname) != 
NULL) {
   460  panic("%s: %s: tsc_ap_name is not NULL: %s",
   461  __func__, ci->ci_dev->dv_xname, tsc_ap_name);
   462  }

The BP is supposed to reset tsc_ap_name to NULL at the conclusion of
every sync test, from tsc_test_sync_bp():

   415  /*
   416   * Report what happened.  Adjust the TSC's quality
   417   * if this is the first time we've failed the test.
   418   */
   419  tsc_report_test_results();
   420  if (tsc_ap_status.lag_count || tsc_bp_status.lag_count) 
{
   421  if (tsc_is_synchronized) {
   422  tsc_is_synchronized = 0;
   423  tc_reset_quality(_timecounter, 
-1000);
   424  }
   425  tsc_test_rounds = 0;
   426  } else
   427  tsc_test_rounds--;
   428
   429  /*
   430   * Clean up for the next round.  It is safe to reset the
   431   * ingress barrier because at this point we know the AP
   432   * has reached the egress barrier.
   433   */
   434  memset(_ap_status, 0, sizeof tsc_ap_status);
   435  memset(_bp_status, 0, sizeof tsc_bp_status);
   436  tsc_ingress_barrier = 0;
   437  if (tsc_test_rounds == 0)
   438  tsc_ap_name = NULL;

It's possible the BP's store:

tsc_ap_name = NULL;

is not *always* globally visible by the time the next AP reaches the
tsc_ap_name CAS, triggering the panic.  If so, we could force the
store to complete with membar_producer().  tsc_ap_name should be
volatile, too.

OTOH, it's possible this particular check is not the right thing here.
My intention is correct... we definitely don't want more than one AP
in the sync test at any given moment.  But this tsc_ap_name handshake
thing may be the wrong way to assert that.

Index: tsc.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.30
diff -u -p -r1.30 tsc.c
--- tsc.c   24 Oct 2022 00:56:33 -  1.30
+++ tsc.c   14 Dec 2022 18:12:54 -
@@ -372,7 +372,7 @@ struct tsc_test_status {
 struct tsc_test_status tsc_ap_status;  /* Test results from AP */
 struct tsc_test_status tsc_bp_status;  /* Test results from BP */
 uint64_t tsc_test_cycles;  /* [p] TSC cycles per test round */
-const char *tsc_ap_name;   /* [b] Name of AP running test */
+volatile const char *tsc_ap_name;  /* [b] Name of AP running test */
 volatile u_int tsc_egress_barrier; /* [a] Test end barrier */
 volatile u_int tsc_ingress_barrier;/* [a] Test start barrier */
 volatile u_int tsc_test_rounds;/* [p] Remaining test rounds */
@@ -434,8 +434,10 @@ tsc_test_sync_bp(struct cpu_info *ci)
memset(_ap_status, 0, sizeof tsc_ap_status);
memset(_bp_status, 0, sizeof tsc_bp_status);
tsc_ingress_barrier = 0;
-   if (tsc_test_rounds == 0)
+   if (tsc_test_rounds == 0) {
tsc_ap_name =