Re: panic at reboot - tsc_test_sync_ap
Hi Scott, all, Thank you for the follow up. I haven't tried this diff yet. (This routers support an application that doesn't allow much downtime) I'm planning on testing this whenever a new patch forces me to reboot. Worst case on next version upgrade. Best, Pedro A segunda, 9/01/2023, 16:26, Scott Cheloha escreveu: > Pedro, > > On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote: > > On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote: > > > Hi bugs@ > > > > > > In the process of upgrading a pair of servers to release 7.2, the > following > > > panic was triggered after sysupgrade reboot. (dell poweredge R740) > > > > > > One of the reboots happened before syspatch, the other happened after > > > applying the release patches. > > > > > > After powercycling, both servers managed to boot successfully. > > > > > > Please keep me copied as I'm not subscribed to bugs@ > > > > > > > > > Screenshot of the panic uploaded attached to this email. > > > > For reference: > > > > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB > 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache > > cpu2: smt 0, core 5, package 0 > > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1 > > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at > mainbus0: apid 26 (application process > > > > Somehow your machine is violating one of the TSC sync test sanity > > checks. The idea behind this one is that there should only be one AP > > in the sync test at a time. > > > > At the start of each test, in tsc_test_sync_ap(), the AP sets > > tsc_ap_name to its dv_xname. It does this with an atomic CAS > > expecting NULL to ensure no other AP is still running the sync test. > > You're hitting this panic: > > > >449 void > >450 tsc_test_sync_ap(struct cpu_info *ci) > >451 { > >452 if (!tsc_is_invariant) > >453 return; > >454 #ifndef TSC_DEBUG > >455 if (!tsc_is_synchronized) > >456 return; > >457 #endif > >458 /* The BP needs our name in order to report any > problems. */ > >459 if (atomic_cas_ptr(_ap_name, NULL, > ci->ci_dev->dv_xname) != NULL) { > >460 panic("%s: %s: tsc_ap_name is not NULL: %s", > >461 __func__, ci->ci_dev->dv_xname, tsc_ap_name); > >462 } > > > > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of > > every sync test, from tsc_test_sync_bp(): > > > >415 /* > >416 * Report what happened. Adjust the TSC's > quality > >417 * if this is the first time we've failed the > test. > >418 */ > >419 tsc_report_test_results(); > >420 if (tsc_ap_status.lag_count || > tsc_bp_status.lag_count) { > >421 if (tsc_is_synchronized) { > >422 tsc_is_synchronized = 0; > >423 > tc_reset_quality(_timecounter, -1000); > >424 } > >425 tsc_test_rounds = 0; > >426 } else > >427 tsc_test_rounds--; > >428 > >429 /* > >430 * Clean up for the next round. It is safe to > reset the > >431 * ingress barrier because at this point we know > the AP > >432 * has reached the egress barrier. > >433 */ > >434 memset(_ap_status, 0, sizeof tsc_ap_status); > >435 memset(_bp_status, 0, sizeof tsc_bp_status); > >436 tsc_ingress_barrier = 0; > >437 if (tsc_test_rounds == 0) > >438 tsc_ap_name = NULL; > > > > It's possible the BP's store: > > > > tsc_ap_name = NULL; > > > > is not *always* globally visible by the time the next AP reaches the > > tsc_ap_name CAS, triggering the panic. If so, we could force the > > store to complete with membar_producer(). tsc_ap_name should be > > volatile, too. > > > > OTOH, it's possible this particular check is not the right thing here. > > My intention is correct... we definitely don't want more than one AP > > in the sync test at any given moment. But this tsc_ap_name handshake > > thing may be the wrong way to assert that. > > Just following up on this. Have you seen the panic you reported again? >
Re: panic at reboot - tsc_test_sync_ap
Pedro, On Wed, Dec 14, 2022 at 12:24:42PM -0600, Scott Cheloha wrote: > On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote: > > Hi bugs@ > > > > In the process of upgrading a pair of servers to release 7.2, the following > > panic was triggered after sysupgrade reboot. (dell poweredge R740) > > > > One of the reboots happened before syspatch, the other happened after > > applying the release patches. > > > > After powercycling, both servers managed to boot successfully. > > > > Please keep me copied as I'm not subscribed to bugs@ > > > > > > Screenshot of the panic uploaded attached to this email. > > For reference: > > cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line > 16-way L2 cache, 8MB 64b/line 11-way L3 cache > cpu2: smt 0, core 5, package 0 > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1 > panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: > apid 26 (application process > > Somehow your machine is violating one of the TSC sync test sanity > checks. The idea behind this one is that there should only be one AP > in the sync test at a time. > > At the start of each test, in tsc_test_sync_ap(), the AP sets > tsc_ap_name to its dv_xname. It does this with an atomic CAS > expecting NULL to ensure no other AP is still running the sync test. > You're hitting this panic: > >449 void >450 tsc_test_sync_ap(struct cpu_info *ci) >451 { >452 if (!tsc_is_invariant) >453 return; >454 #ifndef TSC_DEBUG >455 if (!tsc_is_synchronized) >456 return; >457 #endif >458 /* The BP needs our name in order to report any problems. */ >459 if (atomic_cas_ptr(_ap_name, NULL, ci->ci_dev->dv_xname) > != NULL) { >460 panic("%s: %s: tsc_ap_name is not NULL: %s", >461 __func__, ci->ci_dev->dv_xname, tsc_ap_name); >462 } > > The BP is supposed to reset tsc_ap_name to NULL at the conclusion of > every sync test, from tsc_test_sync_bp(): > >415 /* >416 * Report what happened. Adjust the TSC's quality >417 * if this is the first time we've failed the test. >418 */ >419 tsc_report_test_results(); >420 if (tsc_ap_status.lag_count || > tsc_bp_status.lag_count) { >421 if (tsc_is_synchronized) { >422 tsc_is_synchronized = 0; >423 tc_reset_quality(_timecounter, > -1000); >424 } >425 tsc_test_rounds = 0; >426 } else >427 tsc_test_rounds--; >428 >429 /* >430 * Clean up for the next round. It is safe to reset > the >431 * ingress barrier because at this point we know the > AP >432 * has reached the egress barrier. >433 */ >434 memset(_ap_status, 0, sizeof tsc_ap_status); >435 memset(_bp_status, 0, sizeof tsc_bp_status); >436 tsc_ingress_barrier = 0; >437 if (tsc_test_rounds == 0) >438 tsc_ap_name = NULL; > > It's possible the BP's store: > > tsc_ap_name = NULL; > > is not *always* globally visible by the time the next AP reaches the > tsc_ap_name CAS, triggering the panic. If so, we could force the > store to complete with membar_producer(). tsc_ap_name should be > volatile, too. > > OTOH, it's possible this particular check is not the right thing here. > My intention is correct... we definitely don't want more than one AP > in the sync test at any given moment. But this tsc_ap_name handshake > thing may be the wrong way to assert that. Just following up on this. Have you seen the panic you reported again?
Re: panic at reboot - tsc_test_sync_ap
On Wed, Dec 14, 2022 at 11:37:14AM +, Pedro Caetano wrote: > Hi bugs@ > > In the process of upgrading a pair of servers to release 7.2, the following > panic was triggered after sysupgrade reboot. (dell poweredge R740) > > One of the reboots happened before syspatch, the other happened after > applying the release patches. > > After powercycling, both servers managed to boot successfully. > > Please keep me copied as I'm not subscribed to bugs@ > > > Screenshot of the panic uploaded attached to this email. For reference: cpu2: 32KB 64B/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 1MB 64b/line 16-way L2 cache, 8MB 64b/line 11-way L3 cache cpu2: smt 0, core 5, package 0 panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1 panic: tsc_test_sync_ap: cpu2: tsc_ap_name is not NULL: cpu1cpu3 at mainbus0: apid 26 (application process Somehow your machine is violating one of the TSC sync test sanity checks. The idea behind this one is that there should only be one AP in the sync test at a time. At the start of each test, in tsc_test_sync_ap(), the AP sets tsc_ap_name to its dv_xname. It does this with an atomic CAS expecting NULL to ensure no other AP is still running the sync test. You're hitting this panic: 449 void 450 tsc_test_sync_ap(struct cpu_info *ci) 451 { 452 if (!tsc_is_invariant) 453 return; 454 #ifndef TSC_DEBUG 455 if (!tsc_is_synchronized) 456 return; 457 #endif 458 /* The BP needs our name in order to report any problems. */ 459 if (atomic_cas_ptr(_ap_name, NULL, ci->ci_dev->dv_xname) != NULL) { 460 panic("%s: %s: tsc_ap_name is not NULL: %s", 461 __func__, ci->ci_dev->dv_xname, tsc_ap_name); 462 } The BP is supposed to reset tsc_ap_name to NULL at the conclusion of every sync test, from tsc_test_sync_bp(): 415 /* 416 * Report what happened. Adjust the TSC's quality 417 * if this is the first time we've failed the test. 418 */ 419 tsc_report_test_results(); 420 if (tsc_ap_status.lag_count || tsc_bp_status.lag_count) { 421 if (tsc_is_synchronized) { 422 tsc_is_synchronized = 0; 423 tc_reset_quality(_timecounter, -1000); 424 } 425 tsc_test_rounds = 0; 426 } else 427 tsc_test_rounds--; 428 429 /* 430 * Clean up for the next round. It is safe to reset the 431 * ingress barrier because at this point we know the AP 432 * has reached the egress barrier. 433 */ 434 memset(_ap_status, 0, sizeof tsc_ap_status); 435 memset(_bp_status, 0, sizeof tsc_bp_status); 436 tsc_ingress_barrier = 0; 437 if (tsc_test_rounds == 0) 438 tsc_ap_name = NULL; It's possible the BP's store: tsc_ap_name = NULL; is not *always* globally visible by the time the next AP reaches the tsc_ap_name CAS, triggering the panic. If so, we could force the store to complete with membar_producer(). tsc_ap_name should be volatile, too. OTOH, it's possible this particular check is not the right thing here. My intention is correct... we definitely don't want more than one AP in the sync test at any given moment. But this tsc_ap_name handshake thing may be the wrong way to assert that. Index: tsc.c === RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v retrieving revision 1.30 diff -u -p -r1.30 tsc.c --- tsc.c 24 Oct 2022 00:56:33 - 1.30 +++ tsc.c 14 Dec 2022 18:12:54 - @@ -372,7 +372,7 @@ struct tsc_test_status { struct tsc_test_status tsc_ap_status; /* Test results from AP */ struct tsc_test_status tsc_bp_status; /* Test results from BP */ uint64_t tsc_test_cycles; /* [p] TSC cycles per test round */ -const char *tsc_ap_name; /* [b] Name of AP running test */ +volatile const char *tsc_ap_name; /* [b] Name of AP running test */ volatile u_int tsc_egress_barrier; /* [a] Test end barrier */ volatile u_int tsc_ingress_barrier;/* [a] Test start barrier */ volatile u_int tsc_test_rounds;/* [p] Remaining test rounds */ @@ -434,8 +434,10 @@ tsc_test_sync_bp(struct cpu_info *ci) memset(_ap_status, 0, sizeof tsc_ap_status); memset(_bp_status, 0, sizeof tsc_bp_status); tsc_ingress_barrier = 0; - if (tsc_test_rounds == 0) + if (tsc_test_rounds == 0) { tsc_ap_name =