Re: Strange issue after early AP startup

2017-01-22 Thread Hans Petter Selasky

On 01/23/17 06:27, Jia-Shiun Li wrote:

On Sat, Jan 21, 2017 at 3:51 PM, Hans Petter Selasky 
wrote:


FYI:

https://svnweb.freebsd.org/changeset/base/312551



Hi hps,

sorry I have to correct my test results.
I found that I did not revert changes to kernel config,
which commented out EARLY_AP_STARTUP option,
before testing you patches. So they were tested without
EARLY_AP_STARTUP.

I tested again your patches with kernel config reverted,
and they (and head as of r312620) did not solve the c2d
system time lag issue.

vmstat results attached.
And also my apology for this stupid mistake.



This might be another issue, not related to my patch.

--HPS

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-22 Thread Jia-Shiun Li
On Sat, Jan 21, 2017 at 3:51 PM, Hans Petter Selasky 
wrote:

> FYI:
>
> https://svnweb.freebsd.org/changeset/base/312551
>
>
Hi hps,

sorry I have to correct my test results.
I found that I did not revert changes to kernel config,
which commented out EARLY_AP_STARTUP option,
before testing you patches. So they were tested without
EARLY_AP_STARTUP.

I tested again your patches with kernel config reverted,
and they (and head as of r312620) did not solve the c2d
system time lag issue.

vmstat results attached.
And also my apology for this stupid mistake.

-Jia-Shiun.
interrupt  total   rate
???0  0
irq1: atkbd0   2  0
stray irq1 0  0
irq0: attimer0 0  0
stray irq0 0  0
irq3:  0  0
stray irq3 0  0
irq4: uart00  0
stray irq4 0  0
irq5:  0  0
stray irq5 0  0
irq6:  0  0
stray irq6 0  0
irq7:  0  0
stray irq7 0  0
irq8: atrtc0   0  0
stray irq8 0  0
irq9: acpi00  0
stray irq9 0  0
irq10: 0  0
stray irq100  0
irq11: 0  0
stray irq110  0
irq12: 0  0
stray irq120  0
irq13: 0  0
stray irq130  0
irq14: 0  0
stray irq140  0
irq15: 0  0
stray irq150  0
irq16: em0:irq0++ 18  0
stray irq160  0
irq17: 0  0
stray irq170  0
irq18: uhci2 ehci0+   18  0
stray irq180  0
irq19: uhci4   0  0
stray irq190  0
irq20: hpet0   12503277
stray irq200  0
irq21: uhci1   0  0
stray irq210  0
irq22: 0  0
stray irq220  0
irq23: uhci3 ehci1 0  0
stray irq230  0
cpu0:timer 0  0
cpu1:timer 0  0
irq256: hdac0 97  2
stray irq256   0  0
irq257: pcib1  0  0
stray irq257   0  0
irq258: pcib2  0  0
stray irq258   0  0
irq259: pcib3  0  0
stray irq259   0  0
irq260: re0 9049201
stray irq260   0  0
irq261: ahci0:ch0   4186 93
stray irq261   0  0
irq262: ahci0:ch1  0  0
stray irq262   0  0
irq263: ahci0:ch2  0  0
stray irq263   0  0
irq264: ahci0:ch3  0  0
stray irq264   0  0
irq265: ahci0:ch4  0  0
stray irq265   0  0
irq266: ahci0:ch5  0  0
stray irq266   0  0
irq267: ahci0:60  0
stray irq267   0  0
irq268: ahci0:70  0
stray irq268   0  0
irq269: ahci0:80  0
stray irq269   0  0
irq270: ahci0:90  0
stray irq270   0  0
irq271: ahci0:10   0  0
stray irq271   0  0
irq272: ahci0:11   0  0
stray irq272   0  0
irq273: ahci0:12   0 

Re: Strange issue after early AP startup

2017-01-20 Thread Hans Petter Selasky

FYI:

https://svnweb.freebsd.org/changeset/base/312551

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-19 Thread Jia-Shiun Li
On Thu, Jan 19, 2017 at 4:23 PM, Hans Petter Selasky 
wrote:

>
> I can add prints/asserts to show that what happens is that
> "state->nextcallopt > now" while "state->nextcall <= now". This situtation
> is allowed to persist due to the way getnextcpuevent() is currently
> implemented.
>
> Can the people CC'ed give the attached patch a spin and report back?
>
>
As far as c2d system time is concerned, it works correctly for me w/
r312210.

-Jia-Shiun.
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-19 Thread Hans Petter Selasky

On 01/18/17 09:00, Hans Petter Selasky wrote:

On 01/18/17 02:18, John Baldwin wrote:

You might still want to adjust 'nextevent' to schedule the next interrupt
to be sooner than 'timerperiod' though.  You could just set
'nextevent' to
'now' in that case instead of 'next'.


Right, I'll give that a spin. Would have to be "now + 1" instead of
"now", due to check before et_start() ?



Hi John,

Here is another variant of my patch which solves the EARLY AP startup 
problem with timers. What do you think?


> diff --git a/sys/kern/kern_clocksource.c b/sys/kern/kern_clocksource.c
> index 7f7769d..8bacff6 100644
> --- a/sys/kern/kern_clocksource.c
> +++ b/sys/kern/kern_clocksource.c
> @@ -207,7 +207,7 @@ handleevents(sbintime_t now, int fake)
> }
> } else
> state->nextprof = state->nextstat;
> -   if (now >= state->nextcallopt) {
> +   if (now >= state->nextcallopt || now >= state->nextcall) {
> state->nextcall = state->nextcallopt = SBT_MAX;
> callout_process(now);
> }

I can add prints/asserts to show that what happens is that 
"state->nextcallopt > now" while "state->nextcall <= now". This 
situtation is allowed to persist due to the way getnextcpuevent() is 
currently implemented.


Can the people CC'ed give the attached patch a spin and report back?

--HPS
diff --git a/sys/kern/kern_clocksource.c b/sys/kern/kern_clocksource.c
index 7f7769d..8bacff6 100644
--- a/sys/kern/kern_clocksource.c
+++ b/sys/kern/kern_clocksource.c
@@ -207,7 +207,7 @@ handleevents(sbintime_t now, int fake)
 		}
 	} else
 		state->nextprof = state->nextstat;
-	if (now >= state->nextcallopt) {
+	if (now >= state->nextcallopt || now >= state->nextcall) {
 		state->nextcall = state->nextcallopt = SBT_MAX;
 		callout_process(now);
 	}
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Strange issue after early AP startup

2017-01-18 Thread Cy Schubert
In message <1922021.4hjeqfj...@ralph.baldwin.cx>, John Baldwin writes:
> On Tuesday, January 17, 2017 05:08:58 PM Cy Schubert wrote:
> > In message <1492450.xzfnz8z...@ralph.baldwin.cx>, John Baldwin writes:
> > > On Tuesday, January 17, 2017 12:53:19 PM Cy Schubert wrote:
> > > > In message , Hans Pet
> ter 
> > > > Sela
> > > > sky writes:
> > > > > Hi,
> > > > > 
> > > > > When booting I observe an additional 30-second delay after this print
> :
> > > > > 
> > > > > > Timecounters tick every 1.000 msec
> > > > > 
> > > > > ~30 second delay and boot continues like normal.
> > > > > 
> > > > > Checking "vmstat -i" reveals that some timers have been running loose
> .
> > > > > 
> > > > > > cpu0:timer 44300442
> > > > > > cpu1:timer 40561404
> > > > > > cpu3:timer  48462822 483058
> > > > > > cpu2:timer  48477898 483209
> > > > > 
> > > > > Trying to add delays and/or prints around the Timecounters printout 
> > > > > makes the issue go away. Any ideas for debugging?
> > > > > 
> > > > > Looks like a startup race to me.
> > > > 
> > > > just picking a random email to reply to, I'm seeing a different issue w
> ith 
> > > > early AP startup. It affects one of my four machines, my laptop. My thr
> ee 
> > > > server systems downstairs have no problem however my laptop will reboot
>  
> > > > repeatedly at:
> > > > 
> > > > Jan 17 11:55:16 slippy kernel: cd0: Attempt to query device size failed
> : 
> > > > NOT READY, Medium not present - tray closed
> > > 
> > > So it panics and reboots after this?
> > 
> > Yes, it goes into a panic/reboot loop for a few iterations until it 
> > successfully boots. Disabling early AP startup allows it to boot up without
>  
> > the assumed race.
> 
> Can you add DDB to the kernel config (and remove DDB_UNATTENDED) to get it
> to break into DDB when it panics to get the panic message (and a stack trace
> as well)?

I found and fixed the problem. It was in some code I had added a long time 
ago but not committed yet to the bge driver to implement WOL. It was a lock 
assertion.


-- 
Cheers,
Cy Schubert 
FreeBSD UNIX:     Web:  http://www.FreeBSD.org

The need of the many outweighs the greed of the few.


___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-18 Thread John Baldwin
On Wednesday, January 18, 2017 09:00:52 AM Hans Petter Selasky wrote:
> On 01/18/17 02:18, John Baldwin wrote:
> > You might still want to adjust 'nextevent' to schedule the next interrupt
> > to be sooner than 'timerperiod' though.  You could just set 'nextevent' to
> > 'now' in that case instead of 'next'.
> 
> Right, I'll give that a spin. Would have to be "now + 1" instead of 
> "now", due to check before et_start() ?

Ugh, ok.  + 1 with sbintime_t is kind of odd which is why I would have liked
to avoid it, but it seems it is required.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-18 Thread Hans Petter Selasky

On 01/18/17 02:18, John Baldwin wrote:

You might still want to adjust 'nextevent' to schedule the next interrupt
to be sooner than 'timerperiod' though.  You could just set 'nextevent' to
'now' in that case instead of 'next'.


Right, I'll give that a spin. Would have to be "now + 1" instead of 
"now", due to check before et_start() ?


--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-18 Thread Hans Petter Selasky

On 01/18/17 02:18, John Baldwin wrote:

Note that 'nextevent' remains a full 'timerperiod' out (now + timerperiod)
and so the first clock interrupt is still 'timerperiod' time away and
any callouts are delayed by that amount of time.  Also, I think you could
set nextcallopt to 'now' rather than 'now + 1'.


Hi,

Does that mean the following piece of code is missing from getnextevent():


/* Handle callout events. */
if (event > state->nextcall)
event = state->nextcall;


Like getnextcpuevent() is doing?

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

Hi,

On 01/18/17 02:18, John Baldwin wrote:

Also, I think you could
set nextcallopt to 'now' rather than 'now + 1'.


There is a check in loadtimer() if next == now, and then the event timer 
is not started ??



} else {
new = getnextevent();
eq = (new == *next);
CTR4(KTR_SPARE2, "load at %d:next %d.%08x eq %d",
curcpu, (int)(new >> 32), (u_int)(new & 0x), eq);
if (!eq) {
*next = new;
et_start(timer, new - now, 0);
}
}



--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 05:08:58 PM Cy Schubert wrote:
> In message <1492450.xzfnz8z...@ralph.baldwin.cx>, John Baldwin writes:
> > On Tuesday, January 17, 2017 12:53:19 PM Cy Schubert wrote:
> > > In message , Hans 
> > > Petter 
> > > Sela
> > > sky writes:
> > > > Hi,
> > > > 
> > > > When booting I observe an additional 30-second delay after this print:
> > > > 
> > > > > Timecounters tick every 1.000 msec
> > > > 
> > > > ~30 second delay and boot continues like normal.
> > > > 
> > > > Checking "vmstat -i" reveals that some timers have been running loose.
> > > > 
> > > > > cpu0:timer 44300442
> > > > > cpu1:timer 40561404
> > > > > cpu3:timer  48462822 483058
> > > > > cpu2:timer  48477898 483209
> > > > 
> > > > Trying to add delays and/or prints around the Timecounters printout 
> > > > makes the issue go away. Any ideas for debugging?
> > > > 
> > > > Looks like a startup race to me.
> > > 
> > > just picking a random email to reply to, I'm seeing a different issue 
> > > with 
> > > early AP startup. It affects one of my four machines, my laptop. My three 
> > > server systems downstairs have no problem however my laptop will reboot 
> > > repeatedly at:
> > > 
> > > Jan 17 11:55:16 slippy kernel: cd0: Attempt to query device size failed: 
> > > NOT READY, Medium not present - tray closed
> > 
> > So it panics and reboots after this?
> 
> Yes, it goes into a panic/reboot loop for a few iterations until it 
> successfully boots. Disabling early AP startup allows it to boot up without 
> the assumed race.

Can you add DDB to the kernel config (and remove DDB_UNATTENDED) to get it
to break into DDB when it panics to get the panic message (and a stack trace
as well)?

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 10:35:06 PM Hans Petter Selasky wrote:
> On 01/17/17 22:28, Hans Petter Selasky wrote:
> > +   state->nextcall = SBT_MAX;
> > +   state->nextcallopt = now + 1;
> 
> BTW: What locks are protecting the update of these fields? Can they be 
> written simultaneously by configtimer() and cpu_new_callout()?

Both functions do ET_HW_LOCK() of DPCPU_PTR(timerstate).

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 10:28:47 PM Hans Petter Selasky wrote:
> On 01/17/17 20:46, Ian Lepore wrote:
> >>> Does this matter for the first tick? How often is configtimer() called?
> >> >
> >> > As I said, it is called at runtime when profclock is started / stopped, 
> >> > not
> >> > just at boot.  Those changes at runtime probably have existing callouts
> >> > active and your change will not process any callouts until the next 
> >> > hardclock
> >> > tick fires (but only because you are setting nextcallopt to the bogus
> >> > 'next' value).
> > On some platforms, configtimer() can be called quite often.  Power
> > saving modes can change the frequency of the timer, and systems that
> > suppport such dynamic frequency scaling call configtimer()
> > (via cpu_et_frequency()) to handle the changes.
> 
> Hi,
> 
> I propose the following patch then:
> 
> diff --git a/sys/kern/kern_clocksource.c b/sys/kern/kern_clocksource.c
> index 7f7769d..5ae925b 100644
> --- a/sys/kern/kern_clocksource.c
> +++ b/sys/kern/kern_clocksource.c
> @@ -511,8 +511,13 @@ configtimer(int start)
>  state->nexthard = next;
>  state->nextstat = next;
>  state->nextprof = next;
> -   state->nextcall = next;
> -   state->nextcallopt = next;
> +   /*
> +* Force callout_process() to be called
> +* instantly, so that the correct value of
> +* "nextcall" can be computed:
> +*/
> +   state->nextcall = SBT_MAX;
> +   state->nextcallopt = now + 1;
>  hardclock_sync(cpu);
>  }
>  busy = 0;
> 
> 
> Then there is no problem having to wait for the next tick or anything, 
> like John Baldwin pointed out.

Note that 'nextevent' remains a full 'timerperiod' out (now + timerperiod)
and so the first clock interrupt is still 'timerperiod' time away and
any callouts are delayed by that amount of time.  Also, I think you could
set nextcallopt to 'now' rather than 'now + 1'.

You might still want to adjust 'nextevent' to schedule the next interrupt
to be sooner than 'timerperiod' though.  You could just set 'nextevent' to
'now' in that case instead of 'next'.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Cy Schubert
In message <1492450.xzfnz8z...@ralph.baldwin.cx>, John Baldwin writes:
> On Tuesday, January 17, 2017 12:53:19 PM Cy Schubert wrote:
> > In message , Hans Petter 
> > Sela
> > sky writes:
> > > Hi,
> > > 
> > > When booting I observe an additional 30-second delay after this print:
> > > 
> > > > Timecounters tick every 1.000 msec
> > > 
> > > ~30 second delay and boot continues like normal.
> > > 
> > > Checking "vmstat -i" reveals that some timers have been running loose.
> > > 
> > > > cpu0:timer 44300442
> > > > cpu1:timer 40561404
> > > > cpu3:timer  48462822 483058
> > > > cpu2:timer  48477898 483209
> > > 
> > > Trying to add delays and/or prints around the Timecounters printout 
> > > makes the issue go away. Any ideas for debugging?
> > > 
> > > Looks like a startup race to me.
> > 
> > just picking a random email to reply to, I'm seeing a different issue with 
> > early AP startup. It affects one of my four machines, my laptop. My three 
> > server systems downstairs have no problem however my laptop will reboot 
> > repeatedly at:
> > 
> > Jan 17 11:55:16 slippy kernel: cd0: Attempt to query device size failed: 
> > NOT READY, Medium not present - tray closed
> 
> So it panics and reboots after this?

Yes, it goes into a panic/reboot loop for a few iterations until it 
successfully boots. Disabling early AP startup allows it to boot up without 
the assumed race.


-- 
Cheers,
Cy Schubert 
FreeBSD UNIX:     Web:  http://www.FreeBSD.org

The need of the many outweighs the greed of the few.


___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 12:53:19 PM Cy Schubert wrote:
> In message , Hans Petter 
> Sela
> sky writes:
> > Hi,
> > 
> > When booting I observe an additional 30-second delay after this print:
> > 
> > > Timecounters tick every 1.000 msec
> > 
> > ~30 second delay and boot continues like normal.
> > 
> > Checking "vmstat -i" reveals that some timers have been running loose.
> > 
> > > cpu0:timer 44300442
> > > cpu1:timer 40561404
> > > cpu3:timer  48462822 483058
> > > cpu2:timer  48477898 483209
> > 
> > Trying to add delays and/or prints around the Timecounters printout 
> > makes the issue go away. Any ideas for debugging?
> > 
> > Looks like a startup race to me.
> 
> just picking a random email to reply to, I'm seeing a different issue with 
> early AP startup. It affects one of my four machines, my laptop. My three 
> server systems downstairs have no problem however my laptop will reboot 
> repeatedly at:
> 
> Jan 17 11:55:16 slippy kernel: cd0: Attempt to query device size failed: 
> NOT READY, Medium not present - tray closed

So it panics and reboots after this?

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 08:31:28 PM Hans Petter Selasky wrote:
> Hi,
> 
> On 01/17/17 20:00, John Baldwin wrote:
> >>
> >> Does this matter for the first tick? How often is configtimer() called?
> >
> > As I said, it is called at runtime when profclock is started / stopped, not
> > just at boot.  Those changes at runtime probably have existing callouts
> > active and your change will not process any callouts until the next 
> > hardclock
> > tick fires (but only because you are setting nextcallopt to the bogus
> > 'next' value).
> 
> 
> 
> >>  > (One odd thing is that even in your case the first call to
> >> handleevents(),
> >>  > the 'now => state->nextcallout' check in handleevents() should be true
> >>  > which resets both nextcall and nextcallopt and invokes
> >> callout_process().)
> >>
> >> Let me take you through the failure path, by code inspection:
> >
> > I would really appreciate it if you could add traces to find out what 
> > actually
> > happens rather than what seems to happen by looking at the code. :-/
> 
> The problem is that once you add some prints, the problem goes away. 
> Maybe I should try to set hz to 100 or 25 ???

Maybe use KTR instead and then have a kdb_enter() later in boot so you can use
'show ktr' in DDB.  (KTR has less overhead than printfs so might not disrupt
the timing as badly.)

> >
> > 0) cpu_initclocks_bsp() is called and init's nextcall and nexcallopt to 
> > SBT_MAX
> >similar to your change.  If no callout is scheduled before configtimer()
> >then they remain set to SBT_MAX.  Your current patch happens to trigger a
> >(bogus) call to callout_process() on the first hardclock() because it
> >sets nextcallopt to 'next' even though no callout is actually scheduled 
> > to
> >fire at time 'next'.
> >
> >> 1) configtimer() is called and we init nextcall and nextcallopt:
> >>
> >>  > next = now + timerperiod;
> >> ...
> >>  >   state->nextcall = next;
> >>  >   state->nextcallopt = next;
> >
> > These both say "the next callout() should fire at 'next' which is the time 
> > of
> > the next hardclock()", even though there may be no callouts scheduled (in 
> > which
> > case both of these fields should be set to SBT_MAX from the call to
> > cpu_initclocks_bsp(), or there may be callouts scheduled in which case 
> > 'nextcall'
> > and 'nextcallopt' will reflect the time that those callouts are already
> > scheduled for and this overwrites that).
> 
> I see there are some callouts scheduled by SYSINITs, before the first 
> configtimer(), like NFS_TIMERINIT in nfs_init(). These are setup using 
> "dummy_timecounter" which means any nextcall values before the first 
> configtimer should be discarded.

Hmm, I actually tested early callouts by having callouts scheduled for 1, 4, and
8 seconds right after callouts were initialized and they worked correctly 
(albeit
using lapic timer as the eventtimer and TSC as the timecounter).  By the time 
they
were called though, sbinuptime() was not returning dummy values, but real ones
(probably because TSC gets added as a timecounter in SI_SUB_CPU).

The patch I used for testing is still in my work branch here:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:early_callout

So in at least some cases you don't have to discard nextcall during boot.

However, if TSC isn't available you might not get a timecounter until later
in boot during device probe in which case you would get dummy timecounter, and
then the nextcall/nextcallopt values aren't great.

Hmmm, I wonder if just bumping nextcall to be 'now' in case it is less than 
'now'
would be sufficient.  I think my previous patch still looped even though it 
might
have set 'next_event' correctly because the 'nextcall' value was still too 
small.

That is:

Index: kern_clocksource.c
===
--- kern_clocksource.c  (revision 312301)
+++ kern_clocksource.c  (working copy)
@@ -498,12 +498,18 @@ configtimer(int start)
CPU_FOREACH(cpu) {
state = DPCPU_ID_PTR(cpu, timerstate);
state->now = now;
+   printf("%s: CPU %d: now %jd nextcall %jd nextcallopt 
%jd next %jd\n", __func__, cpu, state->nextcall, state->nextcall, next);
+   if (state->nextcall < now)
+   state->nextcall = now;
 #ifndef EARLY_AP_STARTUP
if (!smp_started && cpu != CPU_FIRST())
state->nextevent = SBT_MAX;
else
 #endif
+   if (next < state->nextcall)
state->nextevent = next;
+   else
+   state->nextevent = state->nextcall;
if (periodic)
state->nexttick = next;
else
@@ -511,8 +517,6 @@ 

Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

On 01/17/17 22:28, Hans Petter Selasky wrote:

+   state->nextcall = SBT_MAX;
+   state->nextcallopt = now + 1;


BTW: What locks are protecting the update of these fields? Can they be 
written simultaneously by configtimer() and cpu_new_callout()?


--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

On 01/17/17 20:46, Ian Lepore wrote:

Does this matter for the first tick? How often is configtimer() called?

>
> As I said, it is called at runtime when profclock is started / stopped, not
> just at boot.  Those changes at runtime probably have existing callouts
> active and your change will not process any callouts until the next hardclock
> tick fires (but only because you are setting nextcallopt to the bogus
> 'next' value).

On some platforms, configtimer() can be called quite often.  Power
saving modes can change the frequency of the timer, and systems that
suppport such dynamic frequency scaling call configtimer()
(via cpu_et_frequency()) to handle the changes.


Hi,

I propose the following patch then:

diff --git a/sys/kern/kern_clocksource.c b/sys/kern/kern_clocksource.c
index 7f7769d..5ae925b 100644
--- a/sys/kern/kern_clocksource.c
+++ b/sys/kern/kern_clocksource.c
@@ -511,8 +511,13 @@ configtimer(int start)
state->nexthard = next;
state->nextstat = next;
state->nextprof = next;
-   state->nextcall = next;
-   state->nextcallopt = next;
+   /*
+* Force callout_process() to be called
+* instantly, so that the correct value of
+* "nextcall" can be computed:
+*/
+   state->nextcall = SBT_MAX;
+   state->nextcallopt = now + 1;
hardclock_sync(cpu);
}
busy = 0;


Then there is no problem having to wait for the next tick or anything, 
like John Baldwin pointed out.


--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Cy Schubert
In message , Hans Petter 
Sela
sky writes:
> Hi,
> 
> When booting I observe an additional 30-second delay after this print:
> 
> > Timecounters tick every 1.000 msec
> 
> ~30 second delay and boot continues like normal.
> 
> Checking "vmstat -i" reveals that some timers have been running loose.
> 
> > cpu0:timer 44300442
> > cpu1:timer 40561404
> > cpu3:timer  48462822 483058
> > cpu2:timer  48477898 483209
> 
> Trying to add delays and/or prints around the Timecounters printout 
> makes the issue go away. Any ideas for debugging?
> 
> Looks like a startup race to me.

just picking a random email to reply to, I'm seeing a different issue with 
early AP startup. It affects one of my four machines, my laptop. My three 
server systems downstairs have no problem however my laptop will reboot 
repeatedly at:

Jan 17 11:55:16 slippy kernel: cd0: Attempt to query device size failed: 
NOT READY, Medium not present - tray closed

Then finally boot after a number of reboots (0-N), it finally boots. 
Disabling early AP start allows it to boot past that point first time.


-- 
Cheers,
Cy Schubert 
FreeBSD UNIX:     Web:  http://www.FreeBSD.org

The need of the many outweighs the greed of the few.


___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Ian Lepore
On Tue, 2017-01-17 at 11:00 -0800, John Baldwin wrote:
> >  > You could
> >  > do that by setting it to 'cc_firstevent' of the associated CPU, but in
> >  > practice 'state->nextcall' should already be set to that (it is 
> > initalized
> >  > to SBT_MAX in cpu_initclocks_bsp() and is then only set to other 
> > values due
> >  > to cpu_new_callout()).  Keep in mind that configtimer() is not just 
> > called
> >  > from boot, but is also invoked when starting/stopping the profiling 
> > timer.
> >  >
> > 
> >  > However, when setting 'nextevent' (which is used to schedule the next 
> > timer
> >  > interrupt), we should be honoring the existing 'nextcall' if it is sooner
> >  > than the next hardclock.
> > 
> > Does this matter for the first tick? How often is configtimer() called?
> 
> As I said, it is called at runtime when profclock is started / stopped, not
> just at boot.  Those changes at runtime probably have existing callouts
> active and your change will not process any callouts until the next hardclock
> tick fires (but only because you are setting nextcallopt to the bogus
> 'next' value).

On some platforms, configtimer() can be called quite often.  Power
saving modes can change the frequency of the timer, and systems that
suppport such dynamic frequency scaling call configtimer()
(via cpu_et_frequency()) to handle the changes.

-- Ian

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

Hi,

On 01/17/17 20:00, John Baldwin wrote:


Does this matter for the first tick? How often is configtimer() called?


As I said, it is called at runtime when profclock is started / stopped, not
just at boot.  Those changes at runtime probably have existing callouts
active and your change will not process any callouts until the next hardclock
tick fires (but only because you are setting nextcallopt to the bogus
'next' value).





 > (One odd thing is that even in your case the first call to
handleevents(),
 > the 'now => state->nextcallout' check in handleevents() should be true
 > which resets both nextcall and nextcallopt and invokes
callout_process().)

Let me take you through the failure path, by code inspection:


I would really appreciate it if you could add traces to find out what actually
happens rather than what seems to happen by looking at the code. :-/


The problem is that once you add some prints, the problem goes away. 
Maybe I should try to set hz to 100 or 25 ???




0) cpu_initclocks_bsp() is called and init's nextcall and nexcallopt to SBT_MAX
   similar to your change.  If no callout is scheduled before configtimer()
   then they remain set to SBT_MAX.  Your current patch happens to trigger a
   (bogus) call to callout_process() on the first hardclock() because it
   sets nextcallopt to 'next' even though no callout is actually scheduled to
   fire at time 'next'.


1) configtimer() is called and we init nextcall and nextcallopt:

 > next = now + timerperiod;
...
 >   state->nextcall = next;
 >   state->nextcallopt = next;


These both say "the next callout() should fire at 'next' which is the time of
the next hardclock()", even though there may be no callouts scheduled (in which
case both of these fields should be set to SBT_MAX from the call to
cpu_initclocks_bsp(), or there may be callouts scheduled in which case 
'nextcall'
and 'nextcallopt' will reflect the time that those callouts are already
scheduled for and this overwrites that).


I see there are some callouts scheduled by SYSINITs, before the first 
configtimer(), like NFS_TIMERINIT in nfs_init(). These are setup using 
"dummy_timecounter" which means any nextcall values before the first 
configtimer should be discarded.





2) Any callout_reset() calls cpu_new_callout():

 >  */
 > state->nextcallopt = bt_opt;
 > if (bt >= state->nextcall)
We follow this path, because "bt" is surely based on sbinuptime() and is
greater or equal to state->nextcall. Note that state->nextcallopt is
updated to bt_opt, which is in the future.


Note, my patch should _leave_ nextcall at SBT_MAX (from cpu_initclocks_bsp())
unless there was already an earlier call to callout_reset().


Yes, there are calls to callout_reset(). See for example NFS_TIMERINIT, 
like mentioned above.


> IOW, it

should be a NOP for the purposes of this branch compared with your change.
(You could add a warning to print out if 'nextcall' != SBT_MAX during boot
and see if it fires for example.)


 > goto done;
 > state->nextcall = bt;

3) getnextcpuevent(0) is called by the fast timercb() to setup the next
event:

 > state = DPCPU_PTR(timerstate);
 > /* Handle hardclock() events, skipping some if CPU is idle. */
 > event = state->nexthard;
...
 > /* Handle callout events. */
 > if (event > state->nextcall)
We then go looping into this path, because state->nextcall is still
equal to "next" as in step 1) which is now in the past, until "now >=
state->nextcallopt" inside handleevents(), which clears this condition.
 > event = state->nextcall;
...
 > return (event);


I'm curious if there is a callout_reset() that has set 'nextcall' to a time
that is effectively before 'now'.

Maybe add a printf like this:

Index: kern_clocksource.c
===
--- kern_clocksource.c  (revision 312301)
+++ kern_clocksource.c  (working copy)
@@ -498,12 +498,18 @@ configtimer(int start)
CPU_FOREACH(cpu) {
state = DPCPU_ID_PTR(cpu, timerstate);
state->now = now;
+   printf("%s: CPU %d: now %jd nextcall %jd nextcallopt %jd next 
%jd\n", __func__, cpu, state->nextcall, state->nextcall, next);
 #ifndef EARLY_AP_STARTUP
if (!smp_started && cpu != CPU_FIRST())
state->nextevent = SBT_MAX;
else
 #endif

In particular what I am worried about with your patch is that for
post-boot calls to configtimer() you will delay any previously-scheduled
callouts until the next hardclock.



I understand. Would a solution be to refactor callout_process(), to 
accept the PCPU_GET(CPUID) as an argument and be executed for all CPUs 
by configtimer(), instead of trying to guess state->nextcall and 
state->nextcallopt in configtimer() ?

Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Tuesday, January 17, 2017 07:04:15 PM Hans Petter Selasky wrote:
> On 01/17/17 16:50, John Baldwin wrote:
>  > On Monday, January 16, 2017 10:10:16 PM Hans Petter Selasky wrote:
>  >> On 01/16/17 20:31, John Baldwin wrote:
>  >>> On Monday, January 16, 2017 04:51:42 PM Hans Petter Selasky wrote:
>   Hi,
>  
>   When booting I observe an additional 30-second delay after this print:
>  
>  > Timecounters tick every 1.000 msec
>  
>   ~30 second delay and boot continues like normal.
>  
>   Checking "vmstat -i" reveals that some timers have been running loose.
>  
>  > cpu0:timer 44300442
>  > cpu1:timer 40561404
>  > cpu3:timer  48462822 483058
>  > cpu2:timer  48477898 483209
>  
>   Trying to add delays and/or prints around the Timecounters printout
>   makes the issue go away. Any ideas for debugging?
>  >>>
>  >>> I have generally used KTR tracing to trace what is happening during
>  >>> boot to debug EARLY_AP_STARTUP issues.
>  >>>
>  >>
>  >> Hi John,
>  >>
>  >> What happens is that getnextcpuevent(0) keeps on returning
>  >> "state->nextcall" which is in the past for CPU #2 and #3 on my box.
>  >>
>  >> In "cpu_new_callout()" there is a check if "bt >= state->nextcall",
>  >> which I suspect is true, so "state->nextcall" never gets set to real
>  >> minimum sbintime.
>  >>
>  >> The attached patch fixes the problem for me, but I'm not 100% sure 
> if it
>  >> is correct.
>  >
> 
> Hi,
> 
>  > I think we want to be honoring any currently scheduled callouts.
> 
> The problem here is that we might be changing the clocksource, then 
> sbinuptime() will change too, so I think the value should be reset by 
> configtimer() and then corrected at the next call to callout_process().
> 
>  > You could
>  > do that by setting it to 'cc_firstevent' of the associated CPU, but in
>  > practice 'state->nextcall' should already be set to that (it is 
> initalized
>  > to SBT_MAX in cpu_initclocks_bsp() and is then only set to other 
> values due
>  > to cpu_new_callout()).  Keep in mind that configtimer() is not just 
> called
>  > from boot, but is also invoked when starting/stopping the profiling 
> timer.
>  >
> 
>  > However, when setting 'nextevent' (which is used to schedule the next 
> timer
>  > interrupt), we should be honoring the existing 'nextcall' if it is sooner
>  > than the next hardclock.
> 
> Does this matter for the first tick? How often is configtimer() called?

As I said, it is called at runtime when profclock is started / stopped, not
just at boot.  Those changes at runtime probably have existing callouts
active and your change will not process any callouts until the next hardclock
tick fires (but only because you are setting nextcallopt to the bogus
'next' value).

>  > (One odd thing is that even in your case the first call to 
> handleevents(),
>  > the 'now => state->nextcallout' check in handleevents() should be true
>  > which resets both nextcall and nextcallopt and invokes 
> callout_process().)
> 
> Let me take you through the failure path, by code inspection:

I would really appreciate it if you could add traces to find out what actually
happens rather than what seems to happen by looking at the code. :-/

0) cpu_initclocks_bsp() is called and init's nextcall and nexcallopt to SBT_MAX
   similar to your change.  If no callout is scheduled before configtimer()
   then they remain set to SBT_MAX.  Your current patch happens to trigger a
   (bogus) call to callout_process() on the first hardclock() because it
   sets nextcallopt to 'next' even though no callout is actually scheduled to
   fire at time 'next'.

> 1) configtimer() is called and we init nextcall and nextcallopt:
> 
>  > next = now + timerperiod;
> ...
>  >   state->nextcall = next;
>  >   state->nextcallopt = next;

These both say "the next callout() should fire at 'next' which is the time of
the next hardclock()", even though there may be no callouts scheduled (in which
case both of these fields should be set to SBT_MAX from the call to
cpu_initclocks_bsp(), or there may be callouts scheduled in which case 
'nextcall'
and 'nextcallopt' will reflect the time that those callouts are already
scheduled for and this overwrites that).

> 2) Any callout_reset() calls cpu_new_callout():
> 
>  >  */
>  > state->nextcallopt = bt_opt;
>  > if (bt >= state->nextcall)
> We follow this path, because "bt" is surely based on sbinuptime() and is 
> greater or equal to state->nextcall. Note that state->nextcallopt is 
> updated to bt_opt, which is in the future.

Note, my patch should _leave_ nextcall at SBT_MAX (from cpu_initclocks_bsp())
unless there was already an earlier call to callout_reset().  IOW, it
should be a NOP for the purposes of this branch compared with your change.

Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

On 01/17/17 16:50, John Baldwin wrote:
> On Monday, January 16, 2017 10:10:16 PM Hans Petter Selasky wrote:
>> On 01/16/17 20:31, John Baldwin wrote:
>>> On Monday, January 16, 2017 04:51:42 PM Hans Petter Selasky wrote:
 Hi,

 When booting I observe an additional 30-second delay after this print:

> Timecounters tick every 1.000 msec

 ~30 second delay and boot continues like normal.

 Checking "vmstat -i" reveals that some timers have been running loose.

> cpu0:timer 44300442
> cpu1:timer 40561404
> cpu3:timer  48462822 483058
> cpu2:timer  48477898 483209

 Trying to add delays and/or prints around the Timecounters printout
 makes the issue go away. Any ideas for debugging?
>>>
>>> I have generally used KTR tracing to trace what is happening during
>>> boot to debug EARLY_AP_STARTUP issues.
>>>
>>
>> Hi John,
>>
>> What happens is that getnextcpuevent(0) keeps on returning
>> "state->nextcall" which is in the past for CPU #2 and #3 on my box.
>>
>> In "cpu_new_callout()" there is a check if "bt >= state->nextcall",
>> which I suspect is true, so "state->nextcall" never gets set to real
>> minimum sbintime.
>>
>> The attached patch fixes the problem for me, but I'm not 100% sure 
if it

>> is correct.
>

Hi,

> I think we want to be honoring any currently scheduled callouts.

The problem here is that we might be changing the clocksource, then 
sbinuptime() will change too, so I think the value should be reset by 
configtimer() and then corrected at the next call to callout_process().


> You could
> do that by setting it to 'cc_firstevent' of the associated CPU, but in
> practice 'state->nextcall' should already be set to that (it is 
initalized
> to SBT_MAX in cpu_initclocks_bsp() and is then only set to other 
values due
> to cpu_new_callout()).  Keep in mind that configtimer() is not just 
called
> from boot, but is also invoked when starting/stopping the profiling 
timer.

>

> However, when setting 'nextevent' (which is used to schedule the next 
timer

> interrupt), we should be honoring the existing 'nextcall' if it is sooner
> than the next hardclock.

Does this matter for the first tick? How often is configtimer() called?

> (One odd thing is that even in your case the first call to 
handleevents(),

> the 'now => state->nextcallout' check in handleevents() should be true
> which resets both nextcall and nextcallopt and invokes 
callout_process().)


Let me take you through the failure path, by code inspection:

1) configtimer() is called and we init nextcall and nextcallopt:

> next = now + timerperiod;
...
>   state->nextcall = next;
>   state->nextcallopt = next;

2) Any callout_reset() calls cpu_new_callout():

>  */
> state->nextcallopt = bt_opt;
> if (bt >= state->nextcall)
We follow this path, because "bt" is surely based on sbinuptime() and is 
greater or equal to state->nextcall. Note that state->nextcallopt is 
updated to bt_opt, which is in the future.

> goto done;
> state->nextcall = bt;

3) getnextcpuevent(0) is called by the fast timercb() to setup the next 
event:


> state = DPCPU_PTR(timerstate);
> /* Handle hardclock() events, skipping some if CPU is idle. */
> event = state->nexthard;
...
> /* Handle callout events. */
> if (event > state->nextcall)
We then go looping into this path, because state->nextcall is still 
equal to "next" as in step 1) which is now in the past, until "now >= 
state->nextcallopt" inside handleevents(), which clears this condition.

> event = state->nextcall;
...
> return (event);

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

On 01/17/17 16:50, John Baldwin wrote:

Index: kern_clocksource.c
===
--- kern_clocksource.c  (revision 312301)
+++ kern_clocksource.c  (working copy)
@@ -503,7 +503,12 @@ configtimer(int start)
state->nextevent = SBT_MAX;
else
 #endif
+   if (next < state->nextcall)
state->nextevent = next;
+   else if (state->nextcall < now)
+   state->nextevent = now;
+   else
+   state->nextevent = state->nextcall;
if (periodic)
state->nexttick = next;
else
@@ -511,8 +516,6 @@ configtimer(int start)
state->nexthard = next;
state->nextstat = next;
state->nextprof = next;
-   state->nextcall = next;
-   state->nextcallopt = next;
hardclock_sync(cpu);
}
busy = 0;


This patch makes it worse. Now I don't even reach the login prompt.

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread Hans Petter Selasky

On 01/17/17 16:50, John Baldwin wrote:

(One odd thing is that even in your case the first call to handleevents(),
the 'now => state->nextcallout' check in handleevents() should be true
which resets both nextcall and nextcallopt and invokes callout_process().)


Hi,

I suspect the cpu_new_callout() function is changing this condition 
after a callout_reset() call, before handleevents() gets a chance to run.


I'll give your patch a spin right away and let you know how it goes.

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-17 Thread John Baldwin
On Monday, January 16, 2017 10:10:16 PM Hans Petter Selasky wrote:
> On 01/16/17 20:31, John Baldwin wrote:
> > On Monday, January 16, 2017 04:51:42 PM Hans Petter Selasky wrote:
> >> Hi,
> >>
> >> When booting I observe an additional 30-second delay after this print:
> >>
> >>> Timecounters tick every 1.000 msec
> >>
> >> ~30 second delay and boot continues like normal.
> >>
> >> Checking "vmstat -i" reveals that some timers have been running loose.
> >>
> >>> cpu0:timer 44300442
> >>> cpu1:timer 40561404
> >>> cpu3:timer  48462822 483058
> >>> cpu2:timer  48477898 483209
> >>
> >> Trying to add delays and/or prints around the Timecounters printout
> >> makes the issue go away. Any ideas for debugging?
> >
> > I have generally used KTR tracing to trace what is happening during
> > boot to debug EARLY_AP_STARTUP issues.
> >
> 
> Hi John,
> 
> What happens is that getnextcpuevent(0) keeps on returning 
> "state->nextcall" which is in the past for CPU #2 and #3 on my box.
> 
> In "cpu_new_callout()" there is a check if "bt >= state->nextcall", 
> which I suspect is true, so "state->nextcall" never gets set to real 
> minimum sbintime.
> 
> The attached patch fixes the problem for me, but I'm not 100% sure if it 
> is correct.

I think we want to be honoring any currently scheduled callouts.  You could
do that by setting it to 'cc_firstevent' of the associated CPU, but in
practice 'state->nextcall' should already be set to that (it is initalized
to SBT_MAX in cpu_initclocks_bsp() and is then only set to other values due
to cpu_new_callout()).  Keep in mind that configtimer() is not just called
from boot, but is also invoked when starting/stopping the profiling timer.

However, when setting 'nextevent' (which is used to schedule the next timer
interrupt), we should be honoring the existing 'nextcall' if it is sooner
than the next hardclock.

(One odd thing is that even in your case the first call to handleevents(),
the 'now => state->nextcallout' check in handleevents() should be true
which resets both nextcall and nextcallopt and invokes callout_process().)

Here is a suggestion that attempts what I described in the first paragraph.
If you still get hangs it would be good to break into DDB and capture the
output of 'show clocksource'.

Index: kern_clocksource.c
===
--- kern_clocksource.c  (revision 312301)
+++ kern_clocksource.c  (working copy)
@@ -503,7 +503,12 @@ configtimer(int start)
state->nextevent = SBT_MAX;
else
 #endif
+   if (next < state->nextcall)
state->nextevent = next;
+   else if (state->nextcall < now)
+   state->nextevent = now;
+   else
+   state->nextevent = state->nextcall;
if (periodic)
state->nexttick = next;
else
@@ -511,8 +516,6 @@ configtimer(int start)
state->nexthard = next;
state->nextstat = next;
state->nextprof = next;
-   state->nextcall = next;
-   state->nextcallopt = next;
hardclock_sync(cpu);
}
busy = 0;

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Strange issue after early AP startup

2017-01-16 Thread Hans Petter Selasky

On 01/16/17 20:31, John Baldwin wrote:

On Monday, January 16, 2017 04:51:42 PM Hans Petter Selasky wrote:

Hi,

When booting I observe an additional 30-second delay after this print:


Timecounters tick every 1.000 msec


~30 second delay and boot continues like normal.

Checking "vmstat -i" reveals that some timers have been running loose.


cpu0:timer 44300442
cpu1:timer 40561404
cpu3:timer  48462822 483058
cpu2:timer  48477898 483209


Trying to add delays and/or prints around the Timecounters printout
makes the issue go away. Any ideas for debugging?


I have generally used KTR tracing to trace what is happening during
boot to debug EARLY_AP_STARTUP issues.



Hi John,

What happens is that getnextcpuevent(0) keeps on returning 
"state->nextcall" which is in the past for CPU #2 and #3 on my box.


In "cpu_new_callout()" there is a check if "bt >= state->nextcall", 
which I suspect is true, so "state->nextcall" never gets set to real 
minimum sbintime.


The attached patch fixes the problem for me, but I'm not 100% sure if it 
is correct.


--HPS
diff --git a/sys/kern/kern_clocksource.c b/sys/kern/kern_clocksource.c
index 7f7769d..454a130 100644
--- a/sys/kern/kern_clocksource.c
+++ b/sys/kern/kern_clocksource.c
@@ -511,7 +511,7 @@ configtimer(int start)
 			state->nexthard = next;
 			state->nextstat = next;
 			state->nextprof = next;
-			state->nextcall = next;
+			state->nextcall = SBT_MAX;
 			state->nextcallopt = next;
 			hardclock_sync(cpu);
 		}
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Strange issue after early AP startup

2017-01-16 Thread John Baldwin
On Monday, January 16, 2017 04:51:42 PM Hans Petter Selasky wrote:
> Hi,
> 
> When booting I observe an additional 30-second delay after this print:
> 
> > Timecounters tick every 1.000 msec
> 
> ~30 second delay and boot continues like normal.
> 
> Checking "vmstat -i" reveals that some timers have been running loose.
> 
> > cpu0:timer 44300442
> > cpu1:timer 40561404
> > cpu3:timer  48462822 483058
> > cpu2:timer  48477898 483209
> 
> Trying to add delays and/or prints around the Timecounters printout 
> makes the issue go away. Any ideas for debugging?

I have generally used KTR tracing to trace what is happening during
boot to debug EARLY_AP_STARTUP issues.

-- 
John Baldwin
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Strange issue after early AP startup

2017-01-16 Thread Hans Petter Selasky

Hi,

When booting I observe an additional 30-second delay after this print:


Timecounters tick every 1.000 msec


~30 second delay and boot continues like normal.

Checking "vmstat -i" reveals that some timers have been running loose.


cpu0:timer 44300442
cpu1:timer 40561404
cpu3:timer  48462822 483058
cpu2:timer  48477898 483209


Trying to add delays and/or prints around the Timecounters printout 
makes the issue go away. Any ideas for debugging?


Looks like a startup race to me.

--HPS
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"