Re: huge nanosleep variance on 11-stable

2016-11-27 Thread Konstantin Belousov
On Sat, Nov 26, 2016 at 02:37:45PM -0800, Jason Harmening wrote:
> I can confirm this patch works.  HPET is now chosen over LAPIC as the
> eventtimer source, and the system works smoothly without disabling C2 or
> mwait.

Thank you for testing.

The change was committed to HEAD as r309189.


Re: huge nanosleep variance on 11-stable

2016-11-26 Thread Jason Harmening
I can confirm this patch works.  HPET is now chosen over LAPIC as the
eventtimer source, and the system works smoothly without disabling C2 or
mwait.

On Fri, Nov 25, 2016 at 4:12 AM, Jason Harmening wrote:

> On Fri, Nov 25, 2016 at 1:25 AM, Konstantin Belousov wrote:
>
>> On Wed, Nov 02, 2016 at 06:28:08PM +0200, Konstantin Belousov wrote:
>> > On Wed, Nov 02, 2016 at 09:18:15AM -0700, Jason Harmening wrote:
>> > > I think you are probably right.  Hacking out the Intel-specific
>> > > additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
>> > > going back to sti;hlt instead of monitor+mwait at C1) fixed the
>> problem
>> > > for me.  But r282678 also had the effect of enabling C2 and C3 on my
>> > > system, because ACPI only presents MWAIT entries for those states and
>> > > not p_lvlx.
>> > You can do the same with "debug.acpi.disabled=mwait" loader tunable
>> > without hacking the code. And set sysctl hw.acpi.cpu.cx_lowest to C1 to
>> > enforce use of hlt instruction even when mwait states were requested.
>>
>> I believe I now understand the problem.  First, I got definitive
>> confirmation that the LAPIC timer on Nehalems is stopped in any C-state
>> deeper than C1/C1E, i.e. even if only C2 is enabled, the LAPIC
>> eventtimer cannot be used.  This is consistent with the ARAT CPUID bit,
>> CPUID[0x6].eax[2], being reported as zero.
>>
>> On SandyBridge and IvyBridge CPUs, it seems that ARAT may be either 0
>> or 1 according to the same source, but all the CPUs I have seen have
>> ARAT = 1.  For Haswell and later generations, ARAT is claimed to always
>> be implemented.
>>
>> The actual issue is a somewhat silly bug, I must admit: if ncpus >= 8
>> and HPET does not use FSB interrupt routing, the default HPET
>> eventtimer quality of 450 is reduced by 100, i.e. to 350.  OTOH, the
>> LAPIC default quality is 600, and it is reduced by 200 if ARAT is not
>> reported.  We end up with HPET quality 350 < LAPIC quality 400, even
>> though ARAT is not set.
>>
>> The patch below sets the LAPIC eventtimer quality to 100 when ARAT is
>> absent.  Also, I realized that there is no reason to disable
>> TSC-deadline mode when ARAT is absent, so the patch enables it
>> regardless of ARAT.
>>
>> diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
>> index d9a3453..1b1547d 100644
>> --- a/sys/x86/x86/local_apic.c
>> +++ b/sys/x86/x86/local_apic.c
>> @@ -478,8 +478,9 @@ native_lapic_init(vm_paddr_t addr)
>> lapic_et.et_quality = 600;
>> if (!arat) {
>> lapic_et.et_flags |= ET_FLAGS_C3STOP;
>> -   lapic_et.et_quality -= 200;
>> -   } else if ((cpu_feature & CPUID_TSC) != 0 &&
>> +   lapic_et.et_quality = 100;
>> +   }
>> +   if ((cpu_feature & CPUID_TSC) != 0 &&
>> (cpu_feature2 & CPUID2_TSCDLT) != 0 &&
>> tsc_is_invariant && tsc_freq != 0) {
>> lapic_timer_tsc_deadline = 1;
>>
>> Ah, that makes sense.  Thanks!
>
> I'll try the patch as soon as I get back from vacation.  I've been able to
> verify that setting cx_lowest and disabling mwait fixes the problem without
> hacking the code.  But I've been too busy at $(WORK) to check anything
> else, namely whether forcing HPET would also fix the problem.
>
>


Re: huge nanosleep variance on 11-stable

2016-11-25 Thread Jason Harmening
On Fri, Nov 25, 2016 at 1:25 AM, Konstantin Belousov wrote:

> On Wed, Nov 02, 2016 at 06:28:08PM +0200, Konstantin Belousov wrote:
> > On Wed, Nov 02, 2016 at 09:18:15AM -0700, Jason Harmening wrote:
> > > I think you are probably right.  Hacking out the Intel-specific
> > > additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
> > > going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
> > > for me.  But r282678 also had the effect of enabling C2 and C3 on my
> > > system, because ACPI only presents MWAIT entries for those states and
> > > not p_lvlx.
> > You can do the same with "debug.acpi.disabled=mwait" loader tunable
> > without hacking the code. And set sysctl hw.acpi.cpu.cx_lowest to C1 to
> > enforce use of hlt instruction even when mwait states were requested.
>
> I believe I now understand the problem.  First, I got definitive
> confirmation that the LAPIC timer on Nehalems is stopped in any C-state
> deeper than C1/C1E, i.e. even if only C2 is enabled, the LAPIC
> eventtimer cannot be used.  This is consistent with the ARAT CPUID bit,
> CPUID[0x6].eax[2], being reported as zero.
>
> On SandyBridge and IvyBridge CPUs, it seems that ARAT may be either 0
> or 1 according to the same source, but all the CPUs I have seen have
> ARAT = 1.  For Haswell and later generations, ARAT is claimed to always
> be implemented.
>
> The actual issue is a somewhat silly bug, I must admit: if ncpus >= 8
> and HPET does not use FSB interrupt routing, the default HPET
> eventtimer quality of 450 is reduced by 100, i.e. to 350.  OTOH, the
> LAPIC default quality is 600, and it is reduced by 200 if ARAT is not
> reported.  We end up with HPET quality 350 < LAPIC quality 400, even
> though ARAT is not set.
>
> The patch below sets the LAPIC eventtimer quality to 100 when ARAT is
> absent.  Also, I realized that there is no reason to disable
> TSC-deadline mode when ARAT is absent, so the patch enables it
> regardless of ARAT.
>
> diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
> index d9a3453..1b1547d 100644
> --- a/sys/x86/x86/local_apic.c
> +++ b/sys/x86/x86/local_apic.c
> @@ -478,8 +478,9 @@ native_lapic_init(vm_paddr_t addr)
> lapic_et.et_quality = 600;
> if (!arat) {
> lapic_et.et_flags |= ET_FLAGS_C3STOP;
> -   lapic_et.et_quality -= 200;
> -   } else if ((cpu_feature & CPUID_TSC) != 0 &&
> +   lapic_et.et_quality = 100;
> +   }
> +   if ((cpu_feature & CPUID_TSC) != 0 &&
> (cpu_feature2 & CPUID2_TSCDLT) != 0 &&
> tsc_is_invariant && tsc_freq != 0) {
> lapic_timer_tsc_deadline = 1;
>
> Ah, that makes sense.  Thanks!

I'll try the patch as soon as I get back from vacation.  I've been able to
verify that setting cx_lowest and disabling mwait fixes the problem without
hacking the code.  But I've been too busy at $(WORK) to check anything
else, namely whether forcing HPET would also fix the problem.


Re: huge nanosleep variance on 11-stable

2016-11-25 Thread Konstantin Belousov
On Wed, Nov 02, 2016 at 06:28:08PM +0200, Konstantin Belousov wrote:
> On Wed, Nov 02, 2016 at 09:18:15AM -0700, Jason Harmening wrote:
> > I think you are probably right.  Hacking out the Intel-specific
> > additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
> > going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
> > for me.  But r282678 also had the effect of enabling C2 and C3 on my
> > system, because ACPI only presents MWAIT entries for those states and
> > not p_lvlx.
> You can do the same with "debug.acpi.disabled=mwait" loader tunable
> without hacking the code. And set sysctl hw.acpi.cpu.cx_lowest to C1 to
> enforce use of hlt instruction even when mwait states were requested.

I believe I now understand the problem.  First, I got definitive
confirmation that the LAPIC timer on Nehalems is stopped in any C-state
deeper than C1/C1E, i.e. even if only C2 is enabled, the LAPIC eventtimer
cannot be used.  This is consistent with the ARAT CPUID bit,
CPUID[0x6].eax[2], being reported as zero.

On SandyBridge and IvyBridge CPUs, it seems that ARAT may be either 0
or 1 according to the same source, but all the CPUs I have seen have
ARAT = 1.  For Haswell and later generations, ARAT is claimed to always
be implemented.

The actual issue is a somewhat silly bug, I must admit: if ncpus >= 8
and HPET does not use FSB interrupt routing, the default HPET eventtimer
quality of 450 is reduced by 100, i.e. to 350.  OTOH, the LAPIC default
quality is 600, and it is reduced by 200 if ARAT is not reported.  We end
up with HPET quality 350 < LAPIC quality 400, even though ARAT is not set.

The patch below sets the LAPIC eventtimer quality to 100 when ARAT is
absent.  Also, I realized that there is no reason to disable TSC-deadline
mode when ARAT is absent, so the patch enables it regardless of ARAT.

diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
index d9a3453..1b1547d 100644
--- a/sys/x86/x86/local_apic.c
+++ b/sys/x86/x86/local_apic.c
@@ -478,8 +478,9 @@ native_lapic_init(vm_paddr_t addr)
lapic_et.et_quality = 600;
if (!arat) {
lapic_et.et_flags |= ET_FLAGS_C3STOP;
-   lapic_et.et_quality -= 200;
-   } else if ((cpu_feature & CPUID_TSC) != 0 &&
+   lapic_et.et_quality = 100;
+   }
+   if ((cpu_feature & CPUID_TSC) != 0 &&
(cpu_feature2 & CPUID2_TSCDLT) != 0 &&
tsc_is_invariant && tsc_freq != 0) {
lapic_timer_tsc_deadline = 1;



Re: huge nanosleep variance on 11-stable

2016-11-03 Thread Scott Bennett
On Wed, 2 Nov 2016 10:23:24 -0400 George Mitchell wrote:
>On 11/01/16 23:45, Kevin Oberman wrote:
>> On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:
>> 
>>> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
>>> up to 500ms for me either way.
>>>
>>> On 11/01/16 14:29, Jason Harmening wrote:
>>>> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
>>>>
>>>> On 11/01/16 13:58, Jason Harmening wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>>>>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>>>>> certain applications (but not the system as a whole) respond very
>>>>> sluggishly, and video playback of any kind is extremely choppy.
>>>>>
>>>>> [...]
>> I eliminated the annoyance by change scheduler from ULE to 4BSD. That was
>> it, but I have not seen the issue since. I'd be very interested in whether
>> the scheduler is somehow impacting timing functions or it's s different
>> issue. I've felt that there was something off in ULE for some time, but it
>> was not until these annoying hiccups convinced me to try going back to
>> 4BSD.
>> 
>> Tip o' the hat to Doug B. for his suggestions that ULE may have issues that
>> impacted interactivity.
>> [...]
>
>Not to beat a dead horse, but I've been a non-fan of SCHED_ULE since
>it was first introduced, and I don't like it even today.  I run the
>distributed.net client on my machines, but even without that, ULE
>screws interactive behavior.  With distributed.net running and ULE,
>a make buildworld/make buildkernel takes 10 2/3 hours on 10.3-RELEASE
>on a 6-CPU machine versus 2 1/2 hours on the same machine with 4BSD
>and distributed.net running.  I'm told that SCHED_ULE is the greatest
>thing since sliced bread for some compute load or other (details are
>scarce), but I (fortunately) don't often have to run heavy server
>type loads; and for everyday use (even without distributed.net
>running), SCHED_4BSD is my choice by far.  It's too bad I can't run
>freebsd_update with it, though.
>
 I gave up on ULE during 8-STABLE.  I had tried tinkering with
kern.sched.preempt_thresh as recommended, as well as some more extreme
values, but I couldn't see any improvement.  Some values may have made
performance even worse.  The last straw for me, however, was when I
discovered that ULE happily scheduled *idle* priority processes at times
when both CPU threads on a P4 Prescott were tied up by 100% CPU-bound
(mprime) threads at normal priority niced to 20.  Idle priority tasks
should *only* run when no higher priority tasks are available to run for
all CPU threads.  The 4BSD scheduler handles this situation properly.
 Now I'm running 10.3-STABLE on a QX9650, and I haven't tested ULE
on it to see whether it's still as flawed.  If and when I get a machine
with a multi-cored, hyperthreaded CPU or perhaps a board with multiple
CPU chips, then I may worry about the multi-level affinity stuff that
ULE was supposedly designed for enough to bother testing it.  But for
now, I can't see any advantage in it for my current machine.


  Scott Bennett, Comm. ASMELG, CFIAG
**
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
**
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."   *
*-- Gov. John Hancock, New York Journal, 28 January 1790 *
**


Re: huge nanosleep variance on 11-stable

2016-11-02 Thread Ian Smith
On Wed, 2 Nov 2016 18:28:08 +0200, Konstantin Belousov wrote:
 > On Wed, Nov 02, 2016 at 09:18:15AM -0700, Jason Harmening wrote:
 > > I think you are probably right.  Hacking out the Intel-specific
 > > additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
 > > going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
 > > for me.  But r282678 also had the effect of enabling C2 and C3 on my
 > > system, because ACPI only presents MWAIT entries for those states and
 > > not p_lvlx.

 > You can do the same with "debug.acpi.disabled=mwait" loader tunable
 > without hacking the code. And set sysctl hw.acpi.cpu.cx_lowest to C1 to
 > enforce use of hlt instruction even when mwait states were requested.

But hw.acpi.cpu.cx_lowest=C1 disables C2 & C3 etc, and Jason wanted 
those - but without using mwait - if I've read him right?

cheers, Ian


Re: huge nanosleep variance on 11-stable

2016-11-02 Thread Konstantin Belousov
On Wed, Nov 02, 2016 at 09:18:15AM -0700, Jason Harmening wrote:
> I think you are probably right.  Hacking out the Intel-specific
> additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
> going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
> for me.  But r282678 also had the effect of enabling C2 and C3 on my
> system, because ACPI only presents MWAIT entries for those states and
> not p_lvlx.
You can do the same with the "debug.acpi.disabled=mwait" loader tunable,
without hacking the code.  And set the sysctl hw.acpi.cpu.cx_lowest to C1
to enforce use of the hlt instruction even when mwait states were
requested.
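
Concretely, that would be something like the following (just a sketch of
the settings named above, not a tested configuration):

in /boot/loader.conf:
        debug.acpi.disabled="mwait"

and in /etc/sysctl.conf, or at runtime with sysctl(8):
        hw.acpi.cpu.cx_lowest=C1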


Re: huge nanosleep variance on 11-stable

2016-11-02 Thread Jason Harmening


On 11/02/16 00:55, Konstantin Belousov wrote:
> On Tue, Nov 01, 2016 at 02:29:13PM -0700, Jason Harmening wrote:
>> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
>>
>> On 11/01/16 13:58, Jason Harmening wrote:
>>> Hi everyone,
>>>
>>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>>> certain applications (but not the system as a whole) respond very
>>> sluggishly, and video playback of any kind is extremely choppy.
>>>
>>> The system is under very light load, and I see no evidence of abnormal
>>> interrupt latency or interrupt load.  More interestingly, if I place the
>>> system under full load (~0.0% idle) the problem *disappears* and
>>> playback/responsiveness are smooth and quick.
>>>
>>> Running ktrace on some of the affected apps points me at the problem:
>>> huge variance in the amount of time spent in the nanosleep system call.
>>> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
>>> to return of the syscall.  OTOH, anything CPU-bound or that waits on
>>> condvars or I/O interrupts seems to work fine, so this doesn't seem to
>>> be an issue with overall system latency.
>>>
>>> I can repro this with a simple program that just does a 3ms usleep in a
>>> tight loop (i.e. roughly the amount of time a video player would sleep
>>> between frames @ 30fps).  At light load ktrace will show the huge
>>> nanosleep variance; under heavy load every nanosleep will complete in
>>> almost exactly 3ms.
>>>
>>> FWIW, I don't see this on -current, although right now all my -current
>>> images are VMs on different HW so that might not mean anything.  I'm not
>>> aware of any recent timer- or scheduler- specific changes, so I'm
>>> wondering if perhaps the recent IPI or taskqueue changes might be
>>> somehow to blame.
>>>
>>> I'm not especially familiar w/ the relevant parts of the kernel, so any
>>> guidance on where I should focus my debugging efforts would be much
>>> appreciated.
>>>
> 
> I am confident, with a very high degree of certainty, that the issue is
> a CPU bug in the interaction between deep sleep states (C6) and the
> LAPIC timer.  Check what hardware is used for the eventtimers:
>   sysctl kern.eventtimer.timer
> It should report LAPIC, and you should get rid of the jitter by setting
> the sysctl to HPET.  Also please show the first 50 lines of the verbose
> boot dmesg.
> 
> I know that the Nehalem cores are affected; I do not know whether the
> bug was fixed for Westmere or not.  I asked an Intel contact about the
> problem, but got no response.  That is not unreasonable, given that the
> CPUs are beyond their support lifetime.  I intended to automatically
> bump the HPET quality on Nehalem, and maybe Westmere, but I was not
> able to check Westmere and waited for more information, so this was
> forgotten.  BTW, using the latest CPU microcode did not help.
> 
> After I discovered this, I specifically looked at my Sandy and Haswell
> test systems, but they do not exhibit this problem.
> 
> In the Intel document 320836-036US, 'Intel(R) Core(TM) i7-900 Desktop
> Processor Extreme Edition Series and Intel(R) Core(TM) i7-900 Desktop
> Processor Series Specification Update', there are two errata which
> might be relevant and point to the LAPIC bugs: AAJ47 (but the default is
> not to use periodic mode) and AAJ121.  AAJ121 might be the real cause,
> but Intel does not provide enough details to be sure.  And of course,
> the suggested workaround is not feasible.
> 
> Googling for 'Windows LAPIC Nehalem' shows very interesting results,
> in particular,
> https://support.microsoft.com/en-us/kb/2000977 (which I think is the bug
> you see) and
> https://hardware.slashdot.org/story/09/11/28/1723257/microsoft-advice-against-nehalem-xeons-snuffed-out
> for amusement.
> 

I think you are probably right.  Hacking out the Intel-specific
additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
for me.  But r282678 also had the effect of enabling C2 and C3 on my
system, because ACPI only presents MWAIT entries for those states and
not p_lvlx.

I will try switching to HPET when I have more time to test; it may be a few
days.






Re: huge nanosleep variance on 11-stable

2016-11-02 Thread George Mitchell
On 11/01/16 23:45, Kevin Oberman wrote:
> On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:
> 
>> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
>> up to 500ms for me either way.
>>
>> On 11/01/16 14:29, Jason Harmening wrote:
>>> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
>>>
>>> On 11/01/16 13:58, Jason Harmening wrote:
>>>> Hi everyone,
>>>>
>>>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>>>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>>>> certain applications (but not the system as a whole) respond very
>>>> sluggishly, and video playback of any kind is extremely choppy.
>>>>
>>>> [...]
> I eliminated the annoyance by change scheduler from ULE to 4BSD. That was
> it, but I have not seen the issue since. I'd be very interested in whether
> the scheduler is somehow impacting timing functions or it's s different
> issue. I've felt that there was something off in ULE for some time, but it
> was not until these annoying hiccups convinced me to try going back to
> 4BSD.
> 
> Tip o' the hat to Doug B. for his suggestions that ULE may have issues that
> impacted interactivity.
> [...]

Not to beat a dead horse, but I've been a non-fan of SCHED_ULE since
it was first introduced, and I don't like it even today.  I run the
distributed.net client on my machines, but even without that, ULE
screws interactive behavior.  With distributed.net running and ULE,
a make buildworld/make buildkernel takes 10 2/3 hours on 10.3-RELEASE
on a 6-CPU machine versus 2 1/2 hours on the same machine with 4BSD
and distributed.net running.  I'm told that SCHED_ULE is the greatest
thing since sliced bread for some compute load or other (details are
scarce), but I (fortunately) don't often have to run heavy server
type loads; and for everyday use (even without distributed.net
running), SCHED_4BSD is my choice by far.  It's too bad I can't run
freebsd_update with it, though.

I promise to shut up about this now.   -- George


Re: huge nanosleep variance on 11-stable

2016-11-02 Thread Konstantin Belousov
On Tue, Nov 01, 2016 at 02:29:13PM -0700, Jason Harmening wrote:
> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> 
> On 11/01/16 13:58, Jason Harmening wrote:
> > Hi everyone,
> > 
> > I recently upgraded my main amd64 server from 10.3-stable (r302011) to
> > 11.0-stable (r308099).  It went smoothly except for one big issue:
> > certain applications (but not the system as a whole) respond very
> > sluggishly, and video playback of any kind is extremely choppy.
> > 
> > The system is under very light load, and I see no evidence of abnormal
> > interrupt latency or interrupt load.  More interestingly, if I place the
> > system under full load (~0.0% idle) the problem *disappears* and
> > playback/responsiveness are smooth and quick.
> > 
> > Running ktrace on some of the affected apps points me at the problem:
> > huge variance in the amount of time spent in the nanosleep system call.
> > A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
> > to return of the syscall.  OTOH, anything CPU-bound or that waits on
> > condvars or I/O interrupts seems to work fine, so this doesn't seem to
> > be an issue with overall system latency.
> > 
> > I can repro this with a simple program that just does a 3ms usleep in a
> > tight loop (i.e. roughly the amount of time a video player would sleep
> > between frames @ 30fps).  At light load ktrace will show the huge
> > nanosleep variance; under heavy load every nanosleep will complete in
> > almost exactly 3ms.
> > 
> > FWIW, I don't see this on -current, although right now all my -current
> > images are VMs on different HW so that might not mean anything.  I'm not
> > aware of any recent timer- or scheduler- specific changes, so I'm
> > wondering if perhaps the recent IPI or taskqueue changes might be
> > somehow to blame.
> > 
> > I'm not especially familiar w/ the relevant parts of the kernel, so any
> > guidance on where I should focus my debugging efforts would be much
> > appreciated.
> > 

I am confident, with a very high degree of certainty, that the issue is a
CPU bug in the interaction between deep sleep states (C6) and the LAPIC
timer.  Check what hardware is used for the eventtimers:
	sysctl kern.eventtimer.timer
It should report LAPIC, and you should get rid of the jitter by setting
the sysctl to HPET.  Also please show the first 50 lines of the verbose
boot dmesg.
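
For example, to check the current timer and switch to HPET at runtime for
testing (kern.eventtimer.choice, not mentioned above, is the stock sysctl
that lists the available timers together with their qualities):

	sysctl kern.eventtimer.choice
	sysctl kern.eventtimer.timer
	sysctl kern.eventtimer.timer=HPET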

I know that the Nehalem cores are affected; I do not know whether the bug
was fixed for Westmere or not.  I asked an Intel contact about the problem,
but got no response.  That is not unreasonable, given that the CPUs are
beyond their support lifetime.  I intended to automatically bump the HPET
quality on Nehalem, and maybe Westmere, but I was not able to check
Westmere and waited for more information, so this was forgotten.
BTW, using the latest CPU microcode did not help.

After I discovered this, I specifically looked at my Sandy and Haswell
test systems, but they do not exhibit this problem.

In the Intel document 320836-036US, 'Intel(R) Core(TM) i7-900 Desktop
Processor Extreme Edition Series and Intel(R) Core(TM) i7-900 Desktop
Processor Series Specification Update', there are two errata which
might be relevant and point to the LAPIC bugs: AAJ47 (but the default is
not to use periodic mode) and AAJ121.  AAJ121 might be the real cause,
but Intel does not provide enough details to be sure.  And of course,
the suggested workaround is not feasible.

Googling for 'Windows LAPIC Nehalem' shows very interesting results,
in particular,
https://support.microsoft.com/en-us/kb/2000977 (which I think is the bug
you see) and
https://hardware.slashdot.org/story/09/11/28/1723257/microsoft-advice-against-nehalem-xeons-snuffed-out
for amusement.


Re: huge nanosleep variance on 11-stable

2016-11-02 Thread Jason Harmening


On 11/01/16 22:49, Kevin Oberman wrote:
> On Tue, Nov 1, 2016 at 10:16 PM, Jason Harmening wrote:
> 
> 
> 
> On 11/01/16 20:45, Kevin Oberman wrote:
> > On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:
> >
> > Sorry, that should be ~*30ms* to get 30fps, though the
> variance is still
> > up to 500ms for me either way.
> >
> > On 11/01/16 14:29, Jason Harmening wrote:
> > > repro code is at http://pastebin.com/B68N4AFY if anyone's
> interested.
> > >
> > > On 11/01/16 13:58, Jason Harmening wrote:
> > >> Hi everyone,
> > >>
> > >> I recently upgraded my main amd64 server from 10.3-stable
> > (r302011) to
> > >> 11.0-stable (r308099).  It went smoothly except for one big
> issue:
> > >> certain applications (but not the system as a whole)
> respond very
> > >> sluggishly, and video playback of any kind is extremely choppy.
> > >>
> > >> The system is under very light load, and I see no evidence of
> > abnormal
> > >> interrupt latency or interrupt load.  More interestingly, if I
> > place the
> > >> system under full load (~0.0% idle) the problem
> *disappears* and
> > >> playback/responsiveness are smooth and quick.
> > >>
> > >> Running ktrace on some of the affected apps points me at
> the problem:
> > >> huge variance in the amount of time spent in the nanosleep
> system
> > call.
> > >> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms
> from
> > entry
> > >> to return of the syscall.  OTOH, anything CPU-bound or that
> waits on
> > >> condvars or I/O interrupts seems to work fine, so this doesn't
> > seem to
> > >> be an issue with overall system latency.
> > >>
> > >> I can repro this with a simple program that just does a 3ms
> > usleep in a
> > >> tight loop (i.e. roughly the amount of time a video player
> would
> > sleep
> > >> between frames @ 30fps).  At light load ktrace will show
> the huge
> > >> nanosleep variance; under heavy load every nanosleep will
> complete in
> > >> almost exactly 3ms.
> > >>
> > >> FWIW, I don't see this on -current, although right now all my
> > -current
> > >> images are VMs on different HW so that might not mean anything.
> > I'm not
> > >> aware of any recent timer- or scheduler- specific changes,
> so I'm
> > >> wondering if perhaps the recent IPI or taskqueue changes
> might be
> > >> somehow to blame.
> > >>
> > >> I'm not especially familiar w/ the relevant parts of the
> kernel,
> > so any
> > >> guidance on where I should focus my debugging efforts would
> be much
> > >> appreciated.
> > >>
> > >> Thanks,
> > >> Jason
> >
> >
> > This is likely off track, but this is a behavior I have noticed since
> > moving to 11, though it might have started in 10.3-STABLE before
> moving
> > to head before 11 went to beta. I can't explain any way nanosleep
> could
> > be involved, but I saw annoying lock-ups similar to yours. I also no
> > longer see them.
> >
> > I eliminated the annoyance by change scheduler from ULE to 4BSD. That
> > was it, but I have not seen the issue since. I'd be very interested in
> > whether the scheduler is somehow impacting timing functions or it's s
> > different issue. I've felt that there was something off in ULE for
> some
> > time, but it was not until these annoying hiccups convinced me to try
> > going back to 4BSD.
> >
> > Tip o' the hat to Doug B. for his suggestions that ULE may have issues
> > that impacted interactivity.
> 
> I figured it out: r282678 (which was never MFCed to 10-stable) added
> support for the MWAIT instruction on the idle path for Intel CPUs that
> claim to support it.
> 
> While my CPU (2009-era Xeon 5500) advertises support for it in its
> feature mask and ACPI C-state entries, the cores don't seem to respond
> very quickly to interrupts while idling in MWAIT.  Disabling mwait in
> acpi_cpu.c and falling back to the old "sti; hlt" mechanism for C1
> completely fixes the responsiveness issues.
> 
> So if your CPU is of a similar vintage, it may not be ULE's fault.
> 
> 
> You are almost certainly correct. My system is circa 2011; i5-2520M,
> Sandy Bridge. While it might have the same issue, I'd be surprised. It's
> possible, but probably completely different from what you are seeing.

Re: huge nanosleep variance on 11-stable

2016-11-01 Thread Kevin Oberman
On Tue, Nov 1, 2016 at 10:16 PM, Jason Harmening wrote:

>
>
> On 11/01/16 20:45, Kevin Oberman wrote:
> > On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:
> >
> > Sorry, that should be ~*30ms* to get 30fps, though the variance is
> still
> > up to 500ms for me either way.
> >
> > On 11/01/16 14:29, Jason Harmening wrote:
> > > repro code is at http://pastebin.com/B68N4AFY if anyone's
> interested.
> > >
> > > On 11/01/16 13:58, Jason Harmening wrote:
> > >> Hi everyone,
> > >>
> > >> I recently upgraded my main amd64 server from 10.3-stable
> > (r302011) to
> > >> 11.0-stable (r308099).  It went smoothly except for one big issue:
> > >> certain applications (but not the system as a whole) respond very
> > >> sluggishly, and video playback of any kind is extremely choppy.
> > >>
> > >> The system is under very light load, and I see no evidence of
> > abnormal
> > >> interrupt latency or interrupt load.  More interestingly, if I
> > place the
> > >> system under full load (~0.0% idle) the problem *disappears* and
> > >> playback/responsiveness are smooth and quick.
> > >>
> > >> Running ktrace on some of the affected apps points me at the
> problem:
> > >> huge variance in the amount of time spent in the nanosleep system
> > call.
> > >> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from
> > entry
> > >> to return of the syscall.  OTOH, anything CPU-bound or that waits
> on
> > >> condvars or I/O interrupts seems to work fine, so this doesn't
> > seem to
> > >> be an issue with overall system latency.
> > >>
> > >> I can repro this with a simple program that just does a 3ms
> > usleep in a
> > >> tight loop (i.e. roughly the amount of time a video player would
> > sleep
> > >> between frames @ 30fps).  At light load ktrace will show the huge
> > >> nanosleep variance; under heavy load every nanosleep will
> complete in
> > >> almost exactly 3ms.
> > >>
> > >> FWIW, I don't see this on -current, although right now all my
> > -current
> > >> images are VMs on different HW so that might not mean anything.
> > I'm not
> > >> aware of any recent timer- or scheduler- specific changes, so I'm
> > >> wondering if perhaps the recent IPI or taskqueue changes might be
> > >> somehow to blame.
> > >>
> > >> I'm not especially familiar w/ the relevant parts of the kernel,
> > so any
> > >> guidance on where I should focus my debugging efforts would be
> much
> > >> appreciated.
> > >>
> > >> Thanks,
> > >> Jason
> >
> >
> > This is likely off track, but this is a behavior I have noticed since
> > moving to 11, though it might have started in 10.3-STABLE before moving
> > to head before 11 went to beta. I can't explain any way nanosleep could
> > be involved, but I saw annoying lock-ups similar to yours. I also no
> > longer see them.
> >
> > I eliminated the annoyance by change scheduler from ULE to 4BSD. That
> > was it, but I have not seen the issue since. I'd be very interested in
> > whether the scheduler is somehow impacting timing functions or it's s
> > different issue. I've felt that there was something off in ULE for some
> > time, but it was not until these annoying hiccups convinced me to try
> > going back to 4BSD.
> >
> > Tip o' the hat to Doug B. for his suggestions that ULE may have issues
> > that impacted interactivity.
>
> I figured it out: r282678 (which was never MFCed to 10-stable) added
> support for the MWAIT instruction on the idle path for Intel CPUs that
> claim to support it.
>
> While my CPU (2009-era Xeon 5500) advertises support for it in its
> feature mask and ACPI C-state entries, the cores don't seem to respond
> very quickly to interrupts while idling in MWAIT.  Disabling mwait in
> acpi_cpu.c and falling back to the old "sti; hlt" mechanism for C1
> completely fixes the responsiveness issues.
>
> So if your CPU is of a similar vintage, it may not be ULE's fault.
>
>
You are almost certainly correct. My system is circa 2011; i5-2520M, Sandy
Bridge. While it might have the same issue, I'd be surprised. It's
possible, but probably completely different from what you are seeing.
Reports of the problem I was seeing definitely pre-date 11, but 11 made
things much worse, so it could be a combination of things. And I can't see
how ULE could have anything to do with this issue.

Congratulations on some really good sleuthing to find this.
--
Kevin Oberman, Part time kid herder and retired Network Engineer
E-mail: rkober...@gmail.com
PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683

Re: huge nanosleep variance on 11-stable

2016-11-01 Thread Jason Harmening


On 11/01/16 20:45, Kevin Oberman wrote:
> On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:
> 
> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
> up to 500ms for me either way.
> 
> On 11/01/16 14:29, Jason Harmening wrote:
> > repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> >
> > On 11/01/16 13:58, Jason Harmening wrote:
> >> Hi everyone,
> >>
> >> I recently upgraded my main amd64 server from 10.3-stable
> (r302011) to
> >> 11.0-stable (r308099).  It went smoothly except for one big issue:
> >> certain applications (but not the system as a whole) respond very
> >> sluggishly, and video playback of any kind is extremely choppy.
> >>
> >> The system is under very light load, and I see no evidence of
> abnormal
> >> interrupt latency or interrupt load.  More interestingly, if I
> place the
> >> system under full load (~0.0% idle) the problem *disappears* and
> >> playback/responsiveness are smooth and quick.
> >>
> >> Running ktrace on some of the affected apps points me at the problem:
> >> huge variance in the amount of time spent in the nanosleep system
> call.
> >> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from
> entry
> >> to return of the syscall.  OTOH, anything CPU-bound or that waits on
> >> condvars or I/O interrupts seems to work fine, so this doesn't
> seem to
> >> be an issue with overall system latency.
> >>
> >> I can repro this with a simple program that just does a 3ms
> usleep in a
> >> tight loop (i.e. roughly the amount of time a video player would
> sleep
> >> between frames @ 30fps).  At light load ktrace will show the huge
> >> nanosleep variance; under heavy load every nanosleep will complete in
> >> almost exactly 3ms.
> >>
> >> FWIW, I don't see this on -current, although right now all my
> -current
> >> images are VMs on different HW so that might not mean anything. 
> I'm not
> >> aware of any recent timer- or scheduler- specific changes, so I'm
> >> wondering if perhaps the recent IPI or taskqueue changes might be
> >> somehow to blame.
> >>
> >> I'm not especially familiar w/ the relevant parts of the kernel,
> so any
> >> guidance on where I should focus my debugging efforts would be much
> >> appreciated.
> >>
> >> Thanks,
> >> Jason
> 
>  
> This is likely off track, but this is a behavior I have noticed since
> moving to 11, though it might have started in 10.3-STABLE before moving
> to head before 11 went to beta. I can't explain any way nanosleep could
> be involved, but I saw annoying lock-ups similar to yours. I also no
> longer see them.
> 
> I eliminated the annoyance by change scheduler from ULE to 4BSD. That
> was it, but I have not seen the issue since. I'd be very interested in
> whether the scheduler is somehow impacting timing functions or it's s
> different issue. I've felt that there was something off in ULE for some
> time, but it was not until these annoying hiccups convinced me to try
> going back to 4BSD.
> 
> Tip o' the hat to Doug B. for his suggestions that ULE may have issues
> that impacted interactivity.

I figured it out: r282678 (which was never MFCed to 10-stable) added
support for the MWAIT instruction on the idle path for Intel CPUs that
claim to support it.

While my CPU (2009-era Xeon 5500) advertises support for it in its
feature mask and ACPI C-state entries, the cores don't seem to respond
very quickly to interrupts while idling in MWAIT.  Disabling mwait in
acpi_cpu.c and falling back to the old "sti; hlt" mechanism for C1
completely fixes the responsiveness issues.
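
(For anyone checking their own machine, the C-state setup is visible via
sysctl, e.g.:

	sysctl hw.acpi.cpu.cx_lowest
	sysctl dev.cpu.0.cx_supported dev.cpu.0.cx_usage

This is just a pointer for inspection; the exact dev.cpu.N entries depend
on the system.)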

So if your CPU is of a similar vintage, it may not be ULE's fault.





Re: huge nanosleep variance on 11-stable

2016-11-01 Thread Kevin Oberman
On Tue, Nov 1, 2016 at 2:36 PM, Jason Harmening wrote:

> Sorry, that should be ~*30ms* to get 30fps, though the variance is still
> up to 500ms for me either way.
>
> On 11/01/16 14:29, Jason Harmening wrote:
> > repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> >
> > On 11/01/16 13:58, Jason Harmening wrote:
> >> Hi everyone,
> >>
> >> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
> >> 11.0-stable (r308099).  It went smoothly except for one big issue:
> >> certain applications (but not the system as a whole) respond very
> >> sluggishly, and video playback of any kind is extremely choppy.
> >>
> >> The system is under very light load, and I see no evidence of abnormal
> >> interrupt latency or interrupt load.  More interestingly, if I place the
> >> system under full load (~0.0% idle) the problem *disappears* and
> >> playback/responsiveness are smooth and quick.
> >>
> >> Running ktrace on some of the affected apps points me at the problem:
> >> huge variance in the amount of time spent in the nanosleep system call.
> >> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
> >> to return of the syscall.  OTOH, anything CPU-bound or that waits on
> >> condvars or I/O interrupts seems to work fine, so this doesn't seem to
> >> be an issue with overall system latency.
> >>
> >> I can repro this with a simple program that just does a 3ms usleep in a
> >> tight loop (i.e. roughly the amount of time a video player would sleep
> >> between frames @ 30fps).  At light load ktrace will show the huge
> >> nanosleep variance; under heavy load every nanosleep will complete in
> >> almost exactly 3ms.
> >>
> >> FWIW, I don't see this on -current, although right now all my -current
> >> images are VMs on different HW so that might not mean anything.  I'm not
> >> aware of any recent timer- or scheduler- specific changes, so I'm
> >> wondering if perhaps the recent IPI or taskqueue changes might be
> >> somehow to blame.
> >>
> >> I'm not especially familiar w/ the relevant parts of the kernel, so any
> >> guidance on where I should focus my debugging efforts would be much
> >> appreciated.
> >>
> >> Thanks,
> >> Jason
>

This is likely off track, but this is a behavior I have noticed since
moving to 11, though it might have started in 10.3-STABLE before I moved to
head, before 11 went to beta. I can't explain how nanosleep could be
involved, but I saw annoying lock-ups similar to yours. I also no longer
see them.

I eliminated the annoyance by changing the scheduler from ULE to 4BSD. That
was all it took, and I have not seen the issue since. I'd be very interested
in whether the scheduler is somehow impacting timing functions or whether
it's a different issue. I've felt that there was something off in ULE for
some time, but it was not until these annoying hiccups that I was convinced
to try going back to 4BSD.

Tip o' the hat to Doug B. for his suggestions that ULE may have issues that
impacted interactivity.


Re: huge nanosleep variance on 11-stable

2016-11-01 Thread Jason Harmening
Sorry, that should be ~*30ms* to get 30fps, though the variance is still
up to 500ms for me either way.

On 11/01/16 14:29, Jason Harmening wrote:
> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> 
> On 11/01/16 13:58, Jason Harmening wrote:
>> Hi everyone,
>>
>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>> certain applications (but not the system as a whole) respond very
>> sluggishly, and video playback of any kind is extremely choppy.
>>
>> The system is under very light load, and I see no evidence of abnormal
>> interrupt latency or interrupt load.  More interestingly, if I place the
>> system under full load (~0.0% idle) the problem *disappears* and
>> playback/responsiveness are smooth and quick.
>>
>> Running ktrace on some of the affected apps points me at the problem:
>> huge variance in the amount of time spent in the nanosleep system call.
>> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
>> to return of the syscall.  OTOH, anything CPU-bound or that waits on
>> condvars or I/O interrupts seems to work fine, so this doesn't seem to
>> be an issue with overall system latency.
>>
>> I can repro this with a simple program that just does a 3ms usleep in a
>> tight loop (i.e. roughly the amount of time a video player would sleep
>> between frames @ 30fps).  At light load ktrace will show the huge
>> nanosleep variance; under heavy load every nanosleep will complete in
>> almost exactly 3ms.
>>
>> FWIW, I don't see this on -current, although right now all my -current
>> images are VMs on different HW so that might not mean anything.  I'm not
>> aware of any recent timer- or scheduler- specific changes, so I'm
>> wondering if perhaps the recent IPI or taskqueue changes might be
>> somehow to blame.
>>
>> I'm not especially familiar w/ the relevant parts of the kernel, so any
>> guidance on where I should focus my debugging efforts would be much
>> appreciated.
>>
>> Thanks,
>> Jason
>>
> 





Re: huge nanosleep variance on 11-stable

2016-11-01 Thread Jason Harmening
repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
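
For those who'd rather not follow the link, it's roughly the following: a
short usleep() in a tight loop, reporting how long each sleep actually
took.  A minimal sketch to the same effect (not the exact pastebin
contents):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
        struct timespec t0, t1;
        double ms;
        int i;

        for (i = 0; i < 1000; i++) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                usleep(3000);           /* request a 3ms sleep */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1000000.0;
                if (ms > 10.0)          /* report sleeps that overshoot badly */
                        printf("iteration %d: slept %.3f ms\n", i, ms);
        }
        return (0);
}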

On 11/01/16 13:58, Jason Harmening wrote:
> Hi everyone,
> 
> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
> 11.0-stable (r308099).  It went smoothly except for one big issue:
> certain applications (but not the system as a whole) respond very
> sluggishly, and video playback of any kind is extremely choppy.
> 
> The system is under very light load, and I see no evidence of abnormal
> interrupt latency or interrupt load.  More interestingly, if I place the
> system under full load (~0.0% idle) the problem *disappears* and
> playback/responsiveness are smooth and quick.
> 
> Running ktrace on some of the affected apps points me at the problem:
> huge variance in the amount of time spent in the nanosleep system call.
> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
> to return of the syscall.  OTOH, anything CPU-bound or that waits on
> condvars or I/O interrupts seems to work fine, so this doesn't seem to
> be an issue with overall system latency.
> 
> I can repro this with a simple program that just does a 3ms usleep in a
> tight loop (i.e. roughly the amount of time a video player would sleep
> between frames @ 30fps).  At light load ktrace will show the huge
> nanosleep variance; under heavy load every nanosleep will complete in
> almost exactly 3ms.
> 
> FWIW, I don't see this on -current, although right now all my -current
> images are VMs on different HW so that might not mean anything.  I'm not
> aware of any recent timer- or scheduler- specific changes, so I'm
> wondering if perhaps the recent IPI or taskqueue changes might be
> somehow to blame.
> 
> I'm not especially familiar w/ the relevant parts of the kernel, so any
> guidance on where I should focus my debugging efforts would be much
> appreciated.
> 
> Thanks,
> Jason
> 





huge nanosleep variance on 11-stable

2016-11-01 Thread Jason Harmening
Hi everyone,

I recently upgraded my main amd64 server from 10.3-stable (r302011) to
11.0-stable (r308099).  It went smoothly except for one big issue:
certain applications (but not the system as a whole) respond very
sluggishly, and video playback of any kind is extremely choppy.

The system is under very light load, and I see no evidence of abnormal
interrupt latency or interrupt load.  More interestingly, if I place the
system under full load (~0.0% idle) the problem *disappears* and
playback/responsiveness are smooth and quick.

Running ktrace on some of the affected apps points me at the problem:
huge variance in the amount of time spent in the nanosleep system call.
A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
to return of the syscall.  OTOH, anything CPU-bound or that waits on
condvars or I/O interrupts seems to work fine, so this doesn't seem to
be an issue with overall system latency.

I can repro this with a simple program that just does a 3ms usleep in a
tight loop (i.e. roughly the amount of time a video player would sleep
between frames @ 30fps).  At light load ktrace will show the huge
nanosleep variance; under heavy load every nanosleep will complete in
almost exactly 3ms.

FWIW, I don't see this on -current, although right now all my -current
images are VMs on different HW so that might not mean anything.  I'm not
aware of any recent timer- or scheduler- specific changes, so I'm
wondering if perhaps the recent IPI or taskqueue changes might be
somehow to blame.

I'm not especially familiar w/ the relevant parts of the kernel, so any
guidance on where I should focus my debugging efforts would be much
appreciated.

Thanks,
Jason


