Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread vasvir
> Let's try to speed up reproducing this.
>
> I have a hunch perhaps this might be related to some BIOS controlled
> MTRRs and a mismatch which then enables the kernel to think that a type
> of MTRR write might be OK, but in fact it's not. Due to the workload
> description of this perhaps this could be related to fan control and BIOS
> control on them and against some other device MTRR. More on this suspicion
> on another thread where you provide more logs.
>
> On a kernel that you know fails, can you try replacing this workload by
> making your CPU crawl to its knees quickly, perhaps 'make -j' on a Linux
> build for 2, 4, 8, 16 minutes, and then hit CTRL C to continue to
> hibernation to see if making the CPU fan trigger would accelerate the
> issue. If 'make -j' is too nuts to the point you can't even CTRL C it, try
> 'make -j 16'. Note that if this is true then that means a hot CPU could
> still trigger CPU fan controls on a fresh boot if the previous boot was
> CPU intensive.

OK, that nailed it. With kernel 4.3, a known "bad" kernel, I was able to
reproduce it in the second hibernate/resume cycle. Here is what I did, in
my own words, so you can spot inconsistencies.

I started a kernel compile with make -j 32. My computer stayed very
responsive, which is an impressive feat by the way.
In a second tab in my Konsole (I am running KDE) I ran $watch sensors. I
watched the temperature of the cores go from 38 to ~70 and the CPU fan
go from ~1630 to ~1900. Then, the first time, I hit Ctrl+C to stop the
compilation and hibernated from KDE. I always hibernate from the KDE
start menu. Previously I had made some tests where I was hibernating from
the VT console (although sddm may have been running on VT7) and I had
managed to reproduce it there, so (in my mind) it is not graphics mode
specific. Since then I have always hibernated from KDE.

The first time it worked. The second time I thought: why hit Ctrl+C?
Let's try to hibernate with the compilation running. And it failed. Now I
don't know if it failed because it was the second cycle, because the
compilation load was still there, or because of the temperature
controlled fan register you mentioned.
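
A minimal shell sketch of the cycle described above, for anyone who wants to script it. The `systemctl hibernate` call, the KSRC/JOBS/WARMUP/DRY_RUN variables and the `repro_cycle` name are assumptions for illustration; the test above was driven by hand from Konsole and the KDE start menu.

```shell
#!/bin/sh
# Sketch only: heat the CPU with a parallel kernel build, then hibernate
# while the build is still running (the failing second cycle above).
# KSRC, JOBS, WARMUP and DRY_RUN are hypothetical knobs, not from the thread.

repro_cycle() {
    ksrc="${KSRC:-$HOME/linux}"   # assumed location of the kernel tree
    jobs="${JOBS:-32}"            # parallelism used in the report
    warmup="${WARMUP:-120}"       # seconds to let the cores heat up (a guess)

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only show what would be done.
        echo "make -C $ksrc -j $jobs &"
        echo "sleep $warmup"
        echo "systemctl hibernate"
        return 0
    fi

    make -C "$ksrc" -j "$jobs" >/dev/null 2>&1 &
    sleep "$warmup"
    # Assumes a systemd machine; hibernating from the desktop menu
    # may take a different path.
    systemctl hibernate
}
```

By default repro_cycle only prints the steps. Setting DRY_RUN=0 before calling it would start the build and hibernate with the build still running.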

Then I repeated the test with a known good kernel 3.18 (which should be
773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
I have a problem there - see below) and it survived the same test
(hibernate two times with temperature being ~70).


> If this doesn't do it let's try forcing an MTRR capable driver, say
> graphics is the obvious target, try perhaps some 3D stuff or a screen
> saver prior to hibernation. Note that even if you boot nomtrr the BIOS may
> still use MTRRs, and PAT use on Linux could assume MTRR is not being used
> on drivers but the BIOS may still do something behind the scenes. This is
> actually one reason why we can't exactly remove MTRR support from Linux,
> since the BIOS may still do some wacky stuff with MTRRs. One example of
> such I was given was that CPU fan control might use WC MTRRs, so the
> kernel must be aware of this, even if no MTRRs are ever used on the Linux
> kernel at all -- this is the case now as of v4.3 and onwards.
>
> If that doesn't help speed it up, maybe try both screen saver + some 3D
> stuff + CPU intensive stuff.

I have 3D effects enabled in my KDE. Since your tip succeeded in
reproducing the problem early I didn't bother, but if I should test 3D,
which program / benchmark should I run? glxgears?

>
> To help you speed up testing you can try reducing your build time by
> reducing the amount of crap you have to build:
>
> make localmodconfig
>
> That should only build things your kernel has loaded as modules or is
> already enabled (=y).
>

Thanks for the tip. I don't want to change that right now. I don't mind
waiting a little bit because I get a deb with the kernel and can retest
a known configuration. Your other tip, if it works as well as it looks
like it does, would give a great boost to the debugging cycle, to the
point of making me the bottleneck.

>
> That is commit a023748d53c10850650fe86b1c4a7d421d576451
> ("Merge branch 'x86-mm-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> Git is smart enough to tell you you've hit a merge commit and that all the
> possible commits on that merge could be the issue. This is why your bisect
> log shows a slew of commits. The next step is to bisect through the merge
> and then bisect through that, this will then let us identify the exact
> commit that may have caused the issue.
>
> There are a few ways to do this, my preferred way is to "unfold" a merge
> commit manually.
>
> To help keep things separate (without affecting other tests you might
> have on your other git tree and to avoid forcing you to lose fresh
> objects as you continue to build test on the other tree), I'd do
> something like this:

We will go with your preferred way - no question about that.

>
> mkdir ~/tmp
> git clone

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread vasvir
On 11/20/2015 02:23 PM, Juergen Gross wrote:

>
> As the BIOS obviously isn't disabling MTRR I don't think we have
> to go that route any longer.

ok.

>>
>> Over the weekend I will return to 3.18-rc2 and I will try to verify my
>> bisection is correct. Second-guessing yourself is a terrible thing...
>
> Thanks.
>
>> I will also try with nopat and I will run dmesg | grep -i mtr and post
>> results
>>
>> Unless you have any other suggestions...
>

I hit a very big problem here. I did
$git checkout 773fed910d41e443e495a6bfa9ab1c2b7b13e012
$make (with gcc 4.8 - as in all my tests)

and the resulting kernel is unbootable, hanging at "Loading initial
ramdisk...", the second line of the kernel boot.

That means my bisection is not good, because this revision is marked as good.

So now I am at a loss.

As I said I followed https://wiki.debian.org/DebianKernel/GitBisect

I notice now that the article suggests a step
  $make oldconfig

I did it once at the start of the bisection, answering the default
(Enter) to all config questions.

> I think we have to find out where the kernel is really hanging. Do you
> have any chance to trigger a NMI?

I am googling about it.

>
> Looking into suspend/resume code I found a strange inconsistency for
> the lapic handling:
>
> lapic_suspend()
> {
>     ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
>     if (maxlvt >= 5)
>         apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
> #endif
>     ...
> }
>
> lapic_resume()
> {
>     ...
> #if defined(CONFIG_X86_MCE_INTEL)
>     if (maxlvt >= 5)
>         apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
> #endif
>     ...
> }
>
> and comparing that to:
>
> clear_local_APIC()
> {
>     ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
>     if (maxlvt >= 5) {
>         v = apic_read(APIC_LVTTHMR);
>         apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
>     }
> #endif
> #ifdef CONFIG_X86_MCE_INTEL
>     if (maxlvt >= 6) {
>         v = apic_read(APIC_LVTCMCI);
>         if (!(v & APIC_LVT_MASKED))
>             apic_write(APIC_LVTCMCI, v | APIC_LVT_MASKED);
>     }
> #endif
>     ...
> }
>
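
One hedged way to see whether the mismatched guards quoted above can matter on a given build is to check how both symbols are set in that kernel's .config. The sketch assumes a Debian-style /boot/config-$(uname -r) as the default path, and `lvtthmr_guards` is a hypothetical helper name.

```shell
#!/bin/sh
# Report how the two Kconfig symbols guarding the APIC_LVTTHMR
# save/restore code are set in a kernel .config.
# The default path is an assumption (Debian layout); pass another as $1.

lvtthmr_guards() {
    cfg="${1:-/boot/config-$(uname -r)}"
    for sym in CONFIG_X86_THERMAL_VECTOR CONFIG_X86_MCE_INTEL; do
        # Built-in symbols appear as "SYMBOL=y"; anything else is
        # reported as not set.
        if grep -q "^${sym}=y\$" "$cfg"; then
            echo "${sym}=y"
        else
            echo "${sym} is not set"
        fi
    done
}
```

Reading the quoted code, a config with CONFIG_X86_THERMAL_VECTOR=y but CONFIG_X86_MCE_INTEL unset would save APIC_LVTTHMR on suspend without writing it back on resume; with both =y the two guards behave the same.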

OK, I will send the .config when I get back home. I have all the kernels
I built as .deb archives. The problem is that the Debian kernel build
procedure does not record the git commit hash anywhere in the deb file.
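
A possible workaround, sketched under the assumption that the packages are built with `make deb-pkg` from the git tree: pass the commit hash as LOCALVERSION so it ends up in the kernel release string. The `localversion_from_hash` name, the "-g" prefix and the 12-character length are illustrative choices.

```shell
#!/bin/sh
# Turn a full commit hash into a LOCALVERSION suffix such as -g773fed910d41.
# Helper name, "-g" prefix and 12-character length are illustrative choices.

localversion_from_hash() {
    short=$(printf '%s' "$1" | cut -c1-12)
    printf '%s\n' "-g${short}"
}

# Intended use inside the kernel tree (not run here):
#   make LOCALVERSION="$(localversion_from_hash "$(git rev-parse HEAD)")" deb-pkg
```

The suffix should then show up in `uname -r` on the booted kernel, which makes it easier to match a package to a bisection step.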

For which kernel would you care to see the config? 4.3?

 Vassilis



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread vasvir
> I've just found a potential issue: In case MTRR is disabled by the BIOS
> the PAT register of the boot processor won't be restored after resume.
>
> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> early boot? If yes, this might be a BIOS option.
>

I don't have access right now. I will test it later tonight (This is my
home machine).

Would $dmesg | grep -i mtrr suffice, or do I need to look for the MTRR
somewhere else, e.g. /proc, /sys etc.?
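
A small sketch of one way to collect both in a single pass: filter the kernel log for MTRR/PAT lines and dump /proc/mtrr when it exists. The grep pattern is a guess at useful message prefixes rather than an exhaustive list, and the function names are hypothetical.

```shell
#!/bin/sh
# Keep only kernel log lines that mention MTRR or x86 PAT setup.
# The pattern is an assumption about useful prefixes, not a complete list.
filter_mtrr_pat() {
    grep -i -E 'mtrr|x86[/ ]pat'
}

# Print the filtered kernel log followed by the current MTRR table.
mtrr_report() {
    echo "== kernel log =="
    dmesg | filter_mtrr_pat || echo "(no MTRR/PAT lines found)"
    echo "== /proc/mtrr =="
    if [ -r /proc/mtrr ]; then
        cat /proc/mtrr
    else
        echo "(/proc/mtrr not available)"
    fi
}
```

Running mtrr_report right after boot, before the log buffer wraps, gives the best chance of catching a line such as "MTRR: Disabled".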




Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-18 Thread vasvir
Hi,

Thanks for the quick answer.

>
> Could you please try the most recent 4.3 kernel? There has been some
> work related to this topic after 4.2 (large page pat handling done by
> Toshi Kani and mtrr/pat handling by Luis Rodriguez).

That means I will reset the bisection, right? Is there any other info we
can extract from there?

So do you want me to test 4.3 or 4.4-pre/rc*/latest Linus tree? I assume
4.3 for now.

I will do it later tonight. It will take at least 2 days to report back.

>
> Another interesting information would be the exact hardware you are
> using. Maybe we can see some similarities between yours and the other
> two cases you referenced above.
>

It is an i7
Motherboard: ASROCK H97 PRO4 RETAIL
CPU INTEL CORE I7-4790 3.60GHZ LGA1150 - BOX
It has 16GB of RAM, one SSD and one HDD
I have NO external graphics card

Do you want me to run something on this, like lspci or lsusb?

I upgraded the BIOS of the motherboard to the latest. This is not the
problem though, because I upgraded after the problem occurred, as a
countermeasure in case I was hit by a buggy BIOS and Linux had changed its
behavior to be stricter.

I experimented with ACPI compilers/decompilers and I was tempted to fix my
ACPI tables but I didn't.

I saw the kernel command line option acpi_osi="!Windows 2013" but I didn't
try it. Do you think I should try it?

> Wow! Thanks for doing this work!
>

I would like this to be fixed so I am willing to do the testing.

   Vassilis
