Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-25 Thread Luis R. Rodriguez

On Wed, Nov 25, 2015 at 06:01:20AM +0100, Juergen Gross wrote:
> On 24/11/15 23:46, Luis R. Rodriguez wrote:
> > On Mon, Nov 23, 2015 at 03:19:16PM +0100, Juergen Gross wrote:
> >> On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
> >>> Ok I will send the .config when I get back home. I have all kernels I
> >>> build in .deb archive. The problem is that the debian kernel build
> >>> procedure does not hold somewhere in the deb file the git commit hash.
> >>>
> >>> For which kernel would you care to see the config? 4.3?
> >>
> >> Doesn't really matter anymore. I've posted a patch already to fix it and
> >> got the reply, that the fix is okay, but no harm can come from the
> >> current implementation, as the two config options are always either both
> >> set or reset.
> > 
> > > Hrm, Vassilis seems to be able to reproduce this more effectively by
> > > heating up his CPU prior to hibernation though. I have no idea what
> > > adding APIC_LVT_MASKED ((1 << 16)) to the Local Vector Table (LVT)
> > > Thermal Monitor (APIC_LVTTHMR 0x330) does but clear_local_APIC() seems
> > > to be used to "cleanout any BIOS leftovers during boot." If we're
> > > suspending but the fan is still on I wonder if this could cause an issue
> > > with some settings the BIOS may have set prior to hibernation, and a
> > > mismatch upon resume.
> > 
> > I can't find what APIC_LVT_MASKED does though, the best doc I found:
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf
> 
> Local APIC (chapter 10.4).

Thanks, yeah I only see the same thing you spotted and fixed [0] but also
agree it does not play a role with this issue. Although it is completely
undocumented, APIC_LVT_MASKED just masks the thermal interrupts while we
go down, and we just restore the original value of the thermal register
when we come up. The only other thing about the thermal register that
seemed to call for caution was specific to 32-bit x86.
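
To make the masking concrete, here is a minimal Python sketch of the bit
arithmetic only. APIC_LVT_MASKED (1 << 16) and the APIC_LVTTHMR offset 0x330
are the values quoted in this thread; the register contents, the helper names
and the save/restore variables are made up for illustration, and this is not
the kernel's code.

```python
# Bit arithmetic only: the register is modelled as a plain integer and no
# real hardware is touched. The two constants are the ones quoted in this
# thread; everything else is hypothetical.
APIC_LVTTHMR = 0x330          # offset of the LVT Thermal Monitor register
APIC_LVT_MASKED = 1 << 16     # mask bit in an LVT entry


def mask_lvt(value: int) -> int:
    """Return the LVT value with the mask bit set (interrupt suppressed)."""
    return value | APIC_LVT_MASKED


def is_masked(value: int) -> bool:
    """True if the mask bit is set in the LVT value."""
    return bool(value & APIC_LVT_MASKED)


# Hypothetical register contents before going down: some vector, unmasked.
original = 0x000000FA

# Going down: remember the original value, then mask the interrupt.
saved = original
going_down = mask_lvt(original)

# Coming up: write the saved value back.
restored = saved

print(hex(going_down), is_masked(going_down))
print(hex(restored), is_masked(restored))
```

The point of the sketch is only that setting bit 16 leaves the rest of the
entry untouched, so writing the saved value back on the way up brings the
register to its earlier state.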

Let's see what the bisect ends up with.

[0] https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=42baa2581c92f8d07e7260506c8d41caf14b0fc3

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Juergen Gross

On 24/11/15 23:46, Luis R. Rodriguez wrote:
> On Mon, Nov 23, 2015 at 03:19:16PM +0100, Juergen Gross wrote:
>> On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
>>> Ok I will send the .config when I get back home. I have all kernels I
>>> build in .deb archive. The problem is that the debian kernel build
>>> procedure does not hold somewhere in the deb file the git commit hash.
>>>
>>> For which kernel would you care to see the config? 4.3?
>>
>> Doesn't really matter anymore. I've posted a patch already to fix it and
>> got the reply, that the fix is okay, but no harm can come from the
>> current implementation, as the two config options are always either both
>> set or reset.
> 
> Hrm, Vassilis seems to be able to reproduce this more effectively by
> heating up his CPU prior to hibernation though. I have no idea what
> adding APIC_LVT_MASKED ((1 << 16)) to the Local Vector Table (LVT)
> Thermal Monitor (APIC_LVTTHMR 0x330) does but clear_local_APIC() seems
> to be used to "cleanout any BIOS leftovers during boot." If we're
> suspending but the fan is still on I wonder if this could cause an issue
> with some settings the BIOS may have set prior to hibernation, and a
> mismatch upon resume.
> 
> I can't find what APIC_LVT_MASKED does though, the best doc I found:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

Local APIC (chapter 10.4).


Juergen

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Mon, Nov 23, 2015 at 03:19:16PM +0100, Juergen Gross wrote:
> On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
> > Ok I will send the .config when I get back home. I have all kernels I
> > build in .deb archive. The problem is that the debian kernel build
> > procedure does not hold somewhere in the deb file the git commit hash.
> > 
> > For which kernel would you care to see the config? 4.3?
> 
> Doesn't really matter anymore. I've posted a patch already to fix it and
> got the reply, that the fix is okay, but no harm can come from the
> current implementation, as the two config options are always either both
> set or reset.

Hrm, Vassilis seems to be able to reproduce this more effectively by heating up
his CPU prior to hibernation though. I have no idea what adding APIC_LVT_MASKED
((1 << 16)) to the Local Vector Table (LVT) Thermal Monitor (APIC_LVTTHMR
0x330) does but clear_local_APIC() seems to be used to "cleanout any BIOS
leftovers during boot." If we're suspending but the fan is still on I wonder
if this could cause an issue with some settings the BIOS may have set prior to
hibernation, and a mismatch upon resume.

I can't find what APIC_LVT_MASKED does though, the best doc I found:

https://www-ssl.intel.com/content/dam/www/public/us/en/documents/white-papers/cpu-monitoring-dts-peci-paper.pdf

The inability to set the MTRR for the i915 card might be a totally separate
issue at this point, not sure. One could test that I suppose by just
using the vesa graphics driver (disabling i915) to at least get
a basic screen to see things and compile/test things.

  Luis

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Tue, Nov 24, 2015 at 01:01:31AM +0200, Vassilis Virvilis wrote:
> On 11/23/2015 08:56 PM, Luis R. Rodriguez wrote:
> >It's not clear from the log who made the MTRR call for WC that failed, I
> >hope we didn't attempt a WC write on a WB region. Who owns
> >e000-efff ?
> 
> How can I answer that? Is there any utility to run? peek inside /proc?
> 
> [0.221012] pci :00:02.0: [8086:0412] type 00 class 0x03
> [0.221021] pci :00:02.0: reg 0x10: [mem 0xf780-0xf7bf 64bit]
> [0.221025] pci :00:02.0: reg 0x18: [mem 0xe000-0xefff 64bit pref]
> [0.221028] pci :00:02.0: reg 0x20: [io  0xf000-0xf03f]

...

> [0.453783] calling  sysfb_init+0x0/0x96 @ 1
> [0.453811] simple-framebuffer simple-framebuffer.0: framebuffer at 0xe000, 0x6bb000 bytes, mapped to 0xc9000200
> [0.453814] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, mode=1680x1050x32, linelength=6720
> [0.557233] Console: switching to colour frame buffer device 210x65
> [0.660632] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
> [0.661262] initcall sysfb_init+0x0/0x96 returned 0 after 202686 usecs

...

> [9.745108] calling  i915_init+0x0/0xa2 [i915] @ 403
> [9.745542] [drm] Memory usable by graphics device = 2048M
> [9.745544] checking generic (e000 6bb000) vs hw (e000 1000)
> [9.745544] fb: switching to inteldrmfb from simple

...

> [9.943166] Console: switching to colour dummy device 80x25
> [9.943240] [drm] Replacing VGA console driver
> [9.943520] mtrr: type mismatch for e000,1000 old: write-back new: write-combining
> [9.943526] Failed to add WC MTRR for [e000-efff]; performance may suffer.
> [9.949724] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [9.949728] [drm] Driver supports precise vblank timestamp query.
> [9.949801] vgaarb: device changed decodes: PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem

...

> $lspci | grep 00:02.0
> 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
>
> Looks like it is the graphics card or the graphics driver.

Good job yes.

> I don't know if this is relevant
> $ cat /proc/mtrr
> reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
> reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
> reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable

Right, so it tried to set this to WC but failed. When using PAT, MTRR is
not used; PAT is used instead, and your log showed no error.
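
As a rough aid for answering "who covers this range?", here is a small Python
sketch that parses /proc/mtrr-style lines and lists the entries overlapping a
given window. It only uses the megabyte figures, because those are intact in
the listing quoted here; the regular expression is written for the lines as
they appear in this thread, the helper names are made up, and the 256MB window
size is an assumption for illustration rather than a value taken from the log.

```python
import re

# Matches lines such as:
#   reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable
# Only the megabyte figures and the type are captured.
LINE_RE = re.compile(
    r"reg(\d+): base=\S+ \(\s*(\d+)MB\), size=\s*(\d+)MB, count=\d+: (\S+)"
)


def parse_mtrr(text):
    """Return a list of (reg, start_mb, size_mb, type) tuples."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            reg, start, size, mtype = m.groups()
            entries.append((int(reg), int(start), int(size), mtype))
    return entries


def overlapping(entries, start_mb, size_mb):
    """Entries whose range intersects [start_mb, start_mb + size_mb)."""
    end_mb = start_mb + size_mb
    return [e for e in entries if e[1] < end_mb and start_mb < e[1] + e[2]]


# First three entries of the listing quoted in this message.
sample = """\
reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable
"""

# Assumed for illustration: a 256MB window starting at 3584MB.
for reg, start, size, mtype in overlapping(parse_mtrr(sample), 3584, 256):
    print(f"reg{reg:02d}: {start}MB +{size}MB {mtype}")
```

On a live system the same functions could be fed the contents of /proc/mtrr,
provided the lines there match the pattern above.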

> reg03: base=0x0d000 ( 3328MB), size=  256MB, count=1: uncachable
> reg04: base=0x0cf00 ( 3312MB), size=   16MB, count=1: uncachable
> reg05: base=0x41f00 (16880MB), size=   16MB, count=1: uncachable
> reg06: base=0x41ee0 (16878MB), size=2MB, count=1: uncachable
> 
> >
> >What does your log show right before and after this? To find out try:
> >
> >dmesg | grep -5 -i mtrr
> >
> 
> See full dmesg attached
> 
> $dmesg | grep -5 -i mtrr
> [0.189333] initcall arch_kdebugfs_init+0x0/0x1f returned 0 after 0 usecs
> [0.189336] calling  pt_init+0x0/0x2a4 @ 1
> [0.189349] initcall pt_init+0x0/0x2a4 returned -19 after 0 usecs
> [0.189352] calling  bts_init+0x0/0xa4 @ 1
> [0.189354] initcall bts_init+0x0/0xa4 returned 0 after 0 usecs
> [0.189357] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189360] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189362] calling  ffh_cstate_init+0x0/0x26 @ 1
> [0.189363] initcall ffh_cstate_init+0x0/0x26 returned 0 after 0 usecs
> [0.189366] calling  activate_jump_labels+0x0/0x2d @ 1
> [0.189367] initcall activate_jump_labels+0x0/0x2d returned 0 after 0 usecs
> [0.189370] calling  kcmp_cookies_init+0x0/0x31 @ 1
> --
> [0.189424] calling  dmi_id_init+0x0/0x300 @ 1
> [0.189448] initcall dmi_id_init+0x0/0x300 returned 0 after 0 usecs
> [0.189450] calling  pci_arch_init+0x0/0x63 @ 1
> [0.189458] PCI: MMCONFIG for domain  [bus 00-3f] at [mem 0xf800-0xfbff] (base 0xf800)
> [0.189462] PCI: MMCONFIG at [mem 0xf800-0xfbff] reserved in E820
> [0.189467] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a huge-page mapping due to MTRR override.
> [0.189514] PCI: Using configuration type 1 for base access
> [0.189519] initcall pci_arch_init+0x0/0x63 returned 0 after 0 usecs
> [0.189528] calling  init_vdso+0x0/0x44 @ 1
> [0.189535] initcall init_vdso+0x0/0x44 returned 0 after 0 usecs
> [0.189538] calling  sysenter_setup+0x0/0x52 @ 1
> --
> [0.189542] calling  topology_init+0x0/0x83 @ 1
> [0.189795] initcall topology_init+0x0/0x83 returned 0 after 0 usecs
> [0.189798] calling  fixup_ht_bug+0x0/0xed @ 1
> [0.189799] perf_event_intel: PMU erratum BJ122, BV98,

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Tue, Nov 24, 2015 at 11:36:54AM +0200, vas...@iit.demokritos.gr wrote:
> > Let's try to speed up reproducing this.
> >
> > I have a hunch perhaps this might be related to some BIOS controlled
> > MTRRs and a mismatch which then enables the kernel to think that a type
> > of MTRR write might be OK, but in fact it's not. Due to the work load
> > description of this perhaps this could be related to fan control and BIOS
> > control on them and against some other device MTRR. More on this suspicion
> > on another thread where you provide more logs.
> >
> > On a kernel that you know fails can you try replacing this work load by
> > making your CPU crawl to its knees quickly, perhaps 'make -j' on Linux
> > building for 2, 4, 8, 16 minutes and then hit CTRL C to continue to
> > hibernation to see if making the CPU fan trigger would accelerate the
> > issue.  If 'make -j' is too nuts to the point you can't even CTRL C it,
> > try 'make -j 16'. Note that if this is true then that means a hot CPU
> > could still trigger CPU fan controls on a fresh boot if the previous boot
> > was CPU intensive.
> 
> OK that nailed it - with kernel 4.3 a known "bad" kernel I was able to
> reproduce it in the second hibernate/resume cycle.

Great, glad we could reduce the amount of time to reproduce to what seems
to be a few minutes now.

> Here is what I did in my own words so you can spot inconsistencies.
> 
> I started a kernel compile with make -j 32. My computer was very
> responsive which is an impressive feat by the way.
> In a second tab in my Konsole (I am running KDE) I ran $watch sensors. I
> watched the temperature of the cores go from 38 to ~70 and the cpu fan
> from ~1630 to ~1900. Then the first time I hit Ctrl+C - stopped the
> compilation and hibernated from KDE. I always hibernate from the KDE
> start menu. Previously I had made some tests where I was hibernating from
> the VT console (although sddm may have been running in VT7) and I managed
> to reproduce it - so (in my mind) it was not graphics mode specific. From
> that point on I am always hibernating from KDE.

Come to think of it, the mtrr_add() and/or ioremap_wc() calls would be
triggered on driver initialization, that is at probe / boot time. So if the
issue you are running into is a clash between the BIOS's own notion of what
is set for an MTRR type and the MTRR type (or equivalent PAT type) another
driver later asks for, then the issue could be triggered with just boot /
hibernation / resume, without much interaction, at least on the graphics
front.

> The first time it worked. For the second time I thought - why hit
> Ctrl+C, let's try to hibernate with the compilation running - and it
> failed.

OK. How long did you leave the machine on idle before resuming?

Can you try on a fresh boot to bring the temperature up to ~70 and, while it's
still compiling, hibernate and see if that triggers it? If we can reduce it to
only one hibernate that should reduce the time to troubleshoot. It is also just
puzzling that you'd need to hibernate twice to reproduce this issue.

> Now I don't know if it failed because it was the second cycle or
> because the load of the compilation was there or because of the
> temperature controlled fan register you mentioned.

If it's fan related one test could be to hibernate on a fresh boot once fan
control is on, let it sit to cool, and then resume, vs. just resuming right
away. I.e.: determine if we need fan control to be idle upon resume or not,
and also how many times fan control has to go on / off before you can
reproduce.

> Then I repeated the test with a known good kernel 3.18 (which should be
> 773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
> I have a problem there - see below) and it survived the same test
> (hibernate two times with temperature being ~70).
> 
> 
> > If this doesn't do it let's try forcing an MTRR capable driver, say
> > graphics is the obvious target, try perhaps some 3D stuff or a screen
> > saver prior to hibernation. Note that even if you boot nomtrr the BIOS may
> > still use MTRRs, and PAT use on Linux could assume MTRR is not being used
> > on drivers but the BIOS may still do something behind the scenes. This is
> > actually one reason why we can't exactly remove MTRR support from Linux,
> > since the BIOS may still do some wacky stuff with MTRRs, one example of
> > such I was given was CPU fan control might use WC MTRRs, so the kernel
> > must be aware of this, even if no MTRRs are ever used on the Linux kernel
> > at all -- this is the case now as of v4.3 and onwards.
> >
> > If that doesn't help speed it up, maybe try both screen saver + some 3D
> > stuff + cpu intensive stuff.
> 
> I have 3D effects enabled in my KDE. Since your tip succeeded in reproducing
> the problem early I didn't bother, but if I should test 3D which program /
> benchmark should I run? glxgears?

As I mentioned above I can't think now of a reason why

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread vasvir

> Let's try to speed up reproducing this.
>
> I have a hunch perhaps this might be related to some BIOS controlled
> MTRRs and a mismatch which then enables the kernel to think that a type
> of MTRR write might be OK, but in fact it's not. Due to the work load
> description of this perhaps this could be related to fan control and BIOS
> control on them and against some other device MTRR. More on this suspicion
> on another thread where you provide more logs.
>
> On a kernel that you know fails can you try replacing this work load by
> making your CPU crawl to its knees quickly, perhaps 'make -j' on Linux
> building for 2, 4, 8, 16 minutes and then hit CTRL C to continue to
> hibernation to see if making the CPU fan trigger would accelerate the
> issue.  If 'make -j' is too nuts to the point you can't even CTRL C it,
> try 'make -j 16'. Note that if this is true then that means a hot CPU
> could still trigger CPU fan controls on a fresh boot if the previous boot
> was CPU intensive.

OK that nailed it - with kernel 4.3 a known "bad" kernel I was able to
reproduce it in the second hibernate/resume cycle. Here is what I did in
my own words so you can spot inconsistencies.

I started a kernel compile with make -j 32. My computer was very
responsive which is an impressive feat by the way.
In a second tab in my Konsole (I am running KDE) I ran $watch sensors. I
watched the temperature of the cores go from 38 to ~70 and the cpu fan
from ~1630 to ~1900. Then the first time I hit Ctrl+C - stopped the
compilation and hibernated from KDE. I always hibernate from the KDE
start menu. Previously I had made some tests where I was hibernating from
the VT console (although sddm may have been running in VT7) and I managed
to reproduce it - so (in my mind) it was not graphics mode specific. From
that point on I am always hibernating from KDE.

The first time it worked. For the second time I thought - why hit
Ctrl+C, let's try to hibernate with the compilation running - and it
failed. Now I don't know if it failed because it was the second cycle or
because the load of the compilation was there or because of the
temperature controlled fan register you mentioned.

Then I repeated the test with a known good kernel 3.18 (which should be
773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
I have a problem there - see below) and it survived the same test
(hibernate two times with temperature being ~70).


> If this doesn't do it let's try forcing an MTRR capable driver, say
> graphics is the obvious target, try perhaps some 3D stuff or a screen
> saver prior to hibernation. Note that even if you boot nomtrr the BIOS may
> still use MTRRs, and PAT use on Linux could assume MTRR is not being used
> on drivers but the BIOS may still do something behind the scenes. This is
> actually one reason why we can't exactly remove MTRR support from Linux,
> since the BIOS may still do some wacky stuff with MTRRs, one example of
> such I was given was CPU fan control might use WC MTRRs, so the kernel
> must be aware of this, even if no MTRRs are ever used on the Linux kernel
> at all -- this is the case now as of v4.3 and onwards.
>
> If that doesn't help speed it up, maybe try both screen saver + some 3D
> stuff + cpu intensive stuff.

I have 3D effects enabled in my KDE. Since your tip succeeded in reproducing
the problem early I didn't bother, but if I should test 3D which program /
benchmark should I run? glxgears?

>
> To help you speed up testing you can try reducing your build time by
> reducing the amount of crap you have to build:
>
> make localmodconfig
>
> That should only build things your kernel has loaded as modules or is
> already enabled (=y).
>

Thanks for the tip. I don't want to change that right now. I don't mind
waiting a little bit because I get a deb with the kernel and can retest
a known configuration. The other tip you gave, if it actually works the way
it seems to, would speed up the debugging cycle so much that I become the
bottleneck.

>
> That is commit a023748d53c10850650fe86b1c4a7d421d576451
> ("Merge branch 'x86-mm-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> Git is smart enough to tell you you've hit a merge commit and that all the
> possible commits on that merge could be the issue. This is why your bisect
> log shows a slew of commits. The next step is to bisect through the merge
> and then bisect through that, this will then let us identify the exact
> commit that may have caused the issue.
>
> There are a few ways to do this, my preferred way is to "unfold" a merge
> commit manually.
>
> To help keep things separate (without affecting other tests you might
> have on your other git tree and to avoid having to force you to lose
> fresh objects as you continue to build test on the other tree), I'd do
> something like this:

we will go with your preferred way - no question about that.
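
For my own understanding of why narrowing down inside the merge should only
take a handful of builds, here is a simplified Python sketch of bisection over
a linear list of commits. It is not git's actual algorithm (real history is a
graph, which is why the merge has to be unfolded first); the commit names and
the is_bad test are hypothetical stand-ins for "build, boot, hibernate, resume
and see if the bug shows up".

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit in a linear history.

    `commits` is ordered oldest to newest, commits[0] is known good and
    commits[-1] is known bad. `is_bad` stands in for a manual test cycle.
    Returns the first bad commit and the number of tests that were run.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return commits[bad], tests


# Hypothetical history of 16 commits; pretend the regression came in with c11.
history = [f"c{i}" for i in range(16)]
culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 11)
print(culprit, tests)
```

Each test halves the remaining range, so the number of hibernate/resume cycles
grows only with the logarithm of the number of commits in the merge.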

>
> mkdir ~/tmp
> git

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Tue, Nov 24, 2015 at 01:01:31AM +0200, Vassilis Virvilis wrote:
> On 11/23/2015 08:56 PM, Luis R. Rodriguez wrote:
> >Its not clear from the log who called this MTRR call for WC that failed, I
> >hope we didn't attempt a WC wright on a WB region. Who owns
> >e000-efff ?
> 
> How can I answer that? Is there any utility to run? peek inside /proc?
> 
> [0.221012] pci :00:02.0: [8086:0412] type 00 class 0x03
> [0.221021] pci :00:02.0: reg 0x10: [mem 0xf780-0xf7bf 64bit]
> [0.221025] pci :00:02.0: reg 0x18: [mem 0xe000-0xefff 64bit 
> pref]
> [0.221028] pci :00:02.0: reg 0x20: [io  0xf000-0xf03f]

...

> [0.453783] calling  sysfb_init+0x0/0x96 @ 1
> [0.453811] simple-framebuffer simple-framebuffer.0: framebuffer at 
> 0xe000, 0x6bb000 bytes, mapped to 0xc9000200
> [0.453814] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, 
> mode=1680x1050x32, linelength=6720
> [0.557233] Console: switching to colour frame buffer device 210x65
> [0.660632] simple-framebuffer simple-framebuffer.0: fb0: simplefb 
> registered!
> [0.661262] initcall sysfb_init+0x0/0x96 returned 0 after 202686 usecs

...

> [9.745108] calling  i915_init+0x0/0xa2 [i915] @ 403
> [9.745542] [drm] Memory usable by graphics device = 2048M
> [9.745544] checking generic (e000 6bb000) vs hw (e000 1000)
> [9.745544] fb: switching to inteldrmfb from simple

...

> [9.943166] Console: switching to colour dummy device 80x25
> [9.943240] [drm] Replacing VGA console driver
> [9.943520] mtrr: type mismatch for e000,1000 old: write-back new: 
> write-combining
> [9.943526] Failed to add WC MTRR for [e000-efff]; 
> performance may suffer.
> [9.949724] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [9.949728] [drm] Driver supports precise vblank timestamp query.
> [9.949801] vgaarb: device changed decodes: 
> PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem

...

> $lspci | grep 00:02.0
> 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen 
> Core Processor Integrated Graphics Controller (rev 06)
>
> Looks like it is the graphics card or the graphics driver.

Good job yes.

> I don't know if this is relevant
> $ cat /proc/mtrr
> reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
> reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
> reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable

Right so it tried to set this to WC but failed, and when using PAT
MTRR is not used instead PAT is used and your log showed no error.

> reg03: base=0x0d000 ( 3328MB), size=  256MB, count=1: uncachable
> reg04: base=0x0cf00 ( 3312MB), size=   16MB, count=1: uncachable
> reg05: base=0x41f00 (16880MB), size=   16MB, count=1: uncachable
> reg06: base=0x41ee0 (16878MB), size=2MB, count=1: uncachable
> 
> >
> >What does your log show right before and after this? To find out try:
> >
> >dmesg | grep -5 -i mtrr
> >
> 
> See full dmesg attached
> 
> $dmesg | grep -5 -i mtrr
> [0.189333] initcall arch_kdebugfs_init+0x0/0x1f returned 0 after 0 usecs
> [0.189336] calling  pt_init+0x0/0x2a4 @ 1
> [0.189349] initcall pt_init+0x0/0x2a4 returned -19 after 0 usecs
> [0.189352] calling  bts_init+0x0/0xa4 @ 1
> [0.189354] initcall bts_init+0x0/0xa4 returned 0 after 0 usecs
> [0.189357] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189360] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189362] calling  ffh_cstate_init+0x0/0x26 @ 1
> [0.189363] initcall ffh_cstate_init+0x0/0x26 returned 0 after 0 usecs
> [0.189366] calling  activate_jump_labels+0x0/0x2d @ 1
> [0.189367] initcall activate_jump_labels+0x0/0x2d returned 0 after 0 usecs
> [0.189370] calling  kcmp_cookies_init+0x0/0x31 @ 1
> --
> [0.189424] calling  dmi_id_init+0x0/0x300 @ 1
> [0.189448] initcall dmi_id_init+0x0/0x300 returned 0 after 0 usecs
> [0.189450] calling  pci_arch_init+0x0/0x63 @ 1
> [0.189458] PCI: MMCONFIG for domain  [bus 00-3f] at [mem 
> 0xf800-0xfbff] (base 0xf800)
> [0.189462] PCI: MMCONFIG at [mem 0xf800-0xfbff] reserved in E820
> [0.189467] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with 
> a huge-page mapping due to MTRR override.
> [0.189514] PCI: Using configuration type 1 for base access
> [0.189519] initcall pci_arch_init+0x0/0x63 returned 0 after 0 usecs
> [0.189528] calling  init_vdso+0x0/0x44 @ 1
> [0.189535] initcall init_vdso+0x0/0x44 returned 0 after 0 usecs
> [0.189538] calling  sysenter_setup+0x0/0x52 @ 1
> --
> [0.189542] calling  topology_init+0x0/0x83 @ 1
> [0.189795] initcall topology_init+0x0/0x83 returned 0 after 0 usecs
> [0.189798] calling  fixup_ht_bug+0x0/0xed @ 1
> [0.189799] perf_event_intel: PMU erratum BJ122, BV98,

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Mon, Nov 23, 2015 at 03:19:16PM +0100, Juergen Gross wrote:
> On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
> > Ok I will send the .config when I get back home. I have all kernels I
> > build in .deb archive. The problem is that the debian kernel build
> > procedure does not hold somewhere in the deb file the git commit hash.
> > 
> > Fow which kernel would you care to see the config? 4.3?
> 
> Doesn't really matter anymore. I've posted a patch already to fix it and
> got the reply, that the fix is okay, but no harm can come from the
> current implementation, as the two config options are always either both
> set or reset.

Hrm, Vassilis seems to be able to reproduce this more effectively by heating up
his CPU prior to hibernation though. I have no idea what adding APIC_LVT_MASKED
((1 << 16)) to the Local Vector Table (LVT) Thermal Monitor (APIC_LVTTHMR 
0x330) does but
clear_local_APIC() seems to be used to "cleanout any BIOS leftovers during
boot." If we're suspending but the fan is still on I wonder if this could cause
an issue with some settings the BIOS may have set prior to hibernation, and
a mismatch upon resume.

I can't find what APIC_LVT_MASKED does though, the best doc I found:

https://www-ssl.intel.com/content/dam/www/public/us/en/documents/white-papers/cpu-monitoring-dts-peci-paper.pdf

The inability to set the MTRR for the i915 card might be totally separate
issue at this point, not sure. One could test that I suppose by just
using vesa graphics card driver (disabling i915) to at least get
a basic screen to see things and compile/test things.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Luis R. Rodriguez

On Tue, Nov 24, 2015 at 11:36:54AM +0200, vas...@iit.demokritos.gr wrote:
> > Let's try to speed up reproducing this.
> >
> > I have a hunch perhaps this might be related to some BIOS controlled
> > MTRRs and a mismatch which then enables the kernel to think that a type
> > of MTRR write might be OK, but in fact its not. Due to the work load
> > description of this perhaps this could be related to fan control and BIOS
> > control on them and against some other device MTRR. More on this suspicion
> > on another thread where you provide more logs.
> >
> > On a kernel that you know fails can you try replacing this work load by
> > making
> > you CPU crawl to its knees quickly, perhaps 'make -j' on Linux building
> > for 2,
> > 4, 8, 16, minutes and then hit CTRL C to continue to hibernation to see if
> > making the CPU fan trigger would accelerate the issue.  If 'make -j' is
> > too nuts
> > to the point you can't even CTRL C it, try 'make -j 16' . Note that if
> > this is
> > true then that means a hot CPU could still trigger CPU fan controls on on
> > a
> > fresh boot if the previous boot was CPU intensive.
> 
> OK that nailed it - with kernel 4.3 a known "bad" kernel I was able to
> reproduce it in the second hibernate/resume cycle.

Great, glad we could reduce the amount of time to reproduce to what seems
to be a few minutes now.

> Here is what I did in my own words so you can spot inconsistencies.
> 
> I started a kernel compile with make -j 32. My computer was very
> responsive which is an impressive feat by the way.
> In a second tab in my Konsole (I am running KDE) I ran $watch sensors. I
> watched the temperature of the cores go from 38 to ~70 and the cpu fan
> from ~1630 to ~1900. Then the first time I hit Ctrl+C - stopped the
> compilation and hibernated from KDE. I always hibernate from the KDE
> start menu. Previously I had made some tests where I was hibernating from
> the VT console (although sddm may have been running in VT7) and I have
> managed to reproduce it - so (in my mind) it was not graphics mode
> specific. From that point I am always hibernating from KDE.

Come to think of it, the mtrr_add() and/or ioremap_wc() calls would be
triggered on driver initialization, that is at probe / boot time. So if the
issue you are running into is a clash between the BIOS's own notion of what
is set for an MTRR type and another driver's later desired MTRR type
(or equivalent PAT type), then the issue could be triggered just with
boot / hibernation / resume, without much interaction, at least
on the graphics front.

> The first time it worked. For the second time I thought - why hit
> Ctrl+C, let's try to hibernate with the compilation running - and it
> failed.

OK. How long did you leave the machine idle before resuming?

Can you try on a fresh boot to bring the temperature up to ~70 and, while it's
still compiling, hibernate and see if that triggers it? If we can reduce it to
only one hibernate that should reduce the time to troubleshoot; it is also just
puzzling that you'd need to hibernate twice to reproduce this issue.
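
A rough sketch of how that test could be scripted, in case it saves some
typing. It assumes lm-sensors and systemd are installed and that it is run
from the top of a configured kernel tree; `systemctl hibernate` stands in for
the KDE menu entry, and the names repro_cycle and DRY_RUN are made up for
illustration. With DRY_RUN=1 (the default here) it only prints the steps:

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: heat the CPU with a kernel
# build and hibernate while the build is still running.
# DRY_RUN=1 (default) only prints the steps; DRY_RUN=0 really runs them.
DRY_RUN="${DRY_RUN:-1}"

# repro_cycle JOBS MINUTES
repro_cycle() {
    njobs="$1"
    seconds=$(( $2 * 60 ))

    if [ "$DRY_RUN" = 1 ]; then
        echo "make -j $njobs &"
        echo "sleep $seconds"
        echo "sensors"
        echo "systemctl hibernate"
        return 0
    fi

    make -j "$njobs" >/dev/null 2>&1 &   # load generator, left running
    sleep "$seconds"                     # give the fan time to spin up
    sensors                              # record temperature and fan speed
    systemctl hibernate                  # hibernate with the build running
}
```

For example, `DRY_RUN=0 repro_cycle 32 4` on a fresh boot would build with 32
jobs for four minutes and then hibernate. The build is left running, so it
has to be stopped by hand after resume.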

> Now I don't know if it failed because it was the second cycle or
> because the load of the compilation was there or because of the
> temperature controlled fan register you mentioned.

If it's fan related, one test could be to hibernate on a fresh boot once fan
control is on, let it sit to cool, and then resume, versus just resuming right
away. I.e.: determine if we need fan control to be idle upon resume or not,
and also how many times fan control has to go on / off before you can
reproduce.

> Then I repeated the test with a known good kernel 3.18 (which should be
> 773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
> I have a problem there - see below) and it survived the same test
> (hibernate two times with temperature being ~70).
> 
> 
> > If this doesn't do it let's try forcing an MTRR capable driver, say
> > graphics is the obvious target, try perhaps some 3D stuff or a screen
> > saver prior to hibernation. Note that even if you boot nomtrr the BIOS
> > may still use MTRRs, and PAT use on Linux could assume MTRR is not being
> > used on drivers but the BIOS may still do something behind the scenes.
> > This is actually one reason why we can't exactly remove MTRR support
> > from Linux, since the BIOS may still do some wacky stuff with MTRRs, one
> > example of such I was given was CPU fan control might use WC MTRRs, so
> > the kernel must be aware of this, even if no MTRRs are ever used on the
> > Linux kernel at all -- this is the case now as of v4.3 and onwards.
> >
> > If that doesn't help speed it up, maybe try both screen saver + some 3D
> > stuff + cpu intensive stuff.
> 
> I have 3D effects enabled in my KDE. Since your tip succeeded in
> reproducing the problem early I didn't bother, but if I should test 3D,
> which program / benchmark should I run? glxgears?

As I mentioned above I can't think now of a reason why

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread Juergen Gross

On 24/11/15 23:46, Luis R. Rodriguez wrote:
> On Mon, Nov 23, 2015 at 03:19:16PM +0100, Juergen Gross wrote:
>> On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
>>> Ok I will send the .config when I get back home. I have all kernels I
>>> build in .deb archive. The problem is that the debian kernel build
>>> procedure does not hold somewhere in the deb file the git commit hash.
>>>
>>> For which kernel would you care to see the config? 4.3?
>>
>> Doesn't really matter anymore. I've posted a patch already to fix it and
>> got the reply, that the fix is okay, but no harm can come from the
>> current implementation, as the two config options are always either both
>> set or reset.
> 
> Hrm, Vassilis seems to be able to reproduce this more effectively by
> heating up his CPU prior to hibernation though. I have no idea what adding
> APIC_LVT_MASKED ((1 << 16)) to the Local Vector Table (LVT) Thermal Monitor
> (APIC_LVTTHMR 0x330) does but clear_local_APIC() seems to be used to
> "cleanout any BIOS leftovers during boot." If we're suspending but the fan
> is still on I wonder if this could cause an issue with some settings the
> BIOS may have set prior to hibernation, and a mismatch upon resume.
> 
> I can't find what APIC_LVT_MASKED does though, the best doc I found:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

Local APIC (chapter 10.4).
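
For what it's worth, the arithmetic is simple: bit 16 of an LVT entry is its
mask bit, so OR-ing APIC_LVT_MASKED into the value read from APIC_LVTTHMR
keeps the vector but inhibits delivery of the thermal interrupt. A small
shell illustration of just the bit operation (the helper names and the
example value 0xfa are made up; the real code is the C in
clear_local_APIC()):

```shell
# Bit 16 of an LVT entry, as in the kernel's APIC_LVT_MASKED definition.
APIC_LVT_MASKED=$(( 1 << 16 ))

# lvt_mask VALUE: print VALUE with the mask bit set (v | APIC_LVT_MASKED).
lvt_mask() {
    printf '0x%x\n' $(( $1 | APIC_LVT_MASKED ))
}

# lvt_is_masked VALUE: succeed if the mask bit is set in VALUE.
lvt_is_masked() {
    [ $(( $1 & APIC_LVT_MASKED )) -ne 0 ]
}

lvt_mask 0xfa    # prints 0x100fa: same vector, delivery inhibited
```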


Juergen

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-24 Thread vasvir

> Let's try to speed up reproducing this.
>
> I have a hunch perhaps this might be related to some BIOS controlled
> MTRRs and a mismatch which then enables the kernel to think that a type
> of MTRR write might be OK, but in fact it's not. Due to the work load
> description of this perhaps this could be related to fan control and BIOS
> control on them and against some other device MTRR. More on this suspicion
> on another thread where you provide more logs.
>
> On a kernel that you know fails can you try replacing this work load by
> making your CPU crawl to its knees quickly, perhaps 'make -j' on Linux
> building for 2, 4, 8, 16 minutes and then hit CTRL C to continue to
> hibernation to see if making the CPU fan trigger would accelerate the
> issue. If 'make -j' is too nuts to the point you can't even CTRL C it,
> try 'make -j 16'. Note that if this is true then that means a hot CPU
> could still trigger CPU fan controls on a fresh boot if the previous
> boot was CPU intensive.

OK that nailed it - with kernel 4.3 a known "bad" kernel I was able to
reproduce it in the second hibernate/resume cycle. Here is what I did in
my own words so you can spot inconsistencies.

I started a kernel compile with make -j 32. My computer was very
responsive which is an impressive feat by the way.
In a second tab in my Konsole (I am running KDE) I ran $watch sensors. I
watched the temperature of the cores go from 38 to ~70 and the cpu fan
from ~1630 to ~1900. Then the first time I hit Ctrl+C - stopped the
compilation and hibernated from KDE. I always hibernate from the KDE
start menu. Previously I had made some tests where I was hibernating from
the VT console (although sddm may have been running in VT7) and I have
managed to reproduce it - so (in my mind) it was not graphics mode
specific. From that point I am always hibernating from KDE.

The first time it worked. For the second time I thought - why hit
Ctrl+C, let's try to hibernate with the compilation running - and it
failed. Now I don't know if it failed because it was the second cycle or
because the load of the compilation was there or because of the
temperature controlled fan register you mentioned.

Then I repeated the test with a known good kernel 3.18 (which should be
773fed910d41e443e495a6bfa9ab1c2b7b13e012 according to my git bisect logs -
I have a problem there - see below) and it survived the same test
(hibernate two times with temperature being ~70).


> If this doesn't do it let's try forcing an MTRR capable driver, say
> graphics is the obvious target, try perhaps some 3D stuff or a screen
> saver prior to hibernation. Note that even if you boot nomtrr the BIOS
> may still use MTRRs, and PAT use on Linux could assume MTRR is not being
> used on drivers but the BIOS may still do something behind the scenes.
> This is actually one reason why we can't exactly remove MTRR support
> from Linux, since the BIOS may still do some wacky stuff with MTRRs, one
> example of such I was given was CPU fan control might use WC MTRRs, so
> the kernel must be aware of this, even if no MTRRs are ever used on the
> Linux kernel at all -- this is the case now as of v4.3 and onwards.
>
> If that doesn't help speed it up, maybe try both screen saver + some 3D
> stuff + cpu intensive stuff.

I have 3D effects enabled in my KDE. Since your tip succeeded in
reproducing the problem early I didn't bother, but if I should test 3D,
which program / benchmark should I run? glxgears?

>
> To help you speed up testing you can try reducing your build time by
> reducing the amount of crap you have to build:
>
> make localmodconfig
>
> That should only build things your kernel has loaded as modules or is
> already enabled (=y).
>

Thanks for the tip. I don't want to change that right now. I don't mind
waiting a little bit because I get a deb with the kernel and can retest
a known configuration. The other tip you gave, if it actually works as it
appears to, would give a great boost to the debugging cycle, to the point
of making me the bottleneck.

>
> That is commit a023748d53c10850650fe86b1c4a7d421d576451
> ("Merge branch 'x86-mm-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> Git is smart enough to tell you you've hit a merge commit and that all the
> possible commits on that merge could be the issue. This is why your bisect
> log shows a slew of commits. The next step is to bisect through the
> commits inside that merge, this will then let us identify the exact
> commit that may have caused the issue.
>
> There are a few ways to do this, my preferred way is to "unfold" a merge
> commit manually.
>
> To help keep things separate (without affecting other tests you might
> have on your other git tree and to avoid having to force you to lose
> fresh objects as you continue to build test on the other tree), I'd do
> something like this:

we will go with your preferred way - no question about that.

>
> mkdir ~/tmp
> git

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Vassilis Virvilis


On 11/23/2015 08:56 PM, Luis R. Rodriguez wrote:

> It's not clear from the log who called this MTRR call for WC that failed, I
> hope we didn't attempt a WC write on a WB region. Who owns
> e000-efff ?


How can I answer that? Is there any utility to run? peek inside /proc?

Here is an idea:
$dmesg | grep -i -5 e000
[0.220941] pci_bus :00: root bus resource [mem 0x000e4000-0x000e7fff window]
[0.220944] pci_bus :00: root bus resource [mem 0xdf20-0xfeaf window]
[0.220950] pci :00:00.0: [8086:0c00] type 00 class 0x06
[0.221012] pci :00:02.0: [8086:0412] type 00 class 0x03
[0.221021] pci :00:02.0: reg 0x10: [mem 0xf780-0xf7bf 64bit]
[0.221025] pci :00:02.0: reg 0x18: [mem 0xe000-0xefff 64bit pref]
[0.221028] pci :00:02.0: reg 0x20: [io  0xf000-0xf03f]
[0.221081] pci :00:03.0: [8086:0c0c] type 00 class 0x040300
[0.221089] pci :00:03.0: reg 0x10: [mem 0xf7c34000-0xf7c37fff 64bit]
[0.221163] pci :00:14.0: [8086:8cb1] type 00 class 0x0c0330
[0.221184] pci :00:14.0: reg 0x10: [mem 0xf7c2-0xf7c2 64bit]
--
[0.453765] calling  ioapic_init_ops+0x0/0xf @ 1
[0.453767] initcall ioapic_init_ops+0x0/0xf returned 0 after 0 usecs
[0.453770] calling  add_pcspkr+0x0/0x3b @ 1
[0.453781] initcall add_pcspkr+0x0/0x3b returned 0 after 8 usecs
[0.453783] calling  sysfb_init+0x0/0x96 @ 1
[0.453811] simple-framebuffer simple-framebuffer.0: framebuffer at 0xe000, 0x6bb000 bytes, mapped to 0xc9000200
[0.453814] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, mode=1680x1050x32, linelength=6720
[0.557233] Console: switching to colour frame buffer device 210x65
[0.660632] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
[0.661262] initcall sysfb_init+0x0/0x96 returned 0 after 202686 usecs
[0.661266] calling  audit_classes_init+0x0/0xaa @ 1
--
[9.744397] input: gspca_zc3xx as /devices/pci:00/:00:14.0/usb3/3-3/input/input18
[9.744481] usbcore: registered new interface driver gspca_zc3xx
[9.744484] initcall sd_driver_init+0x0/0x1000 [gspca_zc3xx] returned 0 after 319 usecs
[9.745108] calling  i915_init+0x0/0xa2 [i915] @ 403
[9.745542] [drm] Memory usable by graphics device = 2048M
[9.745544] checking generic (e000 6bb000) vs hw (e000 1000)
[9.745544] fb: switching to inteldrmfb from simple
[9.745831] calling  alsa_seq_device_init+0x0/0x1000 [snd_seq_device] @ 384
[9.745842] initcall alsa_seq_device_init+0x0/0x1000 [snd_seq_device] returned 0 after 9 usecs
[9.746179] calling  hmac_module_init+0x0/0x1000 [hmac] @ 471
[9.746180] initcall hmac_module_init+0x0/0x1000 [hmac] returned 0 after 0 usecs
--
[9.749840] calling  usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] @ 384
[9.751163] usbcore: registered new interface driver snd-usb-audio
[9.751166] initcall usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] returned 0 after 1292 usecs
[9.943166] Console: switching to colour dummy device 80x25
[9.943240] [drm] Replacing VGA console driver
[9.943520] mtrr: type mismatch for e000,1000 old: write-back new: write-combining
[9.943526] Failed to add WC MTRR for [e000-efff]; performance may suffer.
[9.947147] Adding 31249404k swap on /dev/sdb1.  Priority:-1 extents:1 across:31249404k FS
[9.949724] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.949728] [drm] Driver supports precise vblank timestamp query.
[9.949801] vgaarb: device changed decodes: PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[9.965787] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)

$lspci | grep 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)

Looks like it is the graphics card or the graphics driver.

I don't know if this is relevant
$ cat /proc/mtrr
reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable
reg03: base=0x0d000 ( 3328MB), size=  256MB, count=1: uncachable
reg04: base=0x0cf00 ( 3312MB), size=   16MB, count=1: uncachable
reg05: base=0x41f00 (16880MB), size=   16MB, count=1: uncachable
reg06: base=0x41ee0 (16878MB), size=2MB, count=1: uncachable
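
To see which of these entries cover a given address without doing the hex
arithmetic by hand, something like the following could be used. It is only
a sketch: mtrr_covering is a made-up helper that works from the MB figures
printed in /proc/mtrr, and the 3584 in the usage example assumes the
truncated e000... address from the log corresponds to the 3584MB base that
reg02 reports.

```shell
# mtrr_covering ADDR_MB [FILE]
# Print the /proc/mtrr entries whose [base, base + size) range, taken from
# the "(NNNNMB)" and "size=NNNN?B" fields, contains the address ADDR_MB.
mtrr_covering() {
    awk -v addr="$1" '
    {
        # Base in MB: the number inside the parentheses.
        if (!match($0, /\([ ]*[0-9]+MB\)/)) next
        base = substr($0, RSTART + 1, RLENGTH - 4) + 0

        # Size: a number followed by KB or MB.
        if (!match($0, /size=[ ]*[0-9]+[KM]B/)) next
        size = substr($0, RSTART + 5, RLENGTH - 7) + 0
        if (substr($0, RSTART + RLENGTH - 2, 1) == "K") size /= 1024

        if (addr >= base && addr < base + size) print
    }' "${2:-/proc/mtrr}"
}
```

With the output above, `mtrr_covering 3584` would list reg00 (write-back)
and reg02 (uncachable), i.e. two overlapping entries with different types.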



> What does your log show right before and after this? To find out try:
>
> dmesg | grep -5 -i mtrr



See full dmesg attached

$dmesg | grep -5 -i mtrr
[0.189333] initcall arch_kdebugfs_init+0x0/0x1f returned 0 after 0 usecs
[0.189336] calling  pt_init+0x0/0x2a4 @ 1
[0.189349] initcall pt_init+0x0/0x2a4 returned -19 after 0 usecs
[0.189352] calling  bts_init+0x0/0xa4 @ 1
[0.189354] initcall bts_init+0x0/0xa4

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Luis R. Rodriguez

On Sat, Nov 21, 2015 at 01:49:06PM +0200, Vassilis Virvilis wrote:
> On 11/20/2015 02:23 PM, Juergen Gross wrote:
> >On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
> >>>I've just found a potential issue: In case MTRR is disabled by the BIOS
> >>>the PAT register of the boot processor won't be restored after resume.
> >>>
> >>>Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> >>>early boot? If yes, this might be a BIOS option.
> >>>
> >>
> >>I don't have access right now. I will test it later tonight (This is my
> >>home machine).
> >>
> >>Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
> >>somewere else e.g. /proc /sys etc?
> >
> >I think grepping for MTRR in dmesg should be enough.
> 
> kernel 4.3 +nopat also died on the 4th or the 5th hibernate on the familiar
> (see previously attached image) "Calling lapic..." place.
>
> $dmesg | grep -i mtr for 4.3 kernel with nopat
> [0.189113] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189222] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a huge-page mapping due to MTRR override.
> [0.189559] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs
> [8.994140] mtrr: type mismatch for e000,1000 old: write-back new: write-combining
> [8.994154] Failed to add WC MTRR for [e000-efff]; performance may suffer.

It's not clear from the log who issued the WC MTRR request that failed; I
hope we didn't attempt a WC write on a WB region. Who owns
e000-efff ?

What does your log show right before and after this? To find out try:

dmesg | grep -5 -i mtrr  

Not being able to use WC is not fatal, it's just a performance issue. But if
we tried to switch a region to WC that we should not have, one that the BIOS
might rely on not being WC, that could be a big issue.

> $dmesg | grep -i mtr for 4.3 kernel with default pat enabled
> [0.189368] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189478] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a huge-page mapping due to MTRR override.
> [0.189814] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs

The fact we don't see a conflict doesn't mean an issue or conflict didn't
trigger. PAT may not have seen something the BIOS did, which would make the
kernel assume it could do something that it was not able to. The MTRR init
code should pick up on this stuff and let the kernel PAT code know if there
could be a conflict, but if for some reason that was missed, that could be an
issue.

> I also checked my BIOS. I found nothing about mtrr. My BIOS manual is 
> ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option about 
> MTRR?
> 
> Question: If we assume your theory is correct about mtrr/pat, wouldn't 
> lockup/hang reboot every time the system goes to hibernate/resume? Can this 
> assumption explain why the first hibernation/resume cycles in rapid 
> succession after system boot are working and the long ones fail somewhat more 
> consistently?
> 
> Note: With PAT enabled the system boots up significantly faster.
> 
> In the weekend I will return to 3.18-rc2 and I will try to verify my 
> bisection is correct. Double guessing your self is a terrible thing...
> 
> I will also try with nopat and I will run dmesg | grep -i mtr and post results
> 
> Unless you have any other suggestions...

Bisection on the merge commit would help.

 Luis

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Luis R. Rodriguez

On Thu, Nov 19, 2015 at 06:39:28AM +0100, Juergen Gross wrote:
> On 18/11/15 22:43, Vassilis Virvilis wrote:
> > Hi,
> > 
> > I have been hit by a hibernate/resume bug. Other people may have too:
> > The following links are consistent with my observations
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
> > https://bugs.archlinux.org/task/44807
> > 
> > Some observations:
> > 1) The first few rapid hibernation / resume cycles do not fail.
> > 
> > 2) If the computer is loaded (eclipse + chromium + firefox/iceweasel +
> > thunderbird/icedove + Konsole) helps to reproduce and lock up during resume

Let's try to speed up reproducing this.

I have a hunch perhaps this might be related to some BIOS controlled
MTRRs and a mismatch which then enables the kernel to think that a type
of MTRR write might be OK, but in fact it's not. Due to the work load
description of this perhaps this could be related to fan control and BIOS
control on them and against some other device MTRR. More on this suspicion
on another thread where you provide more logs.

On a kernel that you know fails can you try replacing this work load by making
your CPU crawl to its knees quickly, perhaps 'make -j' on Linux building for 2,
4, 8, 16 minutes and then hit CTRL C to continue to hibernation to see if
making the CPU fan trigger would accelerate the issue. If 'make -j' is too nuts
to the point you can't even CTRL C it, try 'make -j 16'. Note that if this is
true then that means a hot CPU could still trigger CPU fan controls on a
fresh boot if the previous boot was CPU intensive.

If this doesn't do it let's try forcing an MTRR capable driver, say graphics is
the obvious target, try perhaps some 3D stuff or a screen saver prior to
hibernation. Note that even if you boot nomtrr the BIOS may still use MTRRs,
and PAT use on Linux could assume MTRR is not being used on drivers but the
BIOS may still do something behind the scenes. This is actually one reason why
we can't exactly remove MTRR support from Linux, since the BIOS may still do
some wacky stuff with MTRRs, one example of such I was given was CPU fan
control might use WC MTRRs, so the kernel must be aware of this, even if no
MTRRs are ever used on the Linux kernel at all -- this is the case now as of
v4.3 and onwards.

If that doesn't help speed it up, maybe try both screen saver + some 3D
stuff + cpu intensive stuff.

To help you speed up testing you can try reducing your build time by reducing
the amount of crap you have to build:

make localmodconfig

That should only build things your kernel has loaded as modules or is already
enabled (=y).

> > 3) Long hibernation times (overnight) helps to reproduce and lock up
> > during resume
> > 
> > 4) For the bad commits (where the lockup during resume takes place) -
> > the image loading during resume is significantly faster. It is fast and
> > then it locks.
> > 
> > How I hit the problem and what I have done:
> > 
> > I am running debian unstable
> > 
> > Debian went from 3.16 to 3.19 - hence the problem raised its ugly head.
> > I upgraded diligently up to 4.2.6 - The problem persists
> > 
> > I started kernel bisection from 3.16 to 3.19 following
> > https://wiki.debian.org/DebianKernel/GitBisect
> > 
> > One month and 25 kernels later see below for the bisect log
> 
> Wow! Thanks for doing this work!
> 

Vassilis, indeed, the amount of work you have put into this is extremely
appreciated!

> Juergen
> 
> > 
> > I hit some untestable kernel that weren't booting. They were hanging at
> > "Loading ramdisk..." before any actual kernel message.
> > 
> > Looks like the first bad / untestable commit is from  Juergen Gross /
> > Thomas Gleixner Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [full PAT support]
> > 

That is commit a023748d53c10850650fe86b1c4a7d421d576451
("Merge branch 'x86-mm-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Git is smart enough to tell you you've hit a merge commit and that all the
possible commits on that merge could be the issue. This is why your bisect
log shows a slew of commits. The next step is to bisect through the commits
inside that merge; this will then let us identify the exact commit
that may have caused the issue.
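
One of those ways, for reference, is to let git bisect walk only the commits
the merge brought in. This is plain git bisect rather than the manual
unfolding I describe next, and it assumes the commit before the merge really
is good; the function names are made up:

```shell
# list_merge_commits MERGE
# Show what a merge brought in: commits reachable from its second parent
# but not from its first.
list_merge_commits() {
    git log --oneline "$1^1..$1^2"
}

# bisect_merge BAD_MERGE GOOD
# Start a bisection limited to the range between GOOD and BAD_MERGE.
bisect_merge() {
    git bisect start "$1" "$2"
}
```

E.g. `bisect_merge a023748d53c10850650fe86b1c4a7d421d576451 773fed910d41e443e495a6bfa9ab1c2b7b13e012`,
then build, test and report each step with `git bisect good` or
`git bisect bad`.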

There are a few ways to do this, my preferred way is to "unfold" a merge
commit manually.

To help keep things separate (without affecting other tests you might
have on your other git tree, and to avoid forcing you to lose fresh
objects as you continue to build and test on the other tree), I'd do
something like this:

mkdir ~/tmp
git clone ~/linux/.git linux-dev-test

cd linux-dev-test

Notice how if you do git log and search for
a023748d53c10850650fe86b1c4a7d421d576451 you'll see that the commit listed
before this is 773fed910d41e443e495a6bfa9ab1c2b7b13e012
("Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Juergen Gross

On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
> Ok I will send the .config when I get back home. I have all kernels I
> build in .deb archive. The problem is that the debian kernel build
> procedure does not hold somewhere in the deb file the git commit hash.
> 
> For which kernel would you care to see the config? 4.3?

Doesn't really matter anymore. I've posted a patch already to fix it and
got the reply, that the fix is okay, but no harm can come from the
current implementation, as the two config options are always either both
set or reset.

Juergen

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread vasvir

On 11/20/2015 02:23 PM, Juergen Gross wrote:

>
> As the BIOS obviously isn't disabling MTRR I don't think we have
> to go that route any longer.

ok.

>>
>> In the weekend I will return to 3.18-rc2 and I will try to verify my
>> bisection is correct. Double guessing your self is a terrible thing...
>
> Thanks.
>
>> I will also try with nopat and I will run dmesg | grep -i mtr and post
>> results
>>
>> Unless you have any other suggestions...
>

I hit a very big problem here. I did
$git checkout 773fed910d41e443e495a6bfa9ab1c2b7b13e012
$make (with gcc 4.8 - as all my tests)

and the resulting kernel is unbootable, hanging at "Loading initial
ramdisk...", the second line of the kernel boot.

That means my bisection is not good because this release is marked as good.

So now I am at a loss.

As I said I followed https://wiki.debian.org/DebianKernel/GitBisect

I notice now that the article suggests a step
  $make oldconfig

I did it once at the start of the bisection and after that just answered
with the default (Enter) to all config questions.

> I think we have to find out where the kernel is really hanging. Do you
> have any chance to trigger a NMI?

I am googling about it.

>
> Looking into suspend/resume code I found a strange inconsistency for
> the lapic handling:
>
> lapic_suspend()
> {
>         ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
>         if (maxlvt >= 5)
>                 apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
> #endif
>         ...
> }
>
> lapic_resume()
> {
>         ...
> #if defined(CONFIG_X86_MCE_INTEL)
>         if (maxlvt >= 5)
>                 apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
> #endif
>         ...
> }
>
> and comparing that to:
>
> clear_local_APIC()
> {
>         ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
>         if (maxlvt >= 5) {
>                 v = apic_read(APIC_LVTTHMR);
>                 apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
>         }
> #endif
> #ifdef CONFIG_X86_MCE_INTEL
>         if (maxlvt >= 6) {
>                 v = apic_read(APIC_LVTCMCI);
>                 if (!(v & APIC_LVT_MASKED))
>                         apic_write(APIC_LVTCMCI, v | APIC_LVT_MASKED);
>         }
> #endif
>         ...
> }
>

Ok I will send the .config when I get back home. I have all the kernels I
built as .deb archives. The problem is that the Debian kernel build
procedure does not record the git commit hash anywhere in the deb file.

For which kernel would you care to see the config? 4.3?

 Vassilis




Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Luis R. Rodriguez

On Sat, Nov 21, 2015 at 01:49:06PM +0200, Vassilis Virvilis wrote:
> On 11/20/2015 02:23 PM, Juergen Gross wrote:
> >On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
> >>>I've just found a potential issue: In case MTRR is disabled by the BIOS
> >>>the PAT register of the boot processor won't be restored after resume.
> >>>
> >>>Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> >>>early boot? If yes, this might be a BIOS option.
> >>>
> >>
> >>I don't have access right now. I will test it later tonight (This is my
> >>home machine).
> >>
> >>Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
> >>somewere else e.g. /proc /sys etc?
> >
> >I think grepping for MTRR in dmesg should be enough.
> 
> kernel 4.3 +nopat also died on the 4th or the 5th hibernate on the familiar 
> (see previously attached image) "Calling lapic..." place.
> 
> $dmesg | grep -i mtr for 4.3 kernel with notpat
> [0.189113] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189222] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with 
> a huge-page mapping due to MTRR override.
> [0.189559] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs
> [8.994140] mtrr: type mismatch for e000,1000 old: write-back new: 
> write-combining
> [8.994154] Failed to add WC MTRR for [e000-efff]; 
> performance may suffer.

Its not clear from the log who called this MTRR call for WC that failed, I 
hope we didn't attempt a WC wright on a WB region. Who owns
e000-efff ?

What does your log show right before and after this? To find out try:

dmesg | grep -5 -i mtrr  

Not being able to use WC is not fatal, its just a performance issue, but if we 
tried
to override a region which we should not have to WC for which another area the 
BIOS
might rely on to not be WC, that could be a big issue.

> $dmesg | grep -i mtr for 4.3 kernel with default pat enabled
> [0.189368] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189478] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with 
> a huge-page mapping due to MTRR override.
> [0.189814] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs

The fact we don't see a conflict doesn't mean an issue or conflict didn't
trigger. If PAT didn't see something the BIOS did that make the kernel assume
it could do something that it was not able to. The MTRR init code should pick
up on this stuff and let the kernel PAT code know if there could be a conflict,
but if for some reason that was missed, that could be an issue.

> I also checked my BIOS. I found nothing about mtrr. My BIOS manual is 
> ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option about 
> MTRR?
> 
> Question: If we assume your theory is correct about mtrr/pat, wouldn't 
> lockup/hang reboot every time the system goes to hibernate/resume? Can this 
> assumption explain why the first hibernation/resume cycles in rapid 
> succession after system boot are working and the long ones fail somewhat more 
> consistently?
> 
> Note: With PAT enabled the system boots up significantly faster.
> 
> In the weekend I will return to 3.18-rc2 and I will try to verify my 
> bisection is correct. Double guessing your self is a terrible thing...
> 
> I will also try with nopat and I will run dmesg | grep -i mtr and post results
> 
> Unless you have any other suggestions...

Bisection on the merge commit would help.

 Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread vasvir

On 11/20/2015 02:23 PM, Juergen Gross wrote:

>
> As the BIOS obviously isn't disabling MTRR I don't think we have
> to go that route any longer.

ok.

>>
>> In the weekend I will return to 3.18-rc2 and I will try to verify my
>> bisection is correct. Double guessing your self is a terrible thing...
>
> Thanks.
>
>> I will also try with nopat and I will run dmesg | grep -i mtr and post
>> results
>>
>> Unless you have any other suggestions...
>

I hit a very big problem here. I did
$ git checkout 773fed910d41e443e495a6bfa9ab1c2b7b13e012
$ make (with gcc 4.8, as in all my tests)

and the resulting kernel is unbootable, hanging at "Loading initial
ramdisk...", the second line of the kernel boot.

That means my bisection is not good, because this commit is marked as good.

So now I am at a loss.

As I said, I followed https://wiki.debian.org/DebianKernel/GitBisect

I notice now that the article suggests a step
  $ make oldconfig

I did it once at the start of the bisection, and after that I just answered
the default (Enter) to all config questions.

> I think we have to find out where the kernel is really hanging. Do you
> have any chance to trigger a NMI?

I am googling about it.

>
> Looking into suspend/resume code I found a strange inconsistency for
> the lapic handling:
>
> lapic_suspend()
> {
> ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
> if (maxlvt >= 5)
> apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
> #endif
> ...
> }
>
> lapic_resume()
> {
> ...
> #if defined(CONFIG_X86_MCE_INTEL)
> if (maxlvt >= 5)
> apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
> #endif
> ...
> }
>
> and comparing that to:
>
> clear_local_APIC()
> {
> ...
> #ifdef CONFIG_X86_THERMAL_VECTOR
> if (maxlvt >= 5) {
> v = apic_read(APIC_LVTTHMR);
> apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
> }
> #endif
> #ifdef CONFIG_X86_MCE_INTEL
> if (maxlvt >= 6) {
> v = apic_read(APIC_LVTCMCI);
> if (!(v & APIC_LVT_MASKED))
> apic_write(APIC_LVTCMCI, v | APIC_LVT_MASKED);
> }
> #endif
> ...
> }
>

OK, I will send the .config when I get back home. I keep every kernel I
build as a .deb archive. The problem is that the Debian kernel build
procedure does not record the git commit hash anywhere in the .deb file.

For which kernel would you like to see the config? 4.3?

 Vassilis




Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Juergen Gross

On 23/11/15 15:11, vas...@iit.demokritos.gr wrote:
> Ok I will send the .config when I get back home. I have all kernels I
> build in .deb archive. The problem is that the debian kernel build
> procedure does not hold somewhere in the deb file the git commit hash.
> 
> Fow which kernel would you care to see the config? 4.3?

Doesn't really matter anymore. I've posted a patch already to fix it and
got the reply, that the fix is okay, but no harm can come from the
current implementation, as the two config options are always either both
set or reset.

Juergen

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Vassilis Virvilis


On 11/23/2015 08:56 PM, Luis R. Rodriguez wrote:

> It's not clear from the log who called this MTRR call for WC that failed, I
> hope we didn't attempt a WC write on a WB region. Who owns
> e000-efff ?


How can I answer that? Is there any utility to run? Peek inside /proc?

Here is an idea:
$ dmesg | grep -i -5 e000
[0.220941] pci_bus :00: root bus resource [mem 0x000e4000-0x000e7fff window]
[0.220944] pci_bus :00: root bus resource [mem 0xdf20-0xfeaf window]
[0.220950] pci :00:00.0: [8086:0c00] type 00 class 0x06
[0.221012] pci :00:02.0: [8086:0412] type 00 class 0x03
[0.221021] pci :00:02.0: reg 0x10: [mem 0xf780-0xf7bf 64bit]
[0.221025] pci :00:02.0: reg 0x18: [mem 0xe000-0xefff 64bit pref]
[0.221028] pci :00:02.0: reg 0x20: [io  0xf000-0xf03f]
[0.221081] pci :00:03.0: [8086:0c0c] type 00 class 0x040300
[0.221089] pci :00:03.0: reg 0x10: [mem 0xf7c34000-0xf7c37fff 64bit]
[0.221163] pci :00:14.0: [8086:8cb1] type 00 class 0x0c0330
[0.221184] pci :00:14.0: reg 0x10: [mem 0xf7c2-0xf7c2 64bit]
--
[0.453765] calling  ioapic_init_ops+0x0/0xf @ 1
[0.453767] initcall ioapic_init_ops+0x0/0xf returned 0 after 0 usecs
[0.453770] calling  add_pcspkr+0x0/0x3b @ 1
[0.453781] initcall add_pcspkr+0x0/0x3b returned 0 after 8 usecs
[0.453783] calling  sysfb_init+0x0/0x96 @ 1
[0.453811] simple-framebuffer simple-framebuffer.0: framebuffer at 0xe000, 0x6bb000 bytes, mapped to 0xc9000200
[0.453814] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, mode=1680x1050x32, linelength=6720
[0.557233] Console: switching to colour frame buffer device 210x65
[0.660632] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
[0.661262] initcall sysfb_init+0x0/0x96 returned 0 after 202686 usecs
[0.661266] calling  audit_classes_init+0x0/0xaa @ 1
--
[9.744397] input: gspca_zc3xx as /devices/pci:00/:00:14.0/usb3/3-3/input/input18
[9.744481] usbcore: registered new interface driver gspca_zc3xx
[9.744484] initcall sd_driver_init+0x0/0x1000 [gspca_zc3xx] returned 0 after 319 usecs
[9.745108] calling  i915_init+0x0/0xa2 [i915] @ 403
[9.745542] [drm] Memory usable by graphics device = 2048M
[9.745544] checking generic (e000 6bb000) vs hw (e000 1000)
[9.745544] fb: switching to inteldrmfb from simple
[9.745831] calling  alsa_seq_device_init+0x0/0x1000 [snd_seq_device] @ 384
[9.745842] initcall alsa_seq_device_init+0x0/0x1000 [snd_seq_device] returned 0 after 9 usecs
[9.746179] calling  hmac_module_init+0x0/0x1000 [hmac] @ 471
[9.746180] initcall hmac_module_init+0x0/0x1000 [hmac] returned 0 after 0 usecs
--
[9.749840] calling  usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] @ 384
[9.751163] usbcore: registered new interface driver snd-usb-audio
[9.751166] initcall usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] returned 0 after 1292 usecs
[9.943166] Console: switching to colour dummy device 80x25
[9.943240] [drm] Replacing VGA console driver
[9.943520] mtrr: type mismatch for e000,1000 old: write-back new: write-combining
[9.943526] Failed to add WC MTRR for [e000-efff]; performance may suffer.
[9.947147] Adding 31249404k swap on /dev/sdb1.  Priority:-1 extents:1 across:31249404k FS
[9.949724] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.949728] [drm] Driver supports precise vblank timestamp query.
[9.949801] vgaarb: device changed decodes: PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[9.965787] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)

$ lspci | grep 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)

Looks like it is the graphics card or the graphics driver.

I don't know if this is relevant
$ cat /proc/mtrr
reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable
reg03: base=0x0d000 ( 3328MB), size=  256MB, count=1: uncachable
reg04: base=0x0cf00 ( 3312MB), size=   16MB, count=1: uncachable
reg05: base=0x41f00 (16880MB), size=   16MB, count=1: uncachable
reg06: base=0x41ee0 (16878MB), size=2MB, count=1: uncachable
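
To see which of the entries above cover a given physical offset, a small
filter over /proc/mtrr may help. This is only a sketch: mtrr_covering is a
made-up helper name, not an existing tool, and it assumes the "(<n>MB)" and
"size=<n>MB" (or KB) fields look the way they do in the listing above.

```shell
# Print every /proc/mtrr-style line on stdin whose range covers the given
# offset (in MB). Hypothetical helper, not part of any kernel tooling.
mtrr_covering() {
    awk -v addr_mb="$1" '
    {
        line = $0
        # The base is printed as "(<n>MB)", possibly padded with spaces.
        if (!match(line, /\([ ]*[0-9]+MB\)/)) next
        base = substr(line, RSTART + 1, RLENGTH - 4) + 0
        # The size is printed as "size=<n>MB" or "size=<n>KB".
        if (!match(line, /size=[ ]*[0-9]+[KM]B/)) next
        size = substr(line, RSTART + 5, RLENGTH - 7) + 0
        unit = substr(line, RSTART + RLENGTH - 2, 1)
        if (unit == "K") size = size / 1024
        # Half-open range check: base <= addr < base + size.
        if (addr_mb >= base && addr_mb < base + size) print line
    }'
}
```

For example, `mtrr_covering 3584 < /proc/mtrr` prints every entry whose range
contains the 3584MB offset. Applied to the listing above, that is reg00 and
reg02.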



> What does your log show right before and after this? To find out try:
>
> dmesg | grep -5 -i mtrr



See full dmesg attached

$dmesg | grep -5 -i mtrr
[0.189333] initcall arch_kdebugfs_init+0x0/0x1f returned 0 after 0 usecs
[0.189336] calling  pt_init+0x0/0x2a4 @ 1
[0.189349] initcall pt_init+0x0/0x2a4 returned -19 after 0 usecs
[0.189352] calling  bts_init+0x0/0xa4 @ 1
[0.189354] initcall bts_init+0x0/0xa4

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Luis R. Rodriguez

On Thu, Nov 19, 2015 at 06:39:28AM +0100, Juergen Gross wrote:
> On 18/11/15 22:43, Vassilis Virvilis wrote:
> > Hi,
> > 
> > I have been hit by a hibernate/resume bug. Other people may have too:
> > The following links are consistent with my observations
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
> > https://bugs.archlinux.org/task/44807
> > 
> > Some observations:
> > 1) The first few rapid hibernation / resume cycles do not fail.
> > 
> > 2) If the computer is loaded (eclipse + chromium + firefox/iceweasel +
> > thunderbird/icedove + Konsole) helps to reproduce and lock up during resume

Let's try to speed up reproducing this.

I have a hunch this might be related to some BIOS-controlled MTRRs and a
mismatch that leads the kernel to think a certain type of MTRR write is OK
when in fact it's not. Given the workload you describe, this could be
related to fan control, with the BIOS controlling the fans and conflicting
with some other device's MTRR. More on this suspicion on another thread,
where you provide more logs.

On a kernel that you know fails, can you try replacing this workload with
something that brings your CPU to its knees quickly? Perhaps 'make -j' on a
Linux build for 2, 4, 8 or 16 minutes, then hit CTRL-C and go on to
hibernation, to see if making the CPU fan kick in accelerates the issue. If
'make -j' is so nuts that you can't even CTRL-C it, try 'make -j 16'. Note
that if this is true, a hot CPU could still trigger CPU fan control on a
fresh boot if the previous boot was CPU intensive.
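
A minimal sketch of such a timed load, so there is no need to catch the build
with CTRL-C by hand. load_for is a hypothetical helper name, and the sketch
assumes timeout(1) from GNU coreutils is installed.

```shell
# Run a command for at most the given number of seconds, then stop it.
# Hypothetical helper; relies on timeout(1) from GNU coreutils.
load_for() {
    seconds=$1
    shift
    # timeout exits with status 124 when it had to stop the command.
    timeout "$seconds" "$@"
}

# Example (not run here): 4 minutes of build load, then hibernate by hand.
#   load_for $((4 * 60)) make -j 16
```

A status of 124 means the load ran for the whole period. Anything else means
the command ended on its own before the time was up.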

If this doesn't do it, let's try exercising an MTRR-capable driver. Graphics
is the obvious target: try some 3D stuff or a screen saver prior to
hibernation. Note that even if you boot nomtrr the BIOS may still use MTRRs,
and PAT use on Linux could assume MTRRs are not being used by drivers while
the BIOS still does something behind the scenes. This is actually one reason
why we can't simply remove MTRR support from Linux: the BIOS may still do
some wacky stuff with MTRRs. One example I was given is that CPU fan
control might use WC MTRRs, so the kernel must be aware of this even if no
MTRRs are ever used by the Linux kernel at all, which is the case as of
v4.3 and onwards.

If that doesn't help speed it up, maybe try a screen saver, some 3D
stuff and CPU-intensive stuff all together.

To help you speed up testing you can reduce your build time by reducing
the amount of crap you have to build:

make localmodconfig

That should build only the things your running kernel has loaded as modules
or already has built in (=y).

> > 3) Long hibernation times (overnight) helps to reproduce and lock up
> > during resume
> > 
> > 4) For the bad commits (where the lockup during resume takes place) -
> > the image loading during resume is significantly faster. It is fast and
> > then it locks.
> > 
> > How I hit the problem and what I have done:
> > 
> > I am running debian unstable
> > 
> > Debian went from 3.16 to 3.19 - hence the problem raised its ugly head.
> > I upgraded diligently up to 4.2.6 - The problem persists
> > 
> > I started kernel bisection from 3.16 to 3.19 following
> > https://wiki.debian.org/DebianKernel/GitBisect
> > 
> > One month and 25 kernels later see below for the bisect log
> 
> Wow! Thanks for doing this work!
> 

Vassilis, indeed, the amount of work you have put into this is extremely
appreciated!

> Juergen
> 
> > 
> > I hit some untestable kernel that weren't booting. They were hanging at
> > "Loading ramdisk..." before any actual kernel message.
> > 
> > Looks like the first bad / untestable commit is from  Juergen Gross /
> > Thomas Gleixner Merge branch 'x86-mm-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [full PAT support]
> > 

That is commit a023748d53c10850650fe86b1c4a7d421d576451
("Merge branch 'x86-mm-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Git is smart enough to tell you that you've hit a merge commit and that any
of the commits in that merge could be the issue. This is why your bisect
log shows a slew of commits. The next step is to bisect through the commits
inside the merge; this will then let us identify the exact commit
that may have caused the issue.

There are a few ways to do this, my preferred way is to "unfold" a merge
commit manually.
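
Another of those ways, as a rough sketch, is to let git bisect work only on
the range the merge brought in. The helper names below are made up for
illustration, and the sketch assumes the first parent of the merge (the
mainline side) has been verified good and the merge itself bad.

```shell
# List the non-merge commits that a merge commit brought into the mainline.
# Hypothetical helper; run it inside a clone of the kernel tree.
merge_commits() {
    merge=$1
    # "^1" is the first (mainline) parent, "^2" the tip of the merged branch.
    git rev-list --no-merges "${merge}^1..${merge}^2"
}

# Start a bisection limited to that merge: the merge is marked bad and its
# first parent good, so only commits from the merged branch get tested.
bisect_merge() {
    merge=$1
    git bisect start "$merge" "${merge}^1"
}
```

Running `merge_commits a023748d53c10850650fe86b1c4a7d421d576451` in the clone
would list the candidates, and `bisect_merge` with the same hash would start
the narrowed bisection.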

To help keep things separate (without affecting other tests you might
have on your other git tree, and to avoid forcing you to lose
fresh objects as you continue to build and test on the other tree), I'd do
something like this:

mkdir ~/tmp
git clone ~/linux/.git linux-dev-test

cd linux-dev-test

Notice how if you do git log and search for 
a023748d53c10850650fe86b1c4a7d421d576451
you'll see that the commit listed before this is 
773fed910d41e443e495a6bfa9ab1c2b7b13e012
("Merge branches 'x86-platform-for-linus' and 'x86-uv-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-22 Thread Juergen Gross

On 21/11/15 12:49, Vassilis Virvilis wrote:
> On 11/20/2015 02:23 PM, Juergen Gross wrote:
>> On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
>>>> I've just found a potential issue: In case MTRR is disabled by the BIOS
>>>> the PAT register of the boot processor won't be restored after resume.
>>>>
>>>> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
>>>> early boot? If yes, this might be a BIOS option.
>>>>
>>>
>>> I don't have access right now. I will test it later tonight (This is my
>>> home machine).
>>>
>>> Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
>>> somewere else e.g. /proc /sys etc?
>>
>> I think grepping for MTRR in dmesg should be enough.
> 
> kernel 4.3 +nopat also died on the 4th or the 5th hibernate on the
> familiar (see previously attached image) "Calling lapic..." place.
> 
> $dmesg | grep -i mtr for 4.3 kernel with notpat
> [0.189113] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189222] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820]
> with a huge-page mapping due to MTRR override.
> [0.189559] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0
> usecs
> [8.994140] mtrr: type mismatch for e000,1000 old: write-back
> new: write-combining
> [8.994154] Failed to add WC MTRR for
> [e000-efff]; performance may suffer.
> 
> $dmesg | grep -i mtr for 4.3 kernel with default pat enabled
> [0.189368] calling  mtrr_if_init+0x0/0x5f @ 1
> [0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
> [0.189478] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820]
> with a huge-page mapping due to MTRR override.
> [0.189814] calling  mtrr_init_finialize+0x0/0x3a @ 1
> [0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0
> usecs
> 
> 
> I also checked my BIOS. I found nothing about mtrr. My BIOS manual is
> ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option
> about MTRR?

As the BIOS obviously isn't disabling MTRR I don't think we have
to go that route any longer.

> Question: If we assume your theory is correct about mtrr/pat, wouldn't
> lockup/hang reboot every time the system goes to hibernate/resume? Can
> this assumption explain why the first hibernation/resume cycles in rapid
> succession after system boot are working and the long ones fail somewhat
> more consistently?

Hmm, I'm really not sure. It would depend on the usage of non-standard
cache mode mappings. But as MTRR isn't disabled this theory won't apply
to your problem.

> Note: With PAT enabled the system boots up significantly faster.
> 
> In the weekend I will return to 3.18-rc2 and I will try to verify my
> bisection is correct. Double guessing your self is a terrible thing...

Thanks.

> I will also try with nopat and I will run dmesg | grep -i mtr and post
> results
> 
> Unless you have any other suggestions...

I think we have to find out where the kernel is really hanging. Do you
have any chance to trigger an NMI?

Looking into suspend/resume code I found a strange inconsistency for
the lapic handling:

lapic_suspend()
{
    ...
#ifdef CONFIG_X86_THERMAL_VECTOR
    if (maxlvt >= 5)
        apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
#endif
    ...
}

lapic_resume()
{
    ...
#if defined(CONFIG_X86_MCE_INTEL)
    if (maxlvt >= 5)
        apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
#endif
    ...
}

and comparing that to:

clear_local_APIC()
{
    ...
#ifdef CONFIG_X86_THERMAL_VECTOR
    if (maxlvt >= 5) {
        v = apic_read(APIC_LVTTHMR);
        apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
    }
#endif
#ifdef CONFIG_X86_MCE_INTEL
    if (maxlvt >= 6) {
        v = apic_read(APIC_LVTCMCI);
        if (!(v & APIC_LVT_MASKED))
            apic_write(APIC_LVTCMCI, v | APIC_LVT_MASKED);
    }
#endif
    ...
}

I think it would be interesting to know your kernel config...


Juergen


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-21 Thread Vassilis Virvilis


On 11/20/2015 02:23 PM, Juergen Gross wrote:

> On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
>>> I've just found a potential issue: In case MTRR is disabled by the BIOS
>>> the PAT register of the boot processor won't be restored after resume.
>>>
>>> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
>>> early boot? If yes, this might be a BIOS option.
>>>
>>
>> I don't have access right now. I will test it later tonight (This is my
>> home machine).
>>
>> Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
>> somewere else e.g. /proc /sys etc?
>
> I think grepping for MTRR in dmesg should be enough.


Kernel 4.3 with nopat also died on the 4th or 5th hibernate, at the familiar
"Calling lapic..." place (see the previously attached image).

$ dmesg | grep -i mtr    (4.3 kernel with nopat)
[0.189113] calling  mtrr_if_init+0x0/0x5f @ 1
[0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
[0.189222] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a huge-page mapping due to MTRR override.
[0.189559] calling  mtrr_init_finialize+0x0/0x3a @ 1
[0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs
[8.994140] mtrr: type mismatch for e000,1000 old: write-back new: write-combining
[8.994154] Failed to add WC MTRR for [e000-efff]; performance may suffer.

$ dmesg | grep -i mtr    (4.3 kernel with PAT enabled, the default)
[0.189368] calling  mtrr_if_init+0x0/0x5f @ 1
[0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
[0.189478] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a huge-page mapping due to MTRR override.
[0.189814] calling  mtrr_init_finialize+0x0/0x3a @ 1
[0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs


I also checked my BIOS and found nothing about MTRR. My BIOS manual is
ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option about
MTRR?

Question: If we assume your theory about MTRR/PAT is correct, wouldn't the
lockup/hang/reboot happen on every hibernate/resume? Can this assumption
explain why the first hibernate/resume cycles in rapid succession after
system boot work, while the long ones fail somewhat more consistently?

Note: With PAT enabled the system boots up significantly faster.

Over the weekend I will return to 3.18-rc2 and try to verify that my
bisection is correct. Second-guessing yourself is a terrible thing...

I will also try with nopat, run dmesg | grep -i mtr and post the results.

Unless you have any other suggestions...

Vassilis


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread Juergen Gross

On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
>> I've just found a potential issue: In case MTRR is disabled by the BIOS
>> the PAT register of the boot processor won't be restored after resume.
>>
>> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
>> early boot? If yes, this might be a BIOS option.
>>
> 
> I don't have access right now. I will test it later tonight (This is my
> home machine).
> 
> Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
> somewere else e.g. /proc /sys etc?

I think grepping for MTRR in dmesg should be enough.


Juergen


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread vasvir

> I've just found a potential issue: In case MTRR is disabled by the BIOS
> the PAT register of the boot processor won't be restored after resume.
>
> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> early boot? If yes, this might be a BIOS option.
>

I don't have access right now. I will test it later tonight (this is my
home machine).

Would $ dmesg | grep -i mtrr suffice, or do I need to look for MTRR
somewhere else, e.g. /proc, /sys, etc.?



Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread Juergen Gross

On 20/11/15 06:25, Vassilis Virvilis wrote:
> On 11/19/2015 10:35 PM, Vassilis Virvilis wrote:
>>
>> I compiled and I am running 4.3 right now.
>>
> 
> It failed this morning. Last night I did 3 hibernate / resume cycles. In
> the last one I I also turned off the PSU (this seems to push it over the
> edge - but it may be random behavior) and it worked. This morning 7h
> later failed to resume - but it didn't hang on _lapic_resume. This time
> it rebooted - and I seem to recall this behavior for 4.2+ kernels. I
> forgot to mention it because my testing with 4.x kernels were one month
> before.
> 
> So 4.3 kernel - reboots on resume after a long hibernation time.
> 
> I am testing with 4.3 and nopat right now.

I've just found a potential issue: In case MTRR is disabled by the BIOS
the PAT register of the boot processor won't be restored after resume.

Can you check whether pr_info("MTRR: Disabled\n") has been executed in
early boot? If yes, this might be a BIOS option.
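
One way to run that check is a small filter over the boot log. This is only a
sketch: mtrr_disabled is a made-up helper name, and it assumes the message
reaches dmesg with the same text as the pr_info() string quoted above.

```shell
# Succeed if the log on stdin contains the "MTRR: Disabled" message.
# Hypothetical helper; the matched text is taken from the pr_info() above.
mtrr_disabled() {
    grep -q 'MTRR: Disabled'
}

# Example (not run here):
#   dmesg | mtrr_disabled && echo "MTRR reported as disabled"
```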


Juergen





Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread Juergen Gross

On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:
>> I've just found a potential issue: In case MTRR is disabled by the BIOS
>> the PAT register of the boot processor won't be restored after resume.
>>
>> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
>> early boot? If yes, this might be a BIOS option.
>>
> 
> I don't have access right now. I will test it later tonight (This is my
> home machine).
> 
> Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
> somewere else e.g. /proc /sys etc?

I think grepping for MTRR in dmesg should be enough.


Juergen

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread vasvir

> I've just found a potential issue: In case MTRR is disabled by the BIOS
> the PAT register of the boot processor won't be restored after resume.
>
> Can you check whether pr_info("MTRR: Disabled\n") has been executed in
> early boot? If yes, this might be a BIOS option.
>

I don't have access right now. I will test it later tonight (This is my
home machine).

Would $dmesg | grep -i mtrr suffice or I need to look for the mtrr
somewere else e.g. /proc /sys etc?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-20 Thread Juergen Gross

On 20/11/15 06:25, Vassilis Virvilis wrote:
> On 11/19/2015 10:35 PM, Vassilis Virvilis wrote:
>>
>> I compiled and I am running 4.3 right now.
>>
> 
> It failed this morning. Last night I did 3 hibernate / resume cycles. In
> the last one I also turned off the PSU (this seems to push it over the
> edge - but it may be random behavior) and it worked. This morning, 7h
> later, it failed to resume - but it didn't hang on _lapic_resume. This time
> it rebooted - and I seem to recall this behavior for 4.2+ kernels. I
> forgot to mention it because my testing with 4.x kernels was a month
> ago.
> 
> So 4.3 kernel - reboots on resume after a long hibernation time.
> 
> I am testing with 4.3 and nopat right now.

I've just found a potential issue: In case MTRR is disabled by the BIOS
the PAT register of the boot processor won't be restored after resume.

Can you check whether pr_info("MTRR: Disabled\n") has been executed in
early boot? If yes, this might be a BIOS option.


Juergen





Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-19 Thread Vassilis Virvilis


On 11/19/2015 10:35 PM, Vassilis Virvilis wrote:
> I compiled and I am running 4.3 right now.

It failed this morning. Last night I did 3 hibernate / resume cycles. In the
last one I also turned off the PSU (this seems to push it over the edge - but
it may be random behavior) and it worked. This morning, 7h later, it failed to
resume - but it didn't hang on _lapic_resume. This time it rebooted - and I
seem to recall this behavior for 4.2+ kernels. I forgot to mention it because
my testing with 4.x kernels was a month ago.

So 4.3 kernel - reboots on resume after a long hibernation time.

I am testing with 4.3 and nopat right now.

 Vassilis

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-19 Thread Vassilis Virvilis


On 11/19/2015 11:10 AM, Juergen Gross wrote:
>> So Do you want me to test 4.3 or 4.4-pre/rc*/latest linus tree. I assume
>> 4.3 for now.
>
> I think 4.3 is okay.
>
>> I will do it later tonight. It will take 2 days at least to report back

I compiled and I am running 4.3 right now.

If it fails I will try with the nopat option.

If it fails I will try 3.18-rc2+nopat to see if that fails.

>> Do you want me to run something on this like lspci, lsusb
>
> Yes, please post the output of both.

Here they are. See attachments

>> I would like this to be fixed so I am willing to do the testing.
>
> I appreciate this spirit. :-)

I appreciate the guidance. :-)

Vassilis
Bus 004 Device 002: ID 8087:8001 Intel Corp.
Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 8087:8009 Intel Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 046d:089d Logitech, Inc. QuickCam E2500 series
Bus 001 Device 003: ID 045e:0745 Microsoft Corp. Nano Transceiver v1.0 for Bluetooth
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 004 Device 002: ID 8087:8001 Intel Corp.
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         1 Single TT
  bMaxPacketSize0        64
  idVendor           0x8087 Intel Corp.
  idProduct          0x8001
  bcdDevice            0.00
  iManufacturer           0
  iProduct                0
  iSerial                 0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 Unused
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0002  1x 2 bytes
        bInterval              12
Hub Descriptor:
  bLength              11
  bDescriptorType      41
  nNbrPorts             8
  wHubCharacteristic 0x0009
    Per-port power switching
    Per-port overcurrent protection
    TT think time 8 FS bits
  bPwrOn2PwrGood        0 * 2 milli seconds
  bHubContrCurrent      0 milli Ampere
  DeviceRemovable    0x00 0x00
  PortPwrCtrlMask    0xff 0xff
 Hub Port Status:
   Port 1: .0100 power
   Port 2: .0100 power
   Port 3: .0100 power
   Port 4: .0100 power
   Port 5: .0100 power
   Port 6: .0100 power
   Port 7: .0100 power
   Port 8: .0100 power
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         0 Full speed (or root) hub
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered

Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         0 Full speed (or root) hub
  bMaxPacketSize0        64
  idVendor           0x1d6b Linux Foundation
  idProduct          0x0002 2.0 root hub
  bcdDevice            4.03
  iManufacturer           3 Linux 4.3.0+ ehci_hcd
  iProduct                2 EHCI Host Controller
  iSerial                 1 :00:1d.0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 Unused
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-19 Thread Juergen Gross

On 19/11/15 08:50, vas...@iit.demokritos.gr wrote:
> Hi,
> 
> Thanks for the quick answer
> 
>>
>> Could you please try the most recent 4.3 kernel? There has been some
>> work related to this topic after 4.2 (large page pat handling done by
>> Toshi Kani and mtrr/pat handling by Luis Rodriguez).
> 
> That means I will reset the bisection. Right? Is there any other info we
> can extract from there?

I don't see what else should be specific to that patch other than the
information that the issue occurred due to that patch. All further
diagnostic information should be obtainable with a newer kernel, too.

> So Do you want me to test 4.3 or 4.4-pre/rc*/latest linus tree. I assume
> 4.3 for now.

I think 4.3 is okay.

> I will do it later tonight. It will take 2 days at least to report back

Okay, thank you for your effort!

> 
>>
>> Another interesting information would be the exact hardware you are
>> using. Maybe we can see some similarities between yours and the other
>> two cases you referenced above.
>>
> 
> It is an i7
> Motherboard: ASROCK H97 PRO4 RETAIL
> CPU INTEL CORE I7-4790 3.60GHZ LGA1150 - BOX
> It has 16GB of RAM, one SSD and one HDD
> I have NO external graphics card
> 
> Do you want me to run something on this like lspci, lsusb

Yes, please post the output of both.

> I upgraded the BIOS of the motherboard to the latest. This is not the
> problem though because I upgraded after the problem occurred as a counter
> measure in case I was hit by a buggy BIOS and linux had changed its
> behavior to be stricter.

BIOS was my first guess, but in case the other two reports are really
due to the same problem I doubt the BIOS is to blame (one Lenovo and one
Sony laptop).

> I experimented with ACPI compilers/decompilers and I was tempted to fix my
> ACPI tables but I didn't.
> 
> I saw the kernel command line option acpi_os=!Windows2013 but I didn't try
> it. Do you think I should try it?

You could try "nopat" as command line option.
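
Before starting a multi-day test it can be worth confirming that the option
really reached the running kernel. The sketch below is only an illustration:
has_kernel_param is a hypothetical helper, the sample command line is made
up, and on a live system the string would be read from /proc/cmdline. It does
a plain whitespace split, so quoted values containing spaces are not handled.

```python
def has_kernel_param(cmdline, name):
    """True if `name` appears on the command line as a bare flag or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


# Hypothetical command line; on a real system use open("/proc/cmdline").read().
sample_cmdline = "root=/dev/sda1 ro nopat no_console_suspend initcall_debug"

print("nopat active:", has_kernel_param(sample_cmdline, "nopat"))
```

Matching whole tokens rather than substrings avoids a false positive from an
unrelated option that merely contains the text "nopat".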

> 
>> Wow! Thanks for doing this work!
>>
> 
> I would like this to be fixed so I am willing to do the testing.

I appreciate this spirit. :-)


Juergen


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-18 Thread vasvir

Hi,

Thanks for the quick answer

>
> Could you please try the most recent 4.3 kernel? There has been some
> work related to this topic after 4.2 (large page pat handling done by
> Toshi Kani and mtrr/pat handling by Luis Rodriguez).

That means I will reset the bisection. Right? Is there any other info we
can extract from there?

So Do you want me to test 4.3 or 4.4-pre/rc*/latest linus tree. I assume
4.3 for now.

I will do it later tonight. It will take 2 days at least to report back

>
> Another interesting information would be the exact hardware you are
> using. Maybe we can see some similarities between yours and the other
> two cases you referenced above.
>

It is an i7
Motherboard: ASROCK H97 PRO4 RETAIL
CPU INTEL CORE I7-4790 3.60GHZ LGA1150 - BOX
It has 16GB of RAM, one SSD and one HDD
I have NO external graphics card

Do you want me to run something on this like lspci, lsusb

I upgraded the BIOS of the motherboard to the latest. This is not the
problem though because I upgraded after the problem occurred as a counter
measure in case I was hit by a buggy BIOS and linux had changed its
behavior to be stricter.

I experimented with ACPI compilers/decompilers and I was tempted to fix my
ACPI tables but I didn't.

I saw the kernel command line option acpi_os=!Windows2013 but I didn't try
it. Do you think I should try it?

> Wow! Thanks for doing this work!
>

I would like this to be fixed so I am willing to do the testing.

   Vassilis


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-18 Thread Juergen Gross

On 18/11/15 22:43, Vassilis Virvilis wrote:
> Hi,
> 
> I have been hit by a hibernate/resume bug. Other people may have too:
> The following links are consistent with my observations
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1490494
> https://bugs.archlinux.org/task/44807
> 
> Some observations:
> 1) The first few rapid hibernation / resume cycles do not fail.
> 
> 2) Loading the computer (eclipse + chromium + firefox/iceweasel +
> thunderbird/icedove + Konsole) helps to reproduce the lockup during resume
> 
> 3) Long hibernation times (overnight) help to reproduce the lockup
> during resume
> 
> 4) For the bad commits (where the lockup during resume takes place) -
> the image loading during resume is significantly faster. It is fast and
> then it locks.
> 
> How I hit the problem and what I have done:
> 
> I am running debian unstable
> 
> Debian went from 3.16 to 3.19 - hence the problem raised its ugly head.
> I upgraded diligently up to 4.2.6 - The problem persists

Could you please try the most recent 4.3 kernel? There has been some
work related to this topic after 4.2 (large page pat handling done by
Toshi Kani and mtrr/pat handling by Luis Rodriguez).

Another interesting information would be the exact hardware you are
using. Maybe we can see some similarities between yours and the other
two cases you referenced above.

> I added no_console_suspend initcall_debug to the kernel command line -
> see attached image of the lockup.
> 
> I added drm.debug=0xe but it didn't produce anything interesting (OK, I
> know, who am I to judge?) and the runs did not have it so I took it out
> again.
> 
> I reproduced with hibernating and resuming back to KDE and/or back to
> the text console.
> 
> I switched to the VGA console and the resume problem persists.
> 
> I started kernel bisection from 3.16 to 3.19 following
> https://wiki.debian.org/DebianKernel/GitBisect
> 
> One month and 25 kernels later see below for the bisect log

Wow! Thanks for doing this work!


Juergen

> 
> I hit some untestable kernels that weren't booting. They were hanging at
> "Loading ramdisk..." before any actual kernel message.
> 
> Looks like the first bad / untestable commit is from  Juergen Gross /
> Thomas Gleixner Merge branch 'x86-mm-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip [full PAT support]
> 
> Full disclaimer: I may have fucked up the bisection. Finding bad commits
> was semi easy - finding good commits needs a run time for 2-3 days.
> 
> I would really appreciate some help and directions to nail this down.
> 
> 
> Regards
> 
>  Vassilis Virvilis
> 
> 
> 
> bill@localhost:~/Downloads/linux$ git bisect log
> git bisect start
> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
> # bad: [bfa76d49576599a4b9f9b7a71f23d73d6dcff735] Linux 3.19
> git bisect bad bfa76d49576599a4b9f9b7a71f23d73d6dcff735
> # good: [754c780953397dd5ee5191b7b3ca67e09088ce7a] Merge branch
> 'for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
> git bisect good 754c780953397dd5ee5191b7b3ca67e09088ce7a
> # bad: [7ef58b32f571bffb7763c6252ad7527562081f34] Merge tag
> 'devicetree-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux
> git bisect bad 7ef58b32f571bffb7763c6252ad7527562081f34
> # good: [53429290a054b30e4683297409fc4627b2592315] Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
> git bisect good 53429290a054b30e4683297409fc4627b2592315
> # good: [3a647c1d7ab08145cee4b650f5e797d168846c51] Merge tag
> 'drivers-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> git bisect good 3a647c1d7ab08145cee4b650f5e797d168846c51
> # bad: [1366f5d3129f2abde606214de7afc3dd61781fa3] Merge branch
> 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
> git bisect bad 1366f5d3129f2abde606214de7afc3dd61781fa3
> # good: [151cd97630f87451cab412e40750d0e5f7581c98] Merge tag
> 'defconfig-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> git bisect good 151cd97630f87451cab412e40750d0e5f7581c98
> # good: [ecb50f0afd35a51ef487e8a54b976052eb03d729] Merge branch
> 'irq-core-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good ecb50f0afd35a51ef487e8a54b976052eb03d729
> # bad: [3a5dc1fafb016560315fe45bb4ef8bde259dd1bc] Merge branch
> 'x86-microcode-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad 3a5dc1fafb016560315fe45bb4ef8bde259dd1bc
> # good: [b6444bd0a18eb47343e16749ce80a6ebd521f124] Merge branch
> 'x86-boot-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good b6444bd0a18eb47343e16749ce80a6ebd521f124
> # bad: [a023748d53c10850650fe86b1c4a7d421d576451] Merge branch
> 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad a023748d53c10850650fe86b1c4a7d421d576451
> # good:
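
A log like the one quoted above can be summarised mechanically. The sketch
below is an illustration only: parse_bisect_log is a hypothetical helper that
collects the hashes from "git bisect good" / "git bisect bad" lines and
tolerates the "> " email quote prefix. The sample contains three entries
copied from the log above, not the full log.

```python
def parse_bisect_log(text):
    """Collect commit hashes marked good or bad in `git bisect log` output."""
    marks = {"good": [], "bad": []}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # drop email quote markers, if any
        parts = line.split()
        # Only "git bisect good <hash>" / "git bisect bad <hash>" lines count;
        # "# good: [...]" comments and "git bisect start" are skipped.
        if len(parts) == 4 and parts[:2] == ["git", "bisect"] and parts[2] in marks:
            marks[parts[2]].append(parts[3])
    return marks


# Three entries taken from the quoted log.
sample = """\
> git bisect start
> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
> # bad: [bfa76d49576599a4b9f9b7a71f23d73d6dcff735] Linux 3.19
> git bisect bad bfa76d49576599a4b9f9b7a71f23d73d6dcff735
> # bad: [a023748d53c10850650fe86b1c4a7d421d576451] Merge branch
> 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect bad a023748d53c10850650fe86b1c4a7d421d576451
"""

marks = parse_bisect_log(sample)
print("good:", len(marks["good"]), "bad:", len(marks["bad"]))
print("most recent bad:", marks["bad"][-1])
```

The most recent "bad" entry is the narrowest known-bad point so far, which in
the quoted log is the x86-mm-for-linus merge.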
