Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-23 Thread Vassilis Virvilis

On 11/23/2015 08:56 PM, Luis R. Rodriguez wrote:

It's not clear from the log who made the MTRR call for WC that failed. I
hope we didn't attempt a WC write on a WB region. Who owns
e000-efff ?


How can I answer that? Is there a utility I can run, or should I peek inside /proc?

Here is an idea:
$dmesg | grep -i -5 e000
[0.220941] pci_bus :00: root bus resource [mem 0x000e4000-0x000e7fff 
window]
[0.220944] pci_bus :00: root bus resource [mem 0xdf20-0xfeaf 
window]
[0.220950] pci :00:00.0: [8086:0c00] type 00 class 0x06
[0.221012] pci :00:02.0: [8086:0412] type 00 class 0x03
[0.221021] pci :00:02.0: reg 0x10: [mem 0xf780-0xf7bf 64bit]
[0.221025] pci :00:02.0: reg 0x18: [mem 0xe000-0xefff 64bit 
pref]
[0.221028] pci :00:02.0: reg 0x20: [io  0xf000-0xf03f]
[0.221081] pci :00:03.0: [8086:0c0c] type 00 class 0x040300
[0.221089] pci :00:03.0: reg 0x10: [mem 0xf7c34000-0xf7c37fff 64bit]
[0.221163] pci :00:14.0: [8086:8cb1] type 00 class 0x0c0330
[0.221184] pci :00:14.0: reg 0x10: [mem 0xf7c2-0xf7c2 64bit]
--
[0.453765] calling  ioapic_init_ops+0x0/0xf @ 1
[0.453767] initcall ioapic_init_ops+0x0/0xf returned 0 after 0 usecs
[0.453770] calling  add_pcspkr+0x0/0x3b @ 1
[0.453781] initcall add_pcspkr+0x0/0x3b returned 0 after 8 usecs
[0.453783] calling  sysfb_init+0x0/0x96 @ 1
[0.453811] simple-framebuffer simple-framebuffer.0: framebuffer at 
0xe000, 0x6bb000 bytes, mapped to 0xc9000200
[0.453814] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, 
mode=1680x1050x32, linelength=6720
[0.557233] Console: switching to colour frame buffer device 210x65
[0.660632] simple-framebuffer simple-framebuffer.0: fb0: simplefb 
registered!
[0.661262] initcall sysfb_init+0x0/0x96 returned 0 after 202686 usecs
[0.661266] calling  audit_classes_init+0x0/0xaa @ 1
--
[9.744397] input: gspca_zc3xx as 
/devices/pci:00/:00:14.0/usb3/3-3/input/input18
[9.744481] usbcore: registered new interface driver gspca_zc3xx
[9.744484] initcall sd_driver_init+0x0/0x1000 [gspca_zc3xx] returned 0 
after 319 usecs
[9.745108] calling  i915_init+0x0/0xa2 [i915] @ 403
[9.745542] [drm] Memory usable by graphics device = 2048M
[9.745544] checking generic (e000 6bb000) vs hw (e000 1000)
[9.745544] fb: switching to inteldrmfb from simple
[9.745831] calling  alsa_seq_device_init+0x0/0x1000 [snd_seq_device] @ 384
[9.745842] initcall alsa_seq_device_init+0x0/0x1000 [snd_seq_device] 
returned 0 after 9 usecs
[9.746179] calling  hmac_module_init+0x0/0x1000 [hmac] @ 471
[9.746180] initcall hmac_module_init+0x0/0x1000 [hmac] returned 0 after 0 
usecs
--
[9.749840] calling  usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] @ 384
[9.751163] usbcore: registered new interface driver snd-usb-audio
[9.751166] initcall usb_audio_driver_init+0x0/0x1000 [snd_usb_audio] 
returned 0 after 1292 usecs
[9.943166] Console: switching to colour dummy device 80x25
[9.943240] [drm] Replacing VGA console driver
[9.943520] mtrr: type mismatch for e000,1000 old: write-back new: 
write-combining
[9.943526] Failed to add WC MTRR for [e000-efff]; 
performance may suffer.
[9.947147] Adding 31249404k swap on /dev/sdb1.  Priority:-1 extents:1 
across:31249404k FS
[9.949724] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[9.949728] [drm] Driver supports precise vblank timestamp query.
[9.949801] vgaarb: device changed decodes: 
PCI::00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[9.965787] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: 
(null)

$lspci | grep 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen 
Core Processor Integrated Graphics Controller (rev 06)

Looks like it is the graphics card or the graphics driver.
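
For the "peek inside /proc" question, one option is /proc/iomem, which lists physical address ranges with their owners as "start-end : name" lines. The following is only a sketch: the helper name iomem_owner is made up for illustration, and on many kernels the real addresses are visible only to root.

```shell
#!/bin/sh
# iomem_owner ADDR [FILE]
# Hypothetical helper: print every /proc/iomem entry whose range contains ADDR.
# ADDR may be given in hex with a 0x prefix. FILE defaults to /proc/iomem.
iomem_owner() (
    addr=$(($1))
    file=${2:-/proc/iomem}
    while read -r range sep name; do
        # Accept only fields that look like "<hex>-<hex>".
        case $range in
            *[!0-9a-fA-F-]*|-*|*-|*-*-*) continue ;;
            *-*) ;;
            *) continue ;;
        esac
        start=$((0x${range%-*}))
        end=$((0x${range#*-}))
        if [ "$addr" -ge "$start" ] && [ "$addr" -le "$end" ]; then
            printf '%s : %s\n' "$range" "$name"
        fi
    done < "$file"
)
```

Run as root with the address of interest, for example iomem_owner 0xADDR. Parent ranges are listed before nested ones, so the last line printed should be the most specific owner.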

I don't know if this is relevant
$ cat /proc/mtrr
reg00: base=0x0 (0MB), size=16384MB, count=1: write-back
reg01: base=0x4 (16384MB), size=  512MB, count=1: write-back
reg02: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable
reg03: base=0x0d000 ( 3328MB), size=  256MB, count=1: uncachable
reg04: base=0x0cf00 ( 3312MB), size=   16MB, count=1: uncachable
reg05: base=0x41f00 (16880MB), size=   16MB, count=1: uncachable
reg06: base=0x41ee0 (16878MB), size=2MB, count=1: uncachable
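
The "type mismatch ... old: write-back new: write-combining" message seems to indicate that an existing MTRR already covers the requested range with a different type. A rough way to check which registers cover a given address is sketched below. The helper name mtrr_covering is hypothetical, and the parsing assumes the usual /proc/mtrr line layout (base in hex, size in KB or MB).

```shell
#!/bin/sh
# mtrr_covering ADDR [FILE]
# Hypothetical helper: list the MTRR registers whose range contains ADDR,
# together with their memory type. FILE defaults to /proc/mtrr.
mtrr_covering() (
    addr=$(($1))
    file=${2:-/proc/mtrr}
    # Reduce each line to: "<reg> <base-hex> <size> <K|M> <type>"
    sed -n 's/^\(reg[0-9]*\): base=0x\([0-9a-fA-F]*\) .*size= *\([0-9]*\)\([KM]\)B.*: \(.*\)$/\1 \2 \3 \4 \5/p' "$file" |
    while read -r reg base size unit type; do
        case $unit in
            K) bytes=$((size * 1024)) ;;
            M) bytes=$((size * 1024 * 1024)) ;;
        esac
        start=$((0x$base))
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt $((start + bytes)) ]; then
            printf '%s %s\n' "$reg" "$type"
        fi
    done
)
```

If more than one register is printed for the same address, the ranges overlap, which may be worth mentioning when reporting the mismatch.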



What does your log show right before and after this? To find out try:

dmesg | grep -5 -i mtrr



See full dmesg attached

$dmesg | grep -5 -i mtrr
[0.189333] initcall arch_kdebugfs_init+0x0/0x1f returned 0 after 0 usecs
[0.189336] calling  pt_init+0x0/0x2a4 @ 1
[0.189349] initcall pt_init+0x0/0x2a4 returned -19 after 0 usecs
[0.189352] calling  bts_init+0x0/0xa4 @ 1
[0.189354] initcall bts_init+0x0/0xa4 

Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-21 Thread Vassilis Virvilis

On 11/20/2015 02:23 PM, Juergen Gross wrote:

On 20/11/15 11:04, vas...@iit.demokritos.gr wrote:

I've just found a potential issue: In case MTRR is disabled by the BIOS
the PAT register of the boot processor won't be restored after resume.

Can you check whether pr_info("MTRR: Disabled\n") has been executed in
early boot? If yes, this might be a BIOS option.



I don't have access right now. I will test it later tonight (This is my
home machine).

Would $dmesg | grep -i mtrr suffice, or do I need to look for the mtrr
somewhere else, e.g. /proc, /sys etc.?


I think grepping for MTRR in dmesg should be enough.
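
As a small sketch of that check: the helper below (the name mtrr_disabled is made up) succeeds only if the kernel log on stdin contains the "MTRR: Disabled" string quoted above. It assumes the message is still in the log buffer when dmesg is run.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the log on stdin contains the
# "MTRR: Disabled" message, non-zero otherwise.
mtrr_disabled() {
    grep -q 'MTRR: Disabled'
}
```

Usage would be something like: dmesg | mtrr_disabled && echo "MTRR reported as disabled"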


Kernel 4.3 + nopat also died on the 4th or 5th hibernate, at the familiar 
"Calling lapic..." place (see previously attached image).

$dmesg | grep -i mtr for 4.3 kernel with nopat
[0.189113] calling  mtrr_if_init+0x0/0x5f @ 1
[0.189116] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
[0.189222] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a 
huge-page mapping due to MTRR override.
[0.189559] calling  mtrr_init_finialize+0x0/0x3a @ 1
[0.189560] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs
[8.994140] mtrr: type mismatch for e000,1000 old: write-back new: 
write-combining
[8.994154] Failed to add WC MTRR for [e000-efff]; 
performance may suffer.

$dmesg | grep -i mtr for 4.3 kernel with default pat enabled
[0.189368] calling  mtrr_if_init+0x0/0x5f @ 1
[0.189370] initcall mtrr_if_init+0x0/0x5f returned 0 after 0 usecs
[0.189478] pmd_set_huge: Cannot satisfy [mem 0xf800-0xf820] with a 
huge-page mapping due to MTRR override.
[0.189814] calling  mtrr_init_finialize+0x0/0x3a @ 1
[0.189815] initcall mtrr_init_finialize+0x0/0x3a returned 0 after 0 usecs


I also checked my BIOS. I found nothing about mtrr. My BIOS manual is 
ftp://europe.asrock.com/Manual/H97%20Pro4.pdf. Can you see any option about 
MTRR?

Question: If we assume your theory about mtrr/pat is correct, wouldn't the 
lockup/hang/reboot happen every time the system goes through hibernate/resume? 
Can this assumption explain why the first hibernate/resume cycles in rapid 
succession after system boot work, while the long ones fail somewhat more 
consistently?

Note: With PAT enabled the system boots up significantly faster.

Over the weekend I will return to 3.18-rc2 and try to verify that my bisection 
is correct. Second-guessing yourself is a terrible thing...

I will also try with nopat, run dmesg | grep -i mtr, and post the results.

Unless you have any other suggestions...

Vassilis

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-19 Thread Vassilis Virvilis

On 11/19/2015 10:35 PM, Vassilis Virvilis wrote:


I compiled and I am running 4.3 right now.



It failed this morning. Last night I did 3 hibernate / resume cycles. In the 
last one I also turned off the PSU (this seems to push it over the edge - but 
it may be random behavior) and it worked. This morning, 7h later, it failed to 
resume - but it didn't hang on _lapic_resume. This time it rebooted - and I 
seem to recall this behavior for 4.2+ kernels. I forgot to mention it because 
my testing with 4.x kernels was a month ago.

So 4.3 kernel - reboots on resume after a long hibernation time.
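
Repeated hibernate/resume cycles like the ones described above could be scripted roughly as follows. This is a sketch under assumptions: it relies on rtcwake from util-linux and on the RTC alarm being able to wake the machine from hibernation, and it cannot reproduce the PSU-off case. The function name hibernate_cycles and the RUN variable are made up. By default the commands are only printed.

```shell
#!/bin/sh
# hibernate_cycles COUNT SECONDS
# Hypothetical helper: hibernate COUNT times, asking the RTC to wake the
# machine SECONDS after each hibernation. Dry run unless RUN is set
# (for example RUN=sudo) to actually execute rtcwake.
hibernate_cycles() (
    count=$1
    sleep_secs=$2
    run=${RUN:-echo}
    i=1
    while [ "$i" -le "$count" ]; do
        $run rtcwake -m disk -s "$sleep_secs" || exit 1
        i=$((i + 1))
    done
)
```

For example, RUN=sudo hibernate_cycles 3 25200 would attempt three cycles of about 7h each, matching the long hibernation that failed here.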

I am testing with 4.3 and nopat right now.

 Vassilis


Re: Hibernate resume bug around 3,18-rc2 - Full PAT support

2015-11-19 Thread Vassilis Virvilis

On 11/19/2015 11:10 AM, Juergen Gross wrote:


So do you want me to test 4.3 or 4.4-pre/rc*/latest Linus tree? I assume
4.3 for now.


I think 4.3 is okay.


I will do it later tonight. It will take 2 days at least to report back


I compiled and I am running 4.3 right now.

If it fails I will try with the nopat option.

If it fails I will try 3.18-rc2+nopat to see if that fails.



Do you want me to run something on this, like lspci or lsusb?


Yes, please post the output of both.



Here they are. See attachments




I would like this to be fixed so I am willing to do the testing.


I appreciate this spirit. :-)



I appreciate the guidance. :-)


Vassilis
Bus 004 Device 002: ID 8087:8001 Intel Corp. 
Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 8087:8009 Intel Corp. 
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 046d:089d Logitech, Inc. QuickCam E2500 series
Bus 001 Device 003: ID 045e:0745 Microsoft Corp. Nano Transceiver v1.0 for 
Bluetooth
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 004 Device 002: ID 8087:8001 Intel Corp.
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         1 Single TT
  bMaxPacketSize0        64
  idVendor           0x8087 Intel Corp.
  idProduct          0x8001
  bcdDevice            0.00
  iManufacturer           0
  iProduct                0
  iSerial                 0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 Unused
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0002  1x 2 bytes
        bInterval              12
Hub Descriptor:
  bLength              11
  bDescriptorType      41
  nNbrPorts             8
  wHubCharacteristic 0x0009
    Per-port power switching
    Per-port overcurrent protection
    TT think time 8 FS bits
  bPwrOn2PwrGood        0 * 2 milli seconds
  bHubContrCurrent      0 milli Ampere
  DeviceRemovable    0x00 0x00
  PortPwrCtrlMask    0xff 0xff
 Hub Port Status:
   Port 1: .0100 power
   Port 2: .0100 power
   Port 3: .0100 power
   Port 4: .0100 power
   Port 5: .0100 power
   Port 6: .0100 power
   Port 7: .0100 power
   Port 8: .0100 power
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         0 Full speed (or root) hub
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered

Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            9 Hub
  bDeviceSubClass         0 Unused
  bDeviceProtocol         0 Full speed (or root) hub
  bMaxPacketSize0        64
  idVendor           0x1d6b Linux Foundation
  idProduct          0x0002 2.0 root hub
  bcdDevice            4.03
  iManufacturer           3 Linux 4.3.0+ ehci_hcd
  iProduct                2 EHCI Host Controller
  iSerial                 1 :00:1d.0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 Unused
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
   

Re: Debugging COW (copy on write) memory after fork: Is it possible to dump only the private anonymous memory of a process?

2013-04-08 Thread Vassilis Virvilis

On 04/06/2013 09:11 PM, Bruno Prémont wrote:

On Fri, 05 April 2013 Vassilis Virvilis  wrote:


Question


Is it possible to dump only the private anonymous memory of a process?


I don't know if that's possible, but from your background you could
probably work around it by mmap()ing the memory you need and, once
initialized, marking all of that memory read-only (if you mmap very large
chunks you can even benefit from huge-pages).

Any of the forked processes that tried to write to the data (and thus
unshare it) would then get a signal.



I can't do that. We are talking about an existing system (in perl with C 
modules) that has been parallelized in a second step.



If you allocate and initialize all of your memory in little malloc()'ed
chunks it's possibly glibc's memory housekeeping that unshares all those
pages over time.


Yes I suppose it is a series of mallocs. I could easily verify that with 
strace. However if glibc's memory housekeeping undermines the COW 
behaviour that would be very bad.


I have unit tests in which I was able to work around the usual perl 
problems that cause memory unsharing, such as reference counting and hash 
access. The garbage collector shouldn't be a problem because there is 
nothing to collect from the shared memory, only private local variables 
that go out of scope. The problem is that when I employ these 
workarounds in the live system (with considerable IO) I get 
massive unsharing. So I thought I would look at what is going on in 
two or three consecutive private memory dumps.


The point is I need to locate the source of the memory unsharing. Any 
ideas how this can be done?
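
One rough way to narrow this down without a memory dump is to compare per-mapping Private_Dirty figures from /proc/$pid/smaps across a few snapshots. The sketch below lists mappings with non-zero Private_Dirty, largest first. The helper name private_dirty_by_mapping is made up, and it only shows which mappings are being copied, not why.

```shell
#!/bin/sh
# private_dirty_by_mapping [FILE]
# Hypothetical helper: print "<kB> kB <address-range> <name>" for every
# mapping with non-zero Private_Dirty, sorted by size in descending order.
# FILE defaults to /proc/self/smaps; pass /proc/$pid/smaps for a child.
private_dirty_by_mapping() {
    awk '
        # Mapping header line: remember its range and backing name.
        /^[0-9a-f]+-[0-9a-f]+ / { range = $1; name = ($6 != "" ? $6 : "[anon]") }
        /^Private_Dirty:/ && $2 > 0 { printf "%d kB %s %s\n", $2, range, name }
    ' "${1:-/proc/self/smaps}" | sort -rn
}
```

Saving the output for a child shortly after fork and again a few minutes later, then diffing the two, should show which address ranges are growing.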


At this point I could try in-house compiled kernels if I can enable some 
logging to track this behavior. Does any knob like this exist? Even as 
an #ifdef?


Vassilis


Debugging COW (copy on write) memory after fork: Is it possible to dump only the private anonymous memory of a process?

2013-04-05 Thread Vassilis Virvilis
Hello, sorry if this is off topic. Just point me in the right direction. 
Please also cc me on the reply.


Question


Is it possible to dump only the private anonymous memory of a process?

Background
----------

I have a process that reads and initializes a large portion of 
memory (around 2.3GB). This memory is effectively read-only from 
that point on. After the initialization I fork the process into 
several children in order to take advantage of the multicore 
architecture of modern CPUs. The problem is that the program eventually 
ends up requiring number_of_processes * 2.3GB of memory, effectively 
entering swap thrashing and destroying performance.


Steps so far


The first thing I did was to monitor the memory. I found out about 
/proc/$pid/smaps and http://wingolog.org/pub/mem_usage.py.


What happens is the following:

1. The program starts, reads from disk, and has 2.3GB of private mappings.
2. The program forks. Immediately the 2.3GB become shared mappings 
   between the parent and the child. Excellent so far.
3. As time goes on and the child starts performing its tasks, the 
   shared memory slowly migrates to the private mappings of each 
   process, effectively blowing up the memory requirements.
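
The migration described above can be watched by summing the Shared_* and Private_* fields of /proc/$pid/smaps. A minimal sketch, with a made-up helper name smaps_totals:

```shell
#!/bin/sh
# smaps_totals [FILE]
# Hypothetical helper: sum the Shared_Clean/Shared_Dirty and
# Private_Clean/Private_Dirty fields (in kB) over all mappings.
# FILE defaults to /proc/self/smaps; pass /proc/$pid/smaps for a child.
smaps_totals() {
    awk '
        /^Shared_(Clean|Dirty):/  { shared  += $2 }
        /^Private_(Clean|Dirty):/ { private += $2 }
        END { printf "shared_kB=%d private_kB=%d\n", shared, private }
    ' "${1:-/proc/self/smaps}"
}
```

Running smaps_totals /proc/$pid/smaps periodically for each child should show the shared total shrinking as the private total grows.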


I thought that if I could see (dump) the private mappings of each 
process I could see from the data why the shared mappings are being 
touched, so I tried to dump the core with gcore after setting 
/proc/$pid/coredump_filter like this:


echo 0x1 > /proc/$pid/coredump_filter
gcore $pid

Unfortunately it always dumps 2.3GB despite the setting in 
/proc/$pid/coredump_filter, which selects private anonymous mappings.
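
For reference, the coredump_filter value is a bit mask. The sketch below decodes it using the bit meanings documented in core(5) for bits 0 to 6. The helper name decode_coredump_filter is made up.

```shell
#!/bin/sh
# decode_coredump_filter MASK
# Hypothetical helper: print the mapping types selected by a
# coredump_filter mask (bits 0-6 as documented in core(5)).
decode_coredump_filter() (
    mask=$(($1))
    bit=0
    for name in "anonymous private" "anonymous shared" \
                "file-backed private" "file-backed shared" \
                "ELF headers" "private huge pages" "shared huge pages"; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            printf 'bit %d: %s\n' "$bit" "$name"
        fi
        bit=$((bit + 1))
    done
)
```

The file itself reads back as hex without a prefix, so the current setting can be decoded with decode_coredump_filter "0x$(cat /proc/$pid/coredump_filter)". Note that these bits select mapping types. They do not distinguish pages still shared copy-on-write with the parent from pages already copied, since both live in the same private anonymous mapping.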


I have researched the question on Google.

I even posted it on Stack Overflow.

Any other ideas?

Thanks in advance

    Vassilis Virvilis


