Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)

2016-06-27 Thread Borislav Petkov
On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
> image restoration
> 
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
> 
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables.  Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.
> 
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
> 
> To fix that issue note that the code copying the last bits of the
> image kernel's memory contents to the page frames occupied by them
> previoulsy doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run and then it will only be used just for the
> final jump to the image kernel's entry point.
> 
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
> 
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
> 
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
> 
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away and if the the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaraneed to succeed.
> 
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm=146372852823760=2
> Reported-by: Logan Gunthorpe 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  arch/x86/power/hibernate_64.c |   90 
> --
>  arch/x86/power/hibernate_asm_64.S |   55 ++-
>  2 files changed, 102 insertions(+), 43 deletions(-)

Reported-and-tested-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)

2016-06-27 Thread Borislav Petkov
On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
> image restoration
> 
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
> 
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables.  Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.
> 
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
> 
> To fix that issue note that the code copying the last bits of the
> image kernel's memory contents to the page frames occupied by them
> previoulsy doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run and then it will only be used just for the
> final jump to the image kernel's entry point.
> 
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
> 
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
> 
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
> 
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away and if the the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaraneed to succeed.
> 
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm=146372852823760=2
> Reported-by: Logan Gunthorpe 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  arch/x86/power/hibernate_64.c |   90 
> --
>  arch/x86/power/hibernate_asm_64.S |   55 ++-
>  2 files changed, 102 insertions(+), 43 deletions(-)

Reported-and-tested-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


[PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)

2016-06-27 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Unfortunately, Boris still sees post-resume memory corruption with the patch
you have tested, but that turns out to be a result of some super-weird
corruption of a pointer on the stack which leads to a store at a wrong address
(and there's no way it can be related to the rest of the patch).

We have verified that it can be avoided by rearranging 
set_up_temporary_text_mapping()
to use fewer local variables and the appended patch contains this modification.

I also went on and changed relocate_restore_code() slightly in a similar 
fashion,
but all of those changes should not affect the behavior (unless there's 
something
insane going on behind the curtains, that is).

Kees, Logan, Boris, please try this one and let me know if it works for you.

Thanks,
Rafael


---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaraneed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm=146372852823760=2
Reported-by: Logan Gunthorpe 
Signed-off-by: Rafael J. Wysocki 
---
 arch/x86/power/hibernate_64.c |   90 --
 arch/x86/power/hibernate_asm_64.S |   55 ++-
 2 files changed, 102 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long 

[PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)

2016-06-27 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Unfortunately, Boris still sees post-resume memory corruption with the patch
you have tested, but that turns out to be a result of some super-weird
corruption of a pointer on the stack which leads to a store at a wrong address
(and there's no way it can be related to the rest of the patch).

We have verified that it can be avoided by rearranging 
set_up_temporary_text_mapping()
to use fewer local variables and the appended patch contains this modification.

I also went on and changed relocate_restore_code() slightly in a similar 
fashion,
but all of those changes should not affect the behavior (unless there's 
something
insane going on behind the curtains, that is).

Kees, Logan, Boris, please try this one and let me know if it works for you.

Thanks,
Rafael


---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaraneed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm=146372852823760=2
Reported-by: Logan Gunthorpe 
Signed-off-by: Rafael J. Wysocki 
---
 arch/x86/power/hibernate_64.c |   90 --
 arch/x86/power/hibernate_asm_64.S |   55 ++-
 2 files changed, 102 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ 

Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Thanks for the confirmation!

Rafael



Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Thanks for the confirmation!

Rafael



Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Kees Cook
On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> Hey Rafael,
>
> This patch appears to be working on my laptop. Thanks.

Same for me: resume still works with KASLR in my tests too.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Kees Cook
On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe  wrote:
> Hey Rafael,
>
> This patch appears to be working on my laptop. Thanks.

Same for me: resume still works with KASLR in my tests too.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Rafael J. Wysocki
On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote:
> Hey Rafael,

Hi,

> This patch appears to be working on my laptop. Thanks.

Thanks for the confirmation!

Rafael



Re: ktime_get_ts64() splat during resume

2016-06-21 Thread Rafael J. Wysocki
On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote:
> Hey Rafael,

Hi,

> This patch appears to be working on my laptop. Thanks.

Thanks for the confirmation!

Rafael



Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Logan Gunthorpe

Hey Rafael,

This patch appears to be working on my laptop. Thanks.

Logan

On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:

On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:

On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:

On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
 wrote:

On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:


Overall, we seem to be heading towards the "really weird" territory here.


So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?


The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.


I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).

So updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Logan Gunthorpe

Hey Rafael,

This patch appears to be working on my laptop. Thanks.

Logan

On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:

On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:

On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:

On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
 wrote:

On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:


Overall, we seem to be heading towards the "really weird" territory here.


So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?


The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.


I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).

So updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> >  wrote:
> > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  
> > > wrote:
> > >>
> > >> Overall, we seem to be heading towards the "really weird" territory here.
> > >
> > > So the whole commit that Boris bisected down to is weird to me.
> > >
> > > Why isn't the temporary text mapping just set up unconditionally in
> > > the temp_level4_pgt?
> > >
> > > Why does it have that insane "let's leave the temp_level4_pgt alone
> > > until we actually switch to it, and save away restore_pgd_addr and the
> > > restore_pgd, to then be set up at restore time"?
> > >
> > > All the other temporary mappings are set up statically in the
> > > temp_level4_pgt, why not that one?
> > 
> > The text mapping in temp_level4_pgt has to map the image kernel's
> > physical entry address to the same virtual address that the image
> > kernel had for it, because the image kernel will switch over to its
> > own page tables first and it will use its own kernel text mapping from
> > that point on.  That may not be the same as the text mapping of the
> > (currently running) restore (or "boot") kernel.
> > 
> > So if we set up the text mapping in temp_level4_pgt upfront, we
> > basically can't reference the original kernel text (or do any
> > addressing relative to it) any more after switching over to
> > temp_level4_pgt.
> > 
> > For some reason I thought that was not doable, but now that I look at
> > the code it looks like it can be done.  I'll try doing that.

I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).

So updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> >  wrote:
> > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  
> > > wrote:
> > >>
> > >> Overall, we seem to be heading towards the "really weird" territory here.
> > >
> > > So the whole commit that Boris bisected down to is weird to me.
> > >
> > > Why isn't the temporary text mapping just set up unconditionally in
> > > the temp_level4_pgt?
> > >
> > > Why does it have that insane "let's leave the temp_level4_pgt alone
> > > until we actually switch to it, and save away restore_pgd_addr and the
> > > restore_pgd, to then be set up at restore time"?
> > >
> > > All the other temporary mappings are set up statically in the
> > > temp_level4_pgt, why not that one?
> > 
> > The text mapping in temp_level4_pgt has to map the image kernel's
> > physical entry address to the same virtual address that the image
> > kernel had for it, because the image kernel will switch over to its
> > own page tables first and it will use its own kernel text mapping from
> > that point on.  That may not be the same as the text mapping of the
> > (currently running) restore (or "boot") kernel.
> > 
> > So if we set up the text mapping in temp_level4_pgt upfront, we
> > basically can't reference the original kernel text (or do any
> > addressing relative to it) any more after switching over to
> > temp_level4_pgt.
> > 
> > For some reason I thought that was not doable, but now that I look at
> > the code it looks like it can be done.  I'll try doing that.

I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).

So updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during 
image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
>  wrote:
> > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  
> > wrote:
> >>
> >> Overall, we seem to be heading towards the "really weird" territory here.
> >
> > So the whole commit that Boris bisected down to is weird to me.
> >
> > Why isn't the temporary text mapping just set up unconditionally in
> > the temp_level4_pgt?
> >
> > Why does it have that insane "let's leave the temp_level4_pgt alone
> > until we actually switch to it, and save away restore_pgd_addr and the
> > restore_pgd, to then be set up at restore time"?
> >
> > All the other temporary mappings are set up statically in the
> > temp_level4_pgt, why not that one?
> 
> The text mapping in temp_level4_pgt has to map the image kernel's
> physical entry address to the same virtual address that the image
> kernel had for it, because the image kernel will switch over to its
> own page tables first and it will use its own kernel text mapping from
> that point on.  That may not be the same as the text mapping of the
> (currently running) restore (or "boot") kernel.
> 
> So if we set up the text mapping in temp_level4_pgt upfront, we
> basically can't reference the original kernel text (or do any
> addressing relative to it) any more after switching over to
> temp_level4_pgt.
> 
> For some reason I thought that was not doable, but now that I look at
> the code it looks like it can be done.  I'll try doing that.
> 
> > I suspect whatever corruption happens boils down to the same issue
> > that made people do that odd decision in the first place.
> 
> The breakage appears to happen regardless of these changes, though.
> 
> > And regardless, those games are too ugly to live. So I would suggest
> > that that original commit should just be considered broken, and
> > reverted (or just removed - I'm not sure if it's in a stable branch or
> > not) and the fix be rethought so that the code mapping can be done
> > once and for all and *without* the extra games.
> 
> OK

Below is a new simplified version of the offending patch (without the
restore_pgd_addr and restore_pgd things).  It works for me. :-)

Logan, Kees, can you please check if this one works for you too?

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image 
restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
>  wrote:
> > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  
> > wrote:
> >>
> >> Overall, we seem to be heading towards the "really weird" territory here.
> >
> > So the whole commit that Boris bisected down to is weird to me.
> >
> > Why isn't the temporary text mapping just set up unconditionally in
> > the temp_level4_pgt?
> >
> > Why does it have that insane "let's leave the temp_level4_pgt alone
> > until we actually switch to it, and save away restore_pgd_addr and the
> > restore_pgd, to then be set up at restore time"?
> >
> > All the other temporary mappings are set up statically in the
> > temp_level4_pgt, why not that one?
> 
> The text mapping in temp_level4_pgt has to map the image kernel's
> physical entry address to the same virtual address that the image
> kernel had for it, because the image kernel will switch over to its
> own page tables first and it will use its own kernel text mapping from
> that point on.  That may not be the same as the text mapping of the
> (currently running) restore (or "boot") kernel.
> 
> So if we set up the text mapping in temp_level4_pgt upfront, we
> basically can't reference the original kernel text (or do any
> addressing relative to it) any more after switching over to
> temp_level4_pgt.
> 
> For some reason I thought that was not doable, but now that I look at
> the code it looks like it can be done.  I'll try doing that.
> 
> > I suspect whatever corruption happens boils down to the same issue
> > that made people do that odd decision in the first place.
> 
> The breakage appears to happen regardless of these changes, though.
> 
> > And regardless, those games are too ugly to live. So I would suggest
> > that that original commit should just be considered broken, and
> > reverted (or just removed - I'm not sure if it's in a stable branch or
> > not) and the fix be rethought so that the code mapping can be done
> > once and for all and *without* the extra games.
> 
> OK

Below is a new simplified version of the offending patch (without the
restore_pgd_addr and restore_pgd things).  It works for me. :-)

Logan, Kees, can you please check if this one works for you too?

Thanks,
Rafael

---
From: Rafael J. Wysocki 
Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image 
restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previoulsy doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
 wrote:
> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:
>>
>> Overall, we seem to be heading towards the "really weird" territory here.
>
> So the whole commit that Boris bisected down to is weird to me.
>
> Why isn't the temporary text mapping just set up unconditionally in
> the temp_level4_pgt?
>
> Why does it have that insane "let's leave the temp_level4_pgt alone
> until we actually switch to it, and save away restore_pgd_addr and the
> restore_pgd, to then be set up at restore time"?
>
> All the other temporary mappings are set up statically in the
> temp_level4_pgt, why not that one?

The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.

> I suspect whatever corruption happens boils down to the same issue
> that made people do that odd decision in the first place.

The breakage appears to happen regardless of these changes, though.

> And regardless, those games are too ugly to live. So I would suggest
> that that original commit should just be considered broken, and
> reverted (or just removed - I'm not sure if it's in a stable branch or
> not) and the fix be rethought so that the code mapping can be done
> once and for all and *without* the extra games.

OK

Thanks,
Rafael


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
 wrote:
> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:
>>
>> Overall, we seem to be heading towards the "really weird" territory here.
>
> So the whole commit that Boris bisected down to is weird to me.
>
> Why isn't the temporary text mapping just set up unconditionally in
> the temp_level4_pgt?
>
> Why does it have that insane "let's leave the temp_level4_pgt alone
> until we actually switch to it, and save away restore_pgd_addr and the
> restore_pgd, to then be set up at restore time"?
>
> All the other temporary mappings are set up statically in the
> temp_level4_pgt, why not that one?

The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.

> I suspect whatever corruption happens boils down to the same issue
> that made people do that odd decision in the first place.

The breakage appears to happen regardless of these changes, though.

> And regardless, those games are too ugly to live. So I would suggest
> that that original commit should just be considered broken, and
> reverted (or just removed - I'm not sure if it's in a stable branch or
> not) and the fix be rethought so that the code mapping can be done
> once and for all and *without* the extra games.

OK

Thanks,
Rafael


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Linus Torvalds
On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:
>
> Overall, we seem to be heading towards the "really weird" territory here.

So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?

I suspect whatever corruption happens boils down to the same issue
that made people do that odd decision in the first place.

And regardless, those games are too ugly to live. So I would suggest
that that original commit should just be considered broken, and
reverted (or just removed - I'm not sure if it's in a stable branch or
not) and the fix be rethought so that the code mapping can be done
once and for all and *without* the extra games.

 Linus


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Linus Torvalds
On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki  wrote:
>
> Overall, we seem to be heading towards the "really weird" territory here.

So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?

I suspect whatever corruption happens boils down to the same issue
that made people do that odd decision in the first place.

And regardless, those games are too ugly to live. So I would suggest
that that original commit should just be considered broken, and
reverted (or just removed - I'm not sure if it's in a stable branch or
not) and the fix be rethought so that the code mapping can be done
once and for all and *without* the extra games.

 Linus


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
> > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> >> A couple of questions:
> >> - I guess this is reproducible 100% of the time?
> >
> > Yap.
> >
> > I took latest Linus + tip/master which has your commit.
> >
> >> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> >> does it still crash in the same way?
> >
> > My suspend to disk script does:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> > echo "shutdown" > /sys/power/disk
> > echo "disk" > /sys/power/state
> >
> > I don't use anything else for years now.
> >
> >> - Are both the image and boot kernels the same binary?
> >
> > Yep.
> 
> OK, we need to find out what's wrong, then.

Boris is traveling this week, so the investigation will continue when he's
back, but we spent quite some time on this during the weekend, so below is
a summary of what we found plus some comments.

Overall, we seem to be heading towards the "really weird" territory here.

> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

First of all, reverting the changes in hibernate_asm_64.S doesn't make the
breakage go away, which means that the most essential changes in
"x86/power/64: Fix crash whan the hibernation code passes control to the image
kernel" don't make a difference.

Moreover, after some back-and-forth we narrowed down the memory corruption to
a single line in the (new) prepare_temporary_text_mapping() function and it
turned out that it is triggered by scribbling on a page returned by
get_safe_page(), immediately after obtaining it.  Memory is corrupted no matter
how the page is written to.  In particular, filling that page with all ones
triggers memory corruption later, for example.

So appended is the last tested patch sufficient to trigger the breakage on the
Boris' system (modulo some comments).  Quite evidently, the breakage is
triggered by the memset() line in prepare_temporary_text_mapping().

In addition to the above, there are some interesting observations from the
investigation so far.

First, what is corrupted is the image kernel memory.  This means that the line
in question scribbles on image data (which are stored in memory at that point)
somewhere.  Nevertheless, the image kernel is evidently able to continue after
it has received control and carry out the full device resume up to the point
in which user space is thawed and then it crashes as a result of page tables
corruption.  How those page tables are corrupted depends on what is written to
the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course,
that write happens in the "boot" or "restore" kernel, before jumping to the
image one).

The above is what we know and now conclusions that it seems to lead to.

Generally, I see two possible scenarios here.

One is that the changes not directly related to prepare_temporary_text_mapping()
in the patch below, eg. the changes in arch_hibernation_header_save/restore()
(and related), make a difference, but that doesn't look likely to me.  In any
case, that will be the first thing to try next, as it is relatively easy.

The second scenario is fundamentally less attractive, because it means that the
address we end up with in pmd in prepare_temporary_text_mapping() is bogus.

My first reaction to that option would be "Well, what if get_safe_page() is
broken somehow?", but that's not so simple.  Namely, for the observed corruption
to happen, get_safe_page() would need to return a page that (1) had already
been allocated once before and (2) such that image data had been written to
it already.  Still, the hibernation core only writes image data to pages that
have been explicitly allocated by it and it never frees them individually
(in particular, they are never freed before jumping to the image kernel at
all).  It sets two bits in two different bitmaps for every page allocated by it
and checks those two bits every time before writing image data to any page.
One may think "Well, maybe the bitmaps don't work correctly then and the bits
are set when they shouldn't be sometimes?", but the problem with that line of
reasoning is that get_safe_page() checks one of those very bits before returning
a page and the page is only returned if that bit is clear.  Thus, even if the
bits were set when they shouldn't be sometimes, get_safe_page() would still
not return such a page to the caller.

The next question to ask might be "What if, horror shock, get_zeroed_page()
called by get_safe_page() returns a page that has been allocated already?", but
then the bitmap check mentioned above would still prevent get_safe_page() from
returning that page (the bit in question would be set for it).

Hence, the only way get_safe_page() might return a 

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
> > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> >> A couple of questions:
> >> - I guess this is reproducible 100% of the time?
> >
> > Yap.
> >
> > I took latest Linus + tip/master which has your commit.
> >
> >> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> >> does it still crash in the same way?
> >
> > My suspend to disk script does:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> > echo "shutdown" > /sys/power/disk
> > echo "disk" > /sys/power/state
> >
> > I don't use anything else for years now.
> >
> >> - Are both the image and boot kernels the same binary?
> >
> > Yep.
> 
> OK, we need to find out what's wrong, then.

Boris is traveling this week, so the investigation will continue when he's
back, but we spent quite some time on this during the weekend, so below is
a summary of what we found plus some comments.

Overall, we seem to be heading towards the "really weird" territory here.

> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

First of all, reverting the changes in hibernate_asm_64.S doesn't make the
breakage go away, which means that the most essential changes in
"x86/power/64: Fix crash whan the hibernation code passes control to the image
kernel" don't make a difference.

Moreover, after some back-and-forth we narrowed down the memory corruption to
a single line in the (new) prepare_temporary_text_mapping() function and it
turned out that it is triggered by scribbling on a page returned by
get_safe_page(), immediately after obtaining it.  Memory is corrupted no matter
how the page is written to.  In particular, filling that page with all ones
triggers memory corruption later, for example.

So appended is the last tested patch sufficient to trigger the breakage on the
Boris' system (modulo some comments).  Quite evidently, the breakage is
triggered by the memset() line in prepare_temporary_text_mapping().

In addition to the above, there are some interesting observations from the
investigation so far.

First, what is corrupted is the image kernel memory.  This means that the line
in question scribbles on image data (which are stored in memory at that point)
somewhere.  Nevertheless, the image kernel is evidently able to continue after
it has received control and carry out the full device resume up to the point
in which user space is thawed and then it crashes as a result of page tables
corruption.  How those page tables are corrupted depends on what is written to
the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course,
that write happens in the "boot" or "restore" kernel, before jumping to the
image one).

The above is what we know and now conclusions that it seems to lead to.

Generally, I see two possible scenarios here.

One is that the changes not directly related to prepare_temporary_text_mapping()
in the patch below, eg. the changes in arch_hibernation_header_save/restore()
(and related), make a difference, but that doesn't look likely to me.  In any
case, that will be the first thing to try next, as it is relatively easy.

The second scenario is fundamentally less attractive, because it means that the
address we end up with in pmd in prepare_temporary_text_mapping() is bogus.

My first reaction to that option would be "Well, what if get_safe_page() is
broken somehow?", but that's not so simple.  Namely, for the observed corruption
to happen, get_safe_page() would need to return a page that (1) had already
been allocated once before and (2) such that image data had been written to
it already.  Still, the hibernation core only writes image data to pages that
have been explicitly allocated by it and it never frees them individually
(in particular, they are never freed before jumping to the image kernel at
all).  It sets two bits in two different bitmaps for every page allocated by it
and checks those two bits every time before writing image data to any page.
One may think "Well, maybe the bitmaps don't work correctly then and the bits
are set when they shouldn't be sometimes?", but the problem with that line of
reasoning is that get_safe_page() checks one of those very bits before returning
a page and the page is only returned if that bit is clear.  Thus, even if the
bits were set when they shouldn't be sometimes, get_safe_page() would still
not return such a page to the caller.

The next question to ask might be "What if, horror shock, get_zeroed_page()
called by get_safe_page() returns a page that has been allocated already?", but
then the bitmap check mentioned above would still prevent get_safe_page() from
returning that page (the bit in question would be set for it).

Hence, the only way get_safe_page() might return a wrong page in

Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Monday, June 20, 2016 04:17:13 PM chenyu wrote:
> On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki  wrote:
> > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
> >> Ok,
> >>
> >> bisect is done, full log below.
> >>
> >> Rafael, that fix
> >>
> >>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
> >> control to the image kernel")
> >>
> >> breaks s2disk here. It explodes during resume and a statically allocated
> >> struct's member is NULL. See
> >>
> >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
> >>
> >> for the splat and some debugging attempts.
> >>
> >> Reverting 70595b479ce1 fixes the issue here.
> >
> > Quite evidently, memory is corrupted in the image kernel, but this
> > particular commit only affects the boot kernel, so it can't really
> > corrupt anything in the image one.
> >
> In previous patch,
> before we jump to the new kernel entry, we add the
> text mapping to temp_level4_pgt,
> 
>  /* switch over to the temporary kernel text mapping */
> movq%r8, (%r9)
> If I understand correctly,  r9 contains the virtual address
> of restore_pgd_addr, since the page table for restore_pgd_addr might be
> incoherent across hibernation(as NX patch changes the kernel text mapping
> to dynamically mapping), so we might write pmd entry to an incorrect place in
> temp_level4_pgt?

I'm not sure what you mean.

r9 contains the address previously stored in restore_pgd_addr.  This is the
address of the cell in (the page pointed to by temp_level4_pgt) to write the
new pgd entry to.

Why do you think it may be incorrect?

In any case, the corruption reported by Boris also happens if all of the
assembly changes in commit 70595b479ce1 are reverted, in which case the code
should work as before that commit modulo some additional actions that
shouldn't matter.  Those additional actions turn out to matter, though.

I'll write about that in detail shortly.

Thanks,
Rafael



Re: ktime_get_ts64() splat during resume

2016-06-20 Thread Rafael J. Wysocki
On Monday, June 20, 2016 04:17:13 PM chenyu wrote:
> On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki  wrote:
> > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
> >> Ok,
> >>
> >> bisect is done, full log below.
> >>
> >> Rafael, that fix
> >>
> >>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
> >> control to the image kernel")
> >>
> >> breaks s2disk here. It explodes during resume and a statically allocated
> >> struct's member is NULL. See
> >>
> >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
> >>
> >> for the splat and some debugging attempts.
> >>
> >> Reverting 70595b479ce1 fixes the issue here.
> >
> > Quite evidently, memory is corrupted in the image kernel, but this
> > particular commit only affects the boot kernel, so it can't really
> > corrupt anything in the image one.
> >
> In previous patch,
> before we jump to the new kernel entry, we add the
> text mapping to temp_level4_pgt,
> 
>  /* switch over to the temporary kernel text mapping */
> movq%r8, (%r9)
> If I understand correctly,  r9 contains the virtual address
> of restore_pgd_addr, since the page table for restore_pgd_addr might be
> incoherent across hibernation(as NX patch changes the kernel text mapping
> to dynamically mapping), so we might write pmd entry to an incorrect place in
> temp_level4_pgt?

I'm not sure what you mean.

r9 contains the address previously stored in restore_pgd_addr.  This is the
address of the cell in (the page pointed to by temp_level4_pgt) to write the
new pgd entry to.

Why do you think it may be incorrect?

In any case, the corruption reported by Boris also happens if all of the
assembly changes in commit 70595b479ce1 are reverted, in which case the code
should work as before that commit modulo some additional actions that
shouldn't matter.  Those additional actions turn out to matter, though.

I'll write about that in detail shortly.

Thanks,
Rafael



Re: ktime_get_ts64() splat during resume

2016-06-20 Thread chenyu
On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki  wrote:
> On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
>> Ok,
>>
>> bisect is done, full log below.
>>
>> Rafael, that fix
>>
>>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
>> control to the image kernel")
>>
>> breaks s2disk here. It explodes during resume and a statically allocated
>> struct's member is NULL. See
>>
>> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
>>
>> for the splat and some debugging attempts.
>>
>> Reverting 70595b479ce1 fixes the issue here.
>
> Quite evidently, memory is corrupted in the image kernel, but this
> particular commit only affects the boot kernel, so it can't really
> corrupt anything in the image one.
>
In previous patch,
before we jump to the new kernel entry, we add the
text mapping to temp_level4_pgt,

 /* switch over to the temporary kernel text mapping */
movq%r8, (%r9)
If I understand correctly,  r9 contains the virtual address
of restore_pgd_addr, since the page table for restore_pgd_addr might be
incoherent across hibernation(as NX patch changes the kernel text mapping
to dynamically mapping), so we might write pmd entry to an incorrect place in
temp_level4_pgt?

Yu


Re: ktime_get_ts64() splat during resume

2016-06-20 Thread chenyu
On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki  wrote:
> On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
>> Ok,
>>
>> bisect is done, full log below.
>>
>> Rafael, that fix
>>
>>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
>> control to the image kernel")
>>
>> breaks s2disk here. It explodes during resume and a statically allocated
>> struct's member is NULL. See
>>
>> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
>>
>> for the splat and some debugging attempts.
>>
>> Reverting 70595b479ce1 fixes the issue here.
>
> Quite evidently, memory is corrupted in the image kernel, but this
> particular commit only affects the boot kernel, so it can't really
> corrupt anything in the image one.
>
In previous patch,
before we jump to the new kernel entry, we add the
text mapping to temp_level4_pgt,

 /* switch over to the temporary kernel text mapping */
movq%r8, (%r9)
If I understand correctly,  r9 contains the virtual address
of restore_pgd_addr, since the page table for restore_pgd_addr might be
incoherent across hibernation(as NX patch changes the kernel text mapping
to dynamically mapping), so we might write pmd entry to an incorrect place in
temp_level4_pgt?

Yu


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki  wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
>> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>>> A couple of questions:
>>> - I guess this is reproducible 100% of the time?
>>
>> Yap.
>>
>> I took latest Linus + tip/master which has your commit.
>>
>>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>>> does it still crash in the same way?
>>
>> My suspend to disk script does:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>> echo "shutdown" > /sys/power/disk
>> echo "disk" > /sys/power/state
>>
>> I don't use anything else for years now.
>>
>>> - Are both the image and boot kernels the same binary?
>>
>> Yep.
>
> OK, we need to find out what's wrong, then.
>
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

Due to the nature of the memory corruption you are seeing (the same
address appears to be corrupted every time in the same way) with 100%
reproducibility and due to the fact that new code added by the commit
in question only writes to dynamically allocated memory (and you've
already verified that https://patchwork.kernel.org/patch/9185165/
doesn't help), it is quite unlikely that the memory corruption comes
from that commit itself.

However, I see a couple of ways in which that commit might uncover a latent bug.

First, it changed the layout of the kernel text by adding the
PAGE_SIZE alignment of restore_registers().  That likely pushed stuff
behind it to new offsets, possibly including the static struct field
that is now corrupted.  Now, say that the memory corruption has always
happened at the same memory location, but previously there was nothing
in there or whatever was in there, wasn't vital.  In that case the
memory corruption might have gone unnoticed until the commit in
question caused things to move to new locations and the corrupted
location contains a vital piece of data now.  This is my current
theory.

Second, it added two invocations of get_safe_page() that in theory
might push things a bit too far towards the limit and they started to
break there.  I don't see how that can happen ATM, but I'm not
excluding this possibility yet.  It seems, though, that in that case
the corruption would be more random and I certainly wouldn't expect it
to happen at the same location every time.

One more indicator is that multiple people reported success with that
commit and in many hibernation runs, so the problem appears to be very
specific to your system and/or kernel configuration.  It also is
interesting that the memory corruption only becomes visible during the
thawing of tasks and given the piece of data that is corrupted, it
should become visible much earlier if the memory was corrupted during
image restoration by the boot kernel.

In any case, reverting the changes in hibernate_asm_64.S alone should
show us the direction, but if it makes things work again, I would try
to change the restore_registers() alignment to something smaller, like
64 (which should be safe) and see what happens then.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki  wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
>> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>>> A couple of questions:
>>> - I guess this is reproducible 100% of the time?
>>
>> Yap.
>>
>> I took latest Linus + tip/master which has your commit.
>>
>>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>>> does it still crash in the same way?
>>
>> My suspend to disk script does:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>> echo "shutdown" > /sys/power/disk
>> echo "disk" > /sys/power/state
>>
>> I don't use anything else for years now.
>>
>>> - Are both the image and boot kernels the same binary?
>>
>> Yep.
>
> OK, we need to find out what's wrong, then.
>
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

Due to the nature of the memory corruption you are seeing (the same
address appears to be corrupted every time in the same way) with 100%
reproducibility and due to the fact that new code added by the commit
in question only writes to dynamically allocated memory (and you've
already verified that https://patchwork.kernel.org/patch/9185165/
doesn't help), it is quite unlikely that the memory corruption comes
from that commit itself.

However, I see a couple of ways in which that commit might uncover a latent bug.

First, it changed the layout of the kernel text by adding the
PAGE_SIZE alignment of restore_registers().  That likely pushed stuff
behind it to new offsets, possibly including the static struct field
that is now corrupted.  Now, say that the memory corruption has always
happened at the same memory location, but previously there was nothing
in there or whatever was in there, wasn't vital.  In that case the
memory corruption might have gone unnoticed until the commit in
question caused things to move to new locations and the corrupted
location contains a vital piece of data now.  This is my current
theory.

Second, it added two invocations of get_safe_page() that in theory
might push things a bit too far towards the limit and they started to
break there.  I don't see how that can happen ATM, but I'm not
excluding this possibility yet.  It seems, though, that in that case
the corruption would be more random and I certainly wouldn't expect it
to happen at the same location every time.

One more indicator is that multiple people reported success with that
commit and in many hibernation runs, so the problem appears to be very
specific to your system and/or kernel configuration.  It also is
interesting that the memory corruption only becomes visible during the
thawing of tasks and given the piece of data that is corrupted, it
should become visible much earlier if the memory was corrupted during
image restoration by the boot kernel.

In any case, reverting the changes in hibernate_asm_64.S alone should
show us the direction, but if it makes things work again, I would try
to change the restore_registers() alignment to something smaller, like
64 (which should be safe) and see what happens then.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>> A couple of questions:
>> - I guess this is reproducible 100% of the time?
>
> Yap.
>
> I took latest Linus + tip/master which has your commit.
>
>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>> does it still crash in the same way?
>
> My suspend to disk script does:
>
> echo 3 > /proc/sys/vm/drop_caches
> echo "shutdown" > /sys/power/disk
> echo "disk" > /sys/power/state
>
> I don't use anything else for years now.
>
>> - Are both the image and boot kernels the same binary?
>
> Yep.

OK, we need to find out what's wrong, then.

First, please revert the changes in hibernate_asm_64.S alone and see
if that makes any difference.

Hibernation should still work then most of the time, but the bug fixed
by this commit will be back.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov  wrote:
> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>> A couple of questions:
>> - I guess this is reproducible 100% of the time?
>
> Yap.
>
> I took latest Linus + tip/master which has your commit.
>
>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>> does it still crash in the same way?
>
> My suspend to disk script does:
>
> echo 3 > /proc/sys/vm/drop_caches
> echo "shutdown" > /sys/power/disk
> echo "disk" > /sys/power/state
>
> I don't use anything else for years now.
>
>> - Are both the image and boot kernels the same binary?
>
> Yep.

OK, we need to find out what's wrong, then.

First, please revert the changes in hibernate_asm_64.S alone and see
if that makes any difference.

Hibernation should still work then most of the time, but the bug fixed
by this commit will be back.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> A couple of questions:
> - I guess this is reproducible 100% of the time?

Yap.

I took latest Linus + tip/master which has your commit.

> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> does it still crash in the same way?

My suspend to disk script does:

echo 3 > /proc/sys/vm/drop_caches
echo "shutdown" > /sys/power/disk
echo "disk" > /sys/power/state

I don't use anything else for years now.

> - Are both the image and boot kernels the same binary?

Yep.

> Sure.  I thought Tony would pick them up.

Oh ok, next time I'll pester him.

:-)

But seriously, should we route the RAS-relevant stuff touching
drivers/acpi/apei/ through our tree instead? Provided you're fine with
them?

This way we'll unload some of the burden off you...

Thanks!

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> A couple of questions:
> - I guess this is reproducible 100% of the time?

Yap.

I took latest Linus + tip/master which has your commit.

> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> does it still crash in the same way?

My suspend to disk script does:

echo 3 > /proc/sys/vm/drop_caches
echo "shutdown" > /sys/power/disk
echo "disk" > /sys/power/state

I don't use anything else for years now.

> - Are both the image and boot kernels the same binary?

Yep.

> Sure.  I thought Tony would pick them up.

Oh ok, next time I'll pester him.

:-)

But seriously, should we route the RAS-relevant stuff touching
drivers/acpi/apei/ through our tree instead? Provided you're fine with
them?

This way we'll unload some of the burden off you...

Thanks!

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
> Ok,
>
> bisect is done, full log below.
>
> Rafael, that fix
>
>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
> control to the image kernel")
>
> breaks s2disk here. It explodes during resume and a statically allocated
> struct's member is NULL. See
>
> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
>
> for the splat and some debugging attempts.
>
> Reverting 70595b479ce1 fixes the issue here.

Quite evidently, memory is corrupted in the image kernel, but this
particular commit only affects the boot kernel, so it can't really
corrupt anything in the image one.

A couple of questions:
- I guess this is reproducible 100% of the time?
- If you do "echo disk > /sys/power/state" instead of using s2disk,
does it still crash in the same way?
- Are both the image and boot kernels the same binary?

> @stable folk: You might want to hold off on picking up that one yet...
>
>
> Now, I have serial connected to that box so if you want me to dump stuff
> and try patches, let me know.
>
> Btw, while I have your ear, can you pick up my einj fixes please?
>
> :-)
>
> -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de

Sure.  I thought Tony would pick them up.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Rafael J. Wysocki
On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov  wrote:
> Ok,
>
> bisect is done, full log below.
>
> Rafael, that fix
>
>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
> control to the image kernel")
>
> breaks s2disk here. It explodes during resume and a statically allocated
> struct's member is NULL. See
>
> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic
>
> for the splat and some debugging attempts.
>
> Reverting 70595b479ce1 fixes the issue here.

Quite evidently, memory is corrupted in the image kernel, but this
particular commit only affects the boot kernel, so it can't really
corrupt anything in the image one.

A couple of questions:
- I guess this is reproducible 100% of the time?
- If you do "echo disk > /sys/power/state" instead of using s2disk,
does it still crash in the same way?
- Are both the image and boot kernels the same binary?

> @stable folk: You might want to hold off on picking up that one yet...
>
>
> Now, I have serial connected to that box so if you want me to dump stuff
> and try patches, let me know.
>
> Btw, while I have your ear, can you pick up my einj fixes please?
>
> :-)
>
> -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de

Sure.  I thought Tony would pick them up.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
Ok,

bisect is done, full log below.

Rafael, that fix

  70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
control to the image kernel")

breaks s2disk here. It explodes during resume and a statically allocated
struct's member is NULL. See

https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic

for the splat and some debugging attempts.

Reverting 70595b479ce1 fixes the issue here.

@stable folk: You might want to hold off on picking up that one yet...


Now, I have serial connected to that box so if you want me to dump stuff
and try patches, let me know.

Btw, while I have your ear, can you pick up my einj fixes please?

:-)

-> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de

Thanks!

git bisect start
# bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 
'tip/master' into tip-timers-splat
git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb
# good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' 
of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
git bisect good bb967271c0317de8eed56b20aae978d31507b033
# good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core'
git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f
# bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm'
git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698
# bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot'
git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17
# good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time 
checks to the siginfo compat code
git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8
# good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core'
git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1
# good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm'
git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b
# bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan 
the hibernation code passes control to the image kernel
git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c
# good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the 
Calgary IOMMU entry
git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582
# first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: 
Fix crash whan the hibernation code passes control to the image kernel

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
Ok,

bisect is done, full log below.

Rafael, that fix

  70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes 
control to the image kernel")

breaks s2disk here. It explodes during resume and a statically allocated
struct's member is NULL. See

https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic

for the splat and some debugging attempts.

Reverting 70595b479ce1 fixes the issue here.

@stable folk: You might want to hold off on picking up that one yet...


Now, I have serial connected to that box so if you want me to dump stuff
and try patches, let me know.

Btw, while I have your ear, can you pick up my einj fixes please?

:-)

-> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de

Thanks!

git bisect start
# bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 
'tip/master' into tip-timers-splat
git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb
# good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' 
of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
git bisect good bb967271c0317de8eed56b20aae978d31507b033
# good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core'
git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f
# bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm'
git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698
# bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot'
git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17
# good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time 
checks to the siginfo compat code
git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8
# good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core'
git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1
# good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm'
git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b
# bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan 
the hibernation code passes control to the image kernel
git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c
# good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the 
Calgary IOMMU entry
git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582
# first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: 
Fix crash whan the hibernation code passes control to the image kernel

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
On Fri, Jun 17, 2016 at 01:53:53PM +0200, Thomas Gleixner wrote:
> It must be initialized otherwise you won't reach suspend. I have no idea how
> that can happen.

Btw, there's one other thing I'm seeing in the boot kernel, while it suspends.
It hardly is related though:

[   42.046585] kvm: exiting hardware virtualization
[   42.062446] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 
address=0x20001000 fl]
[   42.255531] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[   42.263159] sd 3:0:0:0: [sdd] Stopping disk
[   43.090022] irq 16: nobody cared (try booting with the "irqpoll" option)
[   43.099122] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.7.0-rc3+ #4
[   43.107784] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/23
[   43.120143]   88043dd83e68 812aacf3 
88003790e400
[   43.130145]  88003790e49c 88043dd83e90 810a5c3e 
88003790e400
[   43.140118]  [   43.140201] sd 2:0:0:0: [sdc] Synchronizing 
SCSI cache
[   43.140592] sd 2:0:0:0: [sdc] Stopping disk

[   43.157217]  0010 88043dd83ec8 810a5fe7
[   43.168302] Call Trace:
[   43.173210][] dump_stack+0x67/0x94
[   43.181481]  [] __report_bad_irq+0x3e/0xd0
[   43.189626]  [] note_interrupt+0x257/0x2a0
[   43.197797]  [] handle_irq_event_percpu+0xa9/0x1f0
[   43.206655]  [] handle_irq_event+0x40/0x70
[   43.214776]  [] handle_fasteoi_irq+0xb6/0x170
[   43.223166]  [] handle_irq+0x1a/0x30
[   43.230748]  [] do_IRQ+0x66/0x100
[   43.238045]  [] common_interrupt+0x86/0x86
[   43.246135][] ? acpi_safe_halt+0x24/0x2c
[   43.254884]  [] ? acpi_safe_halt+0x22/0x2c
[   43.262987]  [] acpi_idle_do_entry+0x20/0x30
[   43.271298]  [] acpi_idle_enter+0x1dc/0x1fc
[   43.279462]  [] cpuidle_enter_state+0x10e/0x300
[   43.287980]  [] cpuidle_enter+0x17/0x20
[   43.295799]  [] call_cpuidle+0x2a/0x40
[   43.303528]  [] cpu_startup_entry+0x2dd/0x3c0
[   43.304279] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[   43.306109] sd 1:0:0:0: [sdb] Stopping disk
[   43.325777]  [] start_secondary+0x13a/0x150
[   43.333868] handlers:
[   43.338471] [] azx_interrupt
[   43.345298] Disabling IRQ #16
[   43.848595] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   43.858068] sd 0:0:0:0: [sda] Stopping disk
[   44.052892] pcieport :00:04.0: System wakeup enabled by ACPI
[   44.075715] ACPI: Preparing to enter system sleep state S5
[   44.083818] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query honored via cmdline
[   44.092258] reboot: Power down
[   44.096326] acpi_power_off called

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
On Fri, Jun 17, 2016 at 01:53:53PM +0200, Thomas Gleixner wrote:
> It must be initialized otherwise you won't reach suspend. I have no idea how
> that can happen.

Btw, there's one other thing I'm seeing in the boot kernel, while it suspends.
It hardly is related though:

[   42.046585] kvm: exiting hardware virtualization
[   42.062446] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 
address=0x20001000 fl]
[   42.255531] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[   42.263159] sd 3:0:0:0: [sdd] Stopping disk
[   43.090022] irq 16: nobody cared (try booting with the "irqpoll" option)
[   43.099122] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.7.0-rc3+ #4
[   43.107784] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/23
[   43.120143]   88043dd83e68 812aacf3 
88003790e400
[   43.130145]  88003790e49c 88043dd83e90 810a5c3e 
88003790e400
[   43.140118]  [   43.140201] sd 2:0:0:0: [sdc] Synchronizing 
SCSI cache
[   43.140592] sd 2:0:0:0: [sdc] Stopping disk

[   43.157217]  0010 88043dd83ec8 810a5fe7
[   43.168302] Call Trace:
[   43.173210][] dump_stack+0x67/0x94
[   43.181481]  [] __report_bad_irq+0x3e/0xd0
[   43.189626]  [] note_interrupt+0x257/0x2a0
[   43.197797]  [] handle_irq_event_percpu+0xa9/0x1f0
[   43.206655]  [] handle_irq_event+0x40/0x70
[   43.214776]  [] handle_fasteoi_irq+0xb6/0x170
[   43.223166]  [] handle_irq+0x1a/0x30
[   43.230748]  [] do_IRQ+0x66/0x100
[   43.238045]  [] common_interrupt+0x86/0x86
[   43.246135][] ? acpi_safe_halt+0x24/0x2c
[   43.254884]  [] ? acpi_safe_halt+0x22/0x2c
[   43.262987]  [] acpi_idle_do_entry+0x20/0x30
[   43.271298]  [] acpi_idle_enter+0x1dc/0x1fc
[   43.279462]  [] cpuidle_enter_state+0x10e/0x300
[   43.287980]  [] cpuidle_enter+0x17/0x20
[   43.295799]  [] call_cpuidle+0x2a/0x40
[   43.303528]  [] cpu_startup_entry+0x2dd/0x3c0
[   43.304279] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[   43.306109] sd 1:0:0:0: [sdb] Stopping disk
[   43.325777]  [] start_secondary+0x13a/0x150
[   43.333868] handlers:
[   43.338471] [] azx_interrupt
[   43.345298] Disabling IRQ #16
[   43.848595] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   43.858068] sd 0:0:0:0: [sda] Stopping disk
[   44.052892] pcieport :00:04.0: System wakeup enabled by ACPI
[   44.075715] ACPI: Preparing to enter system sleep state S5
[   44.083818] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query honored via cmdline
[   44.092258] reboot: Power down
[   44.096326] acpi_power_off called

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Thomas Gleixner
On Fri, 17 Jun 2016, Borislav Petkov wrote:
> look what I've found this morning during resume:
> 
> [   45.746236] BUG: unable to handle kernel done.
> [   45.752542] NULL pointer dereference at 0001
> [   45.752544] IP: [<0001>] 0x1

> static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
> {
> cycle_t cycle_now, delta;
> 
> /* read clocksource */
> cycle_now = tkr->read(tkr->clock);
>   ^^
> 
> Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or
> it being uninitialized after suspend?

It must be initialized otherwise you won't reach suspend. I have no idea how
that can happen.

Thanks,

tglx


Re: ktime_get_ts64() splat during resume

2016-06-17 Thread Thomas Gleixner
On Fri, 17 Jun 2016, Borislav Petkov wrote:
> look what I've found this morning during resume:
> 
> [   45.746236] BUG: unable to handle kernel done.
> [   45.752542] NULL pointer dereference at 0001
> [   45.752544] IP: [<0001>] 0x1

> static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
> {
> cycle_t cycle_now, delta;
> 
> /* read clocksource */
> cycle_now = tkr->read(tkr->clock);
>   ^^
> 
> Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or
> it being uninitialized after suspend?

It must be initialized otherwise you won't reach suspend. I have no idea how
that can happen.

Thanks,

tglx


ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
Hi guys,

look what I've found this morning during resume:

[   45.732934] PM: Image restored successfully.
[   45.738064] PM: Basic memory bitmaps freed
[   45.742914] Restarting tasks ... 
[   45.746236] BUG: unable to handle kernel done.
[   45.752542] NULL pointer dereference at 0001
[   45.752544] IP: [<0001>] 0x1
[   45.752547] PGD 37922067 PUD 3791a067 PMD 0 
[   45.752548] Oops: 0010 [#1] PREEMPT SMP
[   45.752557] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod 
edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd amd_iommu_v2 irqbypass 
crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper 
cryptd fam15h_power k10temp acpi_cpufreq
[   45.752559] CPU: 2 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1
[   45.752560] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   45.752560] task: 88042b958000 ti: 88042b954000 task.ti: 
88042b954000
[   45.752562] RIP: 0010:[<0001>]  [<0001>] 0x1
[   45.752562] RSP: 0018:88042b957e50  EFLAGS: 00010246
[   45.752563] RAX:  RBX: 81181e1e RCX: fdfe
[   45.752564] RDX: 88042b958000 RSI: 88042b954000 RDI: 8168aeac
[   45.752564] RBP: 0430 R08:  R09: 0002
[   45.752565] R10:  R11: 0001 R12: ff9c0002
[   45.752565] R13: 88042b09a300 R14: 811782bf R15: 0011
[   45.752566] FS:  7f9d470ab800() GS:88043dc8() 
knlGS:
[   45.752567] CS:  0010 DS:  ES:  CR0: 80050033
[   45.752567] CR2: 0001 CR3: 37923000 CR4: 000406e0
[   45.752568] Stack:
[   45.752569]  1180  88042b957e88 
810bb33a
[   45.752570]  fdfe 88042b957f00 7ffcfa91b1f0 
88042b957ee8
[   45.752571]  81185fc7 0003 21b8e1cd 
0003
[   45.752571] Call Trace:
[   45.752576]  [] ? ktime_get_ts64+0x4a/0xf0
[   45.752578]  [] ? poll_select_copy_remaining+0xe7/0x130
[   45.752581]  [] ? exit_to_usermode_loop+0x8a/0xb0
[   45.752582]  [] ? syscall_return_slowpath+0x5b/0x70
[   45.752584]  [] ? entry_SYSCALL_64_fastpath+0xa5/0xa7
[   45.752587] Code:  Bad RIP value.
[   45.752588] RIP  [<0001>] 0x1
[   45.752589]  RSP 
[   45.752589] CR2: 0001
[   45.752597] ---[ end trace 5334fe9eec2bfca9 ]---
[   45.752737] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0009
[   45.752737] 
[   45.758773] Kernel Offset: disabled

rIP is gone.

That's rc3+tip/master from Wed:

commit 043ae67b6fdd9db7fb54564124d2f2fa833c5ee6 (refs/remotes/tip/master)
Merge: 0bbd6fbba8c7 00688272157d
Author: Ingo Molnar 
Date:   Wed Jun 15 12:53:02 2016 +0200

Merge branch 'x86/platform'

The top function in the stacktrace is:

[   45.752576]  [] ? ktime_get_ts64+0x4a/0xf0

and that address is:

810bb2f0 :
810bb2f0:   e8 6b 21 5d 00  callq  8168d460 
<__fentry__>
810bb2f5:   55  push   %rbp
810bb2f6:   8b 05 14 52 bf 00   mov0xbf5214(%rip),%eax  
  # 81cb0510 
810bb2fc:   48 89 e5mov%rsp,%rbp
810bb2ff:   41 55   push   %r13
810bb301:   85 c0   test   %eax,%eax
810bb303:   41 54   push   %r12
810bb305:   53  push   %rbx
810bb306:   48 89 fbmov%rdi,%rbx
810bb309:   0f 85 b6 00 00 00   jne810bb3c5 

810bb30f:   45 31 e4xor%r12d,%r12d
810bb312:   44 8b 2d a7 a9 f5 00mov0xf5a9a7(%rip),%r13d 
   # 82015cc0 
810bb319:   41 f6 c5 01 test   $0x1,%r13b
810bb31d:   0f 85 9b 00 00 00   jne810bb3be 

810bb323:   48 8b 05 0e aa f5 00mov0xf5aa0e(%rip),%rax  
  # 82015d38 
810bb32a:   48 89 03mov%rax,(%rbx)
810bb32d:   48 8b 3d 94 a9 f5 00mov0xf5a994(%rip),%rdi  
  # 82015cc8 
810bb334:   ff 15 96 a9 f5 00   callq  *0xf5a996(%rip)# 
82015cd0 
810bb33a:   48 2b 05 9f a9 f5 00sub0xf5a99f(%rip),%rax  
  # 82015ce0  <--- HERE
810bb341:   48 8b 15 90 a9 f5 00mov0xf5a990(%rip),%rdx  
  # 82015cd8 
810bb348:   44 8b 05 99 a9 f5 00mov0xf5a999(%rip),%r8d  
  # 82015ce8 
810bb34f:   48 8b 35 9a a9 f5 00mov0xf5a99a(%rip),%rsi  
  # 82015cf0 

ktime_get_ts64() splat during resume

2016-06-17 Thread Borislav Petkov
Hi guys,

look what I've found this morning during resume:

[   45.732934] PM: Image restored successfully.
[   45.738064] PM: Basic memory bitmaps freed
[   45.742914] Restarting tasks ... 
[   45.746236] BUG: unable to handle kernel done.
[   45.752542] NULL pointer dereference at 0001
[   45.752544] IP: [<0001>] 0x1
[   45.752547] PGD 37922067 PUD 3791a067 PMD 0 
[   45.752548] Oops: 0010 [#1] PREEMPT SMP
[   45.752557] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod 
edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd amd_iommu_v2 irqbypass 
crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper 
cryptd fam15h_power k10temp acpi_cpufreq
[   45.752559] CPU: 2 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1
[   45.752560] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   45.752560] task: 88042b958000 ti: 88042b954000 task.ti: 
88042b954000
[   45.752562] RIP: 0010:[<0001>]  [<0001>] 0x1
[   45.752562] RSP: 0018:88042b957e50  EFLAGS: 00010246
[   45.752563] RAX:  RBX: 81181e1e RCX: fdfe
[   45.752564] RDX: 88042b958000 RSI: 88042b954000 RDI: 8168aeac
[   45.752564] RBP: 0430 R08:  R09: 0002
[   45.752565] R10:  R11: 0001 R12: ff9c0002
[   45.752565] R13: 88042b09a300 R14: 811782bf R15: 0011
[   45.752566] FS:  7f9d470ab800() GS:88043dc8() 
knlGS:
[   45.752567] CS:  0010 DS:  ES:  CR0: 80050033
[   45.752567] CR2: 0001 CR3: 37923000 CR4: 000406e0
[   45.752568] Stack:
[   45.752569]  1180  88042b957e88 
810bb33a
[   45.752570]  fdfe 88042b957f00 7ffcfa91b1f0 
88042b957ee8
[   45.752571]  81185fc7 0003 21b8e1cd 
0003
[   45.752571] Call Trace:
[   45.752576]  [] ? ktime_get_ts64+0x4a/0xf0
[   45.752578]  [] ? poll_select_copy_remaining+0xe7/0x130
[   45.752581]  [] ? exit_to_usermode_loop+0x8a/0xb0
[   45.752582]  [] ? syscall_return_slowpath+0x5b/0x70
[   45.752584]  [] ? entry_SYSCALL_64_fastpath+0xa5/0xa7
[   45.752587] Code:  Bad RIP value.
[   45.752588] RIP  [<0001>] 0x1
[   45.752589]  RSP 
[   45.752589] CR2: 0001
[   45.752597] ---[ end trace 5334fe9eec2bfca9 ]---
[   45.752737] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0009
[   45.752737] 
[   45.758773] Kernel Offset: disabled

rIP is gone.

That's rc3+tip/master from Wed:

commit 043ae67b6fdd9db7fb54564124d2f2fa833c5ee6 (refs/remotes/tip/master)
Merge: 0bbd6fbba8c7 00688272157d
Author: Ingo Molnar 
Date:   Wed Jun 15 12:53:02 2016 +0200

Merge branch 'x86/platform'

The top function in the stacktrace is:

[   45.752576]  [] ? ktime_get_ts64+0x4a/0xf0

and that address is:

810bb2f0 :
810bb2f0:   e8 6b 21 5d 00  callq  8168d460 
<__fentry__>
810bb2f5:   55  push   %rbp
810bb2f6:   8b 05 14 52 bf 00   mov0xbf5214(%rip),%eax  
  # 81cb0510 
810bb2fc:   48 89 e5mov%rsp,%rbp
810bb2ff:   41 55   push   %r13
810bb301:   85 c0   test   %eax,%eax
810bb303:   41 54   push   %r12
810bb305:   53  push   %rbx
810bb306:   48 89 fbmov%rdi,%rbx
810bb309:   0f 85 b6 00 00 00   jne810bb3c5 

810bb30f:   45 31 e4xor%r12d,%r12d
810bb312:   44 8b 2d a7 a9 f5 00mov0xf5a9a7(%rip),%r13d 
   # 82015cc0 
810bb319:   41 f6 c5 01 test   $0x1,%r13b
810bb31d:   0f 85 9b 00 00 00   jne810bb3be 

810bb323:   48 8b 05 0e aa f5 00mov0xf5aa0e(%rip),%rax  
  # 82015d38 
810bb32a:   48 89 03mov%rax,(%rbx)
810bb32d:   48 8b 3d 94 a9 f5 00mov0xf5a994(%rip),%rdi  
  # 82015cc8 
810bb334:   ff 15 96 a9 f5 00   callq  *0xf5a996(%rip)# 
82015cd0 
810bb33a:   48 2b 05 9f a9 f5 00sub0xf5a99f(%rip),%rax  
  # 82015ce0  <--- HERE
810bb341:   48 8b 15 90 a9 f5 00mov0xf5a990(%rip),%rdx  
  # 82015cd8 
810bb348:   44 8b 05 99 a9 f5 00mov0xf5a999(%rip),%r8d  
  # 82015ce8 
810bb34f:   48 8b 35 9a a9 f5 00mov0xf5a99a(%rip),%rsi  
  # 82015cf0 
810bb356:   8b 0d 90 a9 f5 00   mov0xf5a990(%rip),%ecx  
  # 82015cec 
810bb35c:   48 8b 3d e5 a9 f5 00