Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during > image restoration > > Logan Gunthorpe reports that hibernation stopped working reliably for > him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table > and rodata). > > That turns out to be a consequence of a long-standing issue with the > 64-bit image restoration code on x86, which is that the temporary > page tables set up by it to avoid page tables corruption when the > last bits of the image kernel's memory contents are copied into > their original page frames re-use the boot kernel's text mapping, > but that mapping may very well get corrupted just like any other > part of the page tables. Of course, if that happens, the final > jump to the image kernel's entry point will go to nowhere. > > The exact reason why commit ab76f7b4ab23 matters here is that it > sometimes causes a PMD of a large page to be split into PTEs > that are allocated dynamically and get corrupted during image > restoration as described above. > > To fix that issue note that the code copying the last bits of the > image kernel's memory contents to the page frames occupied by them > previoulsy doesn't use the kernel text mapping, because it runs from > a special page covered by the identity mapping set up for that code > from scratch. Hence, the kernel text mapping is only needed before > that code starts to run and then it will only be used just for the > final jump to the image kernel's entry point. > > Accordingly, the temporary page tables set up in swsusp_arch_resume() > on x86-64 need to contain the kernel text mapping too. That mapping > is only going to be used for the final jump to the image kernel, so > it only needs to cover the image kernel's entry point, because the > first thing the image kernel does after getting control back is to > switch over to its own original page tables. Moreover, the virtual > address of the image kernel's entry point in that mapping has to be > the same as the one mapped by the image kernel's page tables. > > With that in mind, modify the x86-64's arch_hibernation_header_save() > and arch_hibernation_header_restore() routines to pass the physical > address of the image kernel's entry point (in addition to its virtual > address) to the boot kernel (a small piece of assembly code involved > in passing the entry point's virtual address to the image kernel is > not necessary any more after that, so drop it). Update RESTORE_MAGIC > too to reflect the image header format change. > > Next, in set_up_temporary_mappings(), use the physical and virtual > addresses of the image kernel's entry point passed in the image > header to set up a minimum kernel text mapping (using memory pages > that won't be overwritten by the image kernel's memory contents) that > will map those addresses to each other as appropriate. > > This makes the concern about the possible corruption of the original > boot kernel text mapping go away and if the the minimum kernel text > mapping used for the final jump marks the image kernel's entry point > memory as executable, the jump to it is guaraneed to succeed. > > Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) > Link: http://marc.info/?l=linux-pm=146372852823760=2 > Reported-by: Logan Gunthorpe > Signed-off-by: Rafael J. Wysocki > --- > arch/x86/power/hibernate_64.c | 90 > -- > arch/x86/power/hibernate_asm_64.S | 55 ++- > 2 files changed, 102 insertions(+), 43 deletions(-) Reported-and-tested-by: Borislav Petkov -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during > image restoration > > Logan Gunthorpe reports that hibernation stopped working reliably for > him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table > and rodata). > > That turns out to be a consequence of a long-standing issue with the > 64-bit image restoration code on x86, which is that the temporary > page tables set up by it to avoid page tables corruption when the > last bits of the image kernel's memory contents are copied into > their original page frames re-use the boot kernel's text mapping, > but that mapping may very well get corrupted just like any other > part of the page tables. Of course, if that happens, the final > jump to the image kernel's entry point will go to nowhere. > > The exact reason why commit ab76f7b4ab23 matters here is that it > sometimes causes a PMD of a large page to be split into PTEs > that are allocated dynamically and get corrupted during image > restoration as described above. > > To fix that issue note that the code copying the last bits of the > image kernel's memory contents to the page frames occupied by them > previoulsy doesn't use the kernel text mapping, because it runs from > a special page covered by the identity mapping set up for that code > from scratch. Hence, the kernel text mapping is only needed before > that code starts to run and then it will only be used just for the > final jump to the image kernel's entry point. > > Accordingly, the temporary page tables set up in swsusp_arch_resume() > on x86-64 need to contain the kernel text mapping too. That mapping > is only going to be used for the final jump to the image kernel, so > it only needs to cover the image kernel's entry point, because the > first thing the image kernel does after getting control back is to > switch over to its own original page tables. Moreover, the virtual > address of the image kernel's entry point in that mapping has to be > the same as the one mapped by the image kernel's page tables. > > With that in mind, modify the x86-64's arch_hibernation_header_save() > and arch_hibernation_header_restore() routines to pass the physical > address of the image kernel's entry point (in addition to its virtual > address) to the boot kernel (a small piece of assembly code involved > in passing the entry point's virtual address to the image kernel is > not necessary any more after that, so drop it). Update RESTORE_MAGIC > too to reflect the image header format change. > > Next, in set_up_temporary_mappings(), use the physical and virtual > addresses of the image kernel's entry point passed in the image > header to set up a minimum kernel text mapping (using memory pages > that won't be overwritten by the image kernel's memory contents) that > will map those addresses to each other as appropriate. > > This makes the concern about the possible corruption of the original > boot kernel text mapping go away and if the the minimum kernel text > mapping used for the final jump marks the image kernel's entry point > memory as executable, the jump to it is guaraneed to succeed. > > Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) > Link: http://marc.info/?l=linux-pm=146372852823760=2 > Reported-by: Logan Gunthorpe > Signed-off-by: Rafael J. Wysocki > --- > arch/x86/power/hibernate_64.c | 90 > -- > arch/x86/power/hibernate_asm_64.S | 55 ++- > 2 files changed, 102 insertions(+), 43 deletions(-) Reported-and-tested-by: Borislav Petkov -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
[PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: > On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpewrote: > > Hey Rafael, > > > > This patch appears to be working on my laptop. Thanks. > > Same for me: resume still works with KASLR in my tests too. Unfortunately, Boris still sees post-resume memory corruption with the patch you have tested, but that turns out to be a result of some super-weird corruption of a pointer on the stack which leads to a store at a wrong address (and there's no way it can be related to the rest of the patch). We have verified that it can be avoided by rearranging set_up_temporary_text_mapping() to use fewer local variables and the appended patch contains this modification. I also went on and changed relocate_restore_code() slightly in a similar fashion, but all of those changes should not affect the behavior (unless there's something insane going on behind the curtains, that is). Kees, Logan, Boris, please try this one and let me know if it works for you. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the image kernel's memory contents) that will map those addresses to each other as appropriate. This makes the concern about the possible corruption of the original boot kernel text mapping go away and if the the minimum kernel text mapping used for the final jump marks the image kernel's entry point memory as executable, the jump to it is guaraneed to succeed. Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) Link: http://marc.info/?l=linux-pm=146372852823760=2 Reported-by: Logan Gunthorpe Signed-off-by: Rafael J. Wysocki --- arch/x86/power/hibernate_64.c | 90 -- arch/x86/power/hibernate_asm_64.S | 55 ++- 2 files changed, 102 insertions(+), 43 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_ * kernel's text (this value is passed in the image header). */ unsigned long restore_jump_address __visible; +unsigned long
[PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: > On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe wrote: > > Hey Rafael, > > > > This patch appears to be working on my laptop. Thanks. > > Same for me: resume still works with KASLR in my tests too. Unfortunately, Boris still sees post-resume memory corruption with the patch you have tested, but that turns out to be a result of some super-weird corruption of a pointer on the stack which leads to a store at a wrong address (and there's no way it can be related to the rest of the patch). We have verified that it can be avoided by rearranging set_up_temporary_text_mapping() to use fewer local variables and the appended patch contains this modification. I also went on and changed relocate_restore_code() slightly in a similar fashion, but all of those changes should not affect the behavior (unless there's something insane going on behind the curtains, that is). Kees, Logan, Boris, please try this one and let me know if it works for you. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the image kernel's memory contents) that will map those addresses to each other as appropriate. This makes the concern about the possible corruption of the original boot kernel text mapping go away and if the the minimum kernel text mapping used for the final jump marks the image kernel's entry point memory as executable, the jump to it is guaraneed to succeed. Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata) Link: http://marc.info/?l=linux-pm=146372852823760=2 Reported-by: Logan Gunthorpe Signed-off-by: Rafael J. Wysocki --- arch/x86/power/hibernate_64.c | 90 -- arch/x86/power/hibernate_asm_64.S | 55 ++- 2 files changed, 102 insertions(+), 43 deletions(-) Index: linux-pm/arch/x86/power/hibernate_64.c === --- linux-pm.orig/arch/x86/power/hibernate_64.c +++ linux-pm/arch/x86/power/hibernate_64.c @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_ * kernel's text (this value is passed in the image header). */ unsigned long restore_jump_address __visible; +unsigned long jump_address_phys; /* * Value of the cr3 register from before the hibernation (this value is passed @@
Re: ktime_get_ts64() splat during resume
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: > On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpewrote: > > Hey Rafael, > > > > This patch appears to be working on my laptop. Thanks. > > Same for me: resume still works with KASLR in my tests too. Thanks for the confirmation! Rafael
Re: ktime_get_ts64() splat during resume
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote: > On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe wrote: > > Hey Rafael, > > > > This patch appears to be working on my laptop. Thanks. > > Same for me: resume still works with KASLR in my tests too. Thanks for the confirmation! Rafael
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpewrote: > Hey Rafael, > > This patch appears to be working on my laptop. Thanks. Same for me: resume still works with KASLR in my tests too. -Kees -- Kees Cook Chrome OS & Brillo Security
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe wrote: > Hey Rafael, > > This patch appears to be working on my laptop. Thanks. Same for me: resume still works with KASLR in my tests too. -Kees -- Kees Cook Chrome OS & Brillo Security
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote: > Hey Rafael, Hi, > This patch appears to be working on my laptop. Thanks. Thanks for the confirmation! Rafael
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote: > Hey Rafael, Hi, > This patch appears to be working on my laptop. Thanks. Thanks for the confirmation! Rafael
Re: ktime_get_ts64() splat during resume
Hey Rafael, This patch appears to be working on my laptop. Thanks. Logan On 20/06/16 07:22 PM, Rafael J. Wysocki wrote: On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote: On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvaldswrote: On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki wrote: Overall, we seem to be heading towards the "really weird" territory here. So the whole commit that Boris bisected down to is weird to me. Why isn't the temporary text mapping just set up unconditionally in the temp_level4_pgt? Why does it have that insane "let's leave the temp_level4_pgt alone until we actually switch to it, and save away restore_pgd_addr and the restore_pgd, to then be set up at restore time"? All the other temporary mappings are set up statically in the temp_level4_pgt, why not that one? The text mapping in temp_level4_pgt has to map the image kernel's physical entry address to the same virtual address that the image kernel had for it, because the image kernel will switch over to its own page tables first and it will use its own kernel text mapping from that point on. That may not be the same as the text mapping of the (currently running) restore (or "boot") kernel. So if we set up the text mapping in temp_level4_pgt upfront, we basically can't reference the original kernel text (or do any addressing relative to it) any more after switching over to temp_level4_pgt. For some reason I thought that was not doable, but now that I look at the code it looks like it can be done. I'll try doing that. I recalled what the problem was. :-) In principle, the kernel text mapping in the image kernel may be different from the kernel text mapping in the restore ("boot") kernel, but the patch I posted a couple of hours ago actually assumed them to be the same, because it switched to temp_level4_pgt before jumping to the relocated code. To get rid of that implicit assumption it has to switch to temp_level4_pgt from the relocated code itself, but for that to work, the page containing the relocated code must be executable in the original page tables (it isn't usually). So updated patch is appended. Again, it works for me, but I'm wondering about everybody else. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set
Re: ktime_get_ts64() splat during resume
Hey Rafael, This patch appears to be working on my laptop. Thanks. Logan On 20/06/16 07:22 PM, Rafael J. Wysocki wrote: On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote: On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds wrote: On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki wrote: Overall, we seem to be heading towards the "really weird" territory here. So the whole commit that Boris bisected down to is weird to me. Why isn't the temporary text mapping just set up unconditionally in the temp_level4_pgt? Why does it have that insane "let's leave the temp_level4_pgt alone until we actually switch to it, and save away restore_pgd_addr and the restore_pgd, to then be set up at restore time"? All the other temporary mappings are set up statically in the temp_level4_pgt, why not that one? The text mapping in temp_level4_pgt has to map the image kernel's physical entry address to the same virtual address that the image kernel had for it, because the image kernel will switch over to its own page tables first and it will use its own kernel text mapping from that point on. That may not be the same as the text mapping of the (currently running) restore (or "boot") kernel. So if we set up the text mapping in temp_level4_pgt upfront, we basically can't reference the original kernel text (or do any addressing relative to it) any more after switching over to temp_level4_pgt. For some reason I thought that was not doable, but now that I look at the code it looks like it can be done. I'll try doing that. I recalled what the problem was. :-) In principle, the kernel text mapping in the image kernel may be different from the kernel text mapping in the restore ("boot") kernel, but the patch I posted a couple of hours ago actually assumed them to be the same, because it switched to temp_level4_pgt before jumping to the relocated code. To get rid of that implicit assumption it has to switch to temp_level4_pgt from the relocated code itself, but for that to work, the page containing the relocated code must be executable in the original page tables (it isn't usually). So updated patch is appended. Again, it works for me, but I'm wondering about everybody else. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten
Re: ktime_get_ts64() splat during resume
On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote: > On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: > > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds > >wrote: > > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki > > > wrote: > > >> > > >> Overall, we seem to be heading towards the "really weird" territory here. > > > > > > So the whole commit that Boris bisected down to is weird to me. > > > > > > Why isn't the temporary text mapping just set up unconditionally in > > > the temp_level4_pgt? > > > > > > Why does it have that insane "let's leave the temp_level4_pgt alone > > > until we actually switch to it, and save away restore_pgd_addr and the > > > restore_pgd, to then be set up at restore time"? > > > > > > All the other temporary mappings are set up statically in the > > > temp_level4_pgt, why not that one? > > > > The text mapping in temp_level4_pgt has to map the image kernel's > > physical entry address to the same virtual address that the image > > kernel had for it, because the image kernel will switch over to its > > own page tables first and it will use its own kernel text mapping from > > that point on. That may not be the same as the text mapping of the > > (currently running) restore (or "boot") kernel. > > > > So if we set up the text mapping in temp_level4_pgt upfront, we > > basically can't reference the original kernel text (or do any > > addressing relative to it) any more after switching over to > > temp_level4_pgt. > > > > For some reason I thought that was not doable, but now that I look at > > the code it looks like it can be done. I'll try doing that. I recalled what the problem was. :-) In principle, the kernel text mapping in the image kernel may be different from the kernel text mapping in the restore ("boot") kernel, but the patch I posted a couple of hours ago actually assumed them to be the same, because it switched to temp_level4_pgt before jumping to the relocated code. To get rid of that implicit assumption it has to switch to temp_level4_pgt from the relocated code itself, but for that to work, the page containing the relocated code must be executable in the original page tables (it isn't usually). So updated patch is appended. Again, it works for me, but I'm wondering about everybody else. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point
Re: ktime_get_ts64() splat during resume
On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote: > On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: > > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds > > wrote: > > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki > > > wrote: > > >> > > >> Overall, we seem to be heading towards the "really weird" territory here. > > > > > > So the whole commit that Boris bisected down to is weird to me. > > > > > > Why isn't the temporary text mapping just set up unconditionally in > > > the temp_level4_pgt? > > > > > > Why does it have that insane "let's leave the temp_level4_pgt alone > > > until we actually switch to it, and save away restore_pgd_addr and the > > > restore_pgd, to then be set up at restore time"? > > > > > > All the other temporary mappings are set up statically in the > > > temp_level4_pgt, why not that one? > > > > The text mapping in temp_level4_pgt has to map the image kernel's > > physical entry address to the same virtual address that the image > > kernel had for it, because the image kernel will switch over to its > > own page tables first and it will use its own kernel text mapping from > > that point on. That may not be the same as the text mapping of the > > (currently running) restore (or "boot") kernel. > > > > So if we set up the text mapping in temp_level4_pgt upfront, we > > basically can't reference the original kernel text (or do any > > addressing relative to it) any more after switching over to > > temp_level4_pgt. > > > > For some reason I thought that was not doable, but now that I look at > > the code it looks like it can be done. I'll try doing that. I recalled what the problem was. :-) In principle, the kernel text mapping in the image kernel may be different from the kernel text mapping in the restore ("boot") kernel, but the patch I posted a couple of hours ago actually assumed them to be the same, because it switched to temp_level4_pgt before jumping to the relocated code. To get rid of that implicit assumption it has to switch to temp_level4_pgt from the relocated code itself, but for that to work, the page containing the relocated code must be executable in the original page tables (it isn't usually). So updated patch is appended. Again, it works for me, but I'm wondering about everybody else. Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds >wrote: > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki > > wrote: > >> > >> Overall, we seem to be heading towards the "really weird" territory here. > > > > So the whole commit that Boris bisected down to is weird to me. > > > > Why isn't the temporary text mapping just set up unconditionally in > > the temp_level4_pgt? > > > > Why does it have that insane "let's leave the temp_level4_pgt alone > > until we actually switch to it, and save away restore_pgd_addr and the > > restore_pgd, to then be set up at restore time"? > > > > All the other temporary mappings are set up statically in the > > temp_level4_pgt, why not that one? > > The text mapping in temp_level4_pgt has to map the image kernel's > physical entry address to the same virtual address that the image > kernel had for it, because the image kernel will switch over to its > own page tables first and it will use its own kernel text mapping from > that point on. That may not be the same as the text mapping of the > (currently running) restore (or "boot") kernel. > > So if we set up the text mapping in temp_level4_pgt upfront, we > basically can't reference the original kernel text (or do any > addressing relative to it) any more after switching over to > temp_level4_pgt. > > For some reason I thought that was not doable, but now that I look at > the code it looks like it can be done. I'll try doing that. > > > I suspect whatever corruption happens boils down to the same issue > > that made people do that odd decision in the first place. > > The breakage appears to happen regardless of these changes, though. > > > And regardless, those games are too ugly to live. So I would suggest > > that that original commit should just be considered broken, and > > reverted (or just removed - I'm not sure if it's in a stable branch or > > not) and the fix be rethought so that the code mapping can be done > > once and for all and *without* the extra games. > > OK Below is a new simplified version of the offending patch (without the restore_pgd_addr and restore_pgd things). It works for me. :-) Logan, Kees, can you please check if this one works for you too? Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote: > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds > wrote: > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki > > wrote: > >> > >> Overall, we seem to be heading towards the "really weird" territory here. > > > > So the whole commit that Boris bisected down to is weird to me. > > > > Why isn't the temporary text mapping just set up unconditionally in > > the temp_level4_pgt? > > > > Why does it have that insane "let's leave the temp_level4_pgt alone > > until we actually switch to it, and save away restore_pgd_addr and the > > restore_pgd, to then be set up at restore time"? > > > > All the other temporary mappings are set up statically in the > > temp_level4_pgt, why not that one? > > The text mapping in temp_level4_pgt has to map the image kernel's > physical entry address to the same virtual address that the image > kernel had for it, because the image kernel will switch over to its > own page tables first and it will use its own kernel text mapping from > that point on. That may not be the same as the text mapping of the > (currently running) restore (or "boot") kernel. > > So if we set up the text mapping in temp_level4_pgt upfront, we > basically can't reference the original kernel text (or do any > addressing relative to it) any more after switching over to > temp_level4_pgt. > > For some reason I thought that was not doable, but now that I look at > the code it looks like it can be done. I'll try doing that. > > > I suspect whatever corruption happens boils down to the same issue > > that made people do that odd decision in the first place. > > The breakage appears to happen regardless of these changes, though. > > > And regardless, those games are too ugly to live. So I would suggest > > that that original commit should just be considered broken, and > > reverted (or just removed - I'm not sure if it's in a stable branch or > > not) and the fix be rethought so that the code mapping can be done > > once and for all and *without* the extra games. > > OK Below is a new simplified version of the offending patch (without the restore_pgd_addr and restore_pgd things). It works for me. :-) Logan, Kees, can you please check if this one works for you too? Thanks, Rafael --- From: Rafael J. Wysocki Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe reports that hibernation stopped working reliably for him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata). That turns out to be a consequence of a long-standing issue with the 64-bit image restoration code on x86, which is that the temporary page tables set up by it to avoid page tables corruption when the last bits of the image kernel's memory contents are copied into their original page frames re-use the boot kernel's text mapping, but that mapping may very well get corrupted just like any other part of the page tables. Of course, if that happens, the final jump to the image kernel's entry point will go to nowhere. The exact reason why commit ab76f7b4ab23 matters here is that it sometimes causes a PMD of a large page to be split into PTEs that are allocated dynamically and get corrupted during image restoration as described above. To fix that issue note that the code copying the last bits of the image kernel's memory contents to the page frames occupied by them previoulsy doesn't use the kernel text mapping, because it runs from a special page covered by the identity mapping set up for that code from scratch. Hence, the kernel text mapping is only needed before that code starts to run and then it will only be used just for the final jump to the image kernel's entry point. Accordingly, the temporary page tables set up in swsusp_arch_resume() on x86-64 need to contain the kernel text mapping too. That mapping is only going to be used for the final jump to the image kernel, so it only needs to cover the image kernel's entry point, because the first thing the image kernel does after getting control back is to switch over to its own original page tables. Moreover, the virtual address of the image kernel's entry point in that mapping has to be the same as the one mapped by the image kernel's page tables. With that in mind, modify the x86-64's arch_hibernation_header_save() and arch_hibernation_header_restore() routines to pass the physical address of the image kernel's entry point (in addition to its virtual address) to the boot kernel (a small piece of assembly code involved in passing the entry point's virtual address to the image kernel is not necessary any more after that, so drop it). Update RESTORE_MAGIC too to reflect the image header format change. Next, in set_up_temporary_mappings(), use the physical and virtual addresses of the image kernel's entry point passed in the image header to set up a minimum kernel text mapping (using memory pages that won't be overwritten by the
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvaldswrote: > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki wrote: >> >> Overall, we seem to be heading towards the "really weird" territory here. > > So the whole commit that Boris bisected down to is weird to me. > > Why isn't the temporary text mapping just set up unconditionally in > the temp_level4_pgt? > > Why does it have that insane "let's leave the temp_level4_pgt alone > until we actually switch to it, and save away restore_pgd_addr and the > restore_pgd, to then be set up at restore time"? > > All the other temporary mappings are set up statically in the > temp_level4_pgt, why not that one? The text mapping in temp_level4_pgt has to map the image kernel's physical entry address to the same virtual address that the image kernel had for it, because the image kernel will switch over to its own page tables first and it will use its own kernel text mapping from that point on. That may not be the same as the text mapping of the (currently running) restore (or "boot") kernel. So if we set up the text mapping in temp_level4_pgt upfront, we basically can't reference the original kernel text (or do any addressing relative to it) any more after switching over to temp_level4_pgt. For some reason I thought that was not doable, but now that I look at the code it looks like it can be done. I'll try doing that. > I suspect whatever corruption happens boils down to the same issue > that made people do that odd decision in the first place. The breakage appears to happen regardless of these changes, though. > And regardless, those games are too ugly to live. So I would suggest > that that original commit should just be considered broken, and > reverted (or just removed - I'm not sure if it's in a stable branch or > not) and the fix be rethought so that the code mapping can be done > once and for all and *without* the extra games. OK Thanks, Rafael
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds wrote: > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki wrote: >> >> Overall, we seem to be heading towards the "really weird" territory here. > > So the whole commit that Boris bisected down to is weird to me. > > Why isn't the temporary text mapping just set up unconditionally in > the temp_level4_pgt? > > Why does it have that insane "let's leave the temp_level4_pgt alone > until we actually switch to it, and save away restore_pgd_addr and the > restore_pgd, to then be set up at restore time"? > > All the other temporary mappings are set up statically in the > temp_level4_pgt, why not that one? The text mapping in temp_level4_pgt has to map the image kernel's physical entry address to the same virtual address that the image kernel had for it, because the image kernel will switch over to its own page tables first and it will use its own kernel text mapping from that point on. That may not be the same as the text mapping of the (currently running) restore (or "boot") kernel. So if we set up the text mapping in temp_level4_pgt upfront, we basically can't reference the original kernel text (or do any addressing relative to it) any more after switching over to temp_level4_pgt. For some reason I thought that was not doable, but now that I look at the code it looks like it can be done. I'll try doing that. > I suspect whatever corruption happens boils down to the same issue > that made people do that odd decision in the first place. The breakage appears to happen regardless of these changes, though. > And regardless, those games are too ugly to live. So I would suggest > that that original commit should just be considered broken, and > reverted (or just removed - I'm not sure if it's in a stable branch or > not) and the fix be rethought so that the code mapping can be done > once and for all and *without* the extra games. OK Thanks, Rafael
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysockiwrote: > > Overall, we seem to be heading towards the "really weird" territory here. So the whole commit that Boris bisected down to is weird to me. Why isn't the temporary text mapping just set up unconditionally in the temp_level4_pgt? Why does it have that insane "let's leave the temp_level4_pgt alone until we actually switch to it, and save away restore_pgd_addr and the restore_pgd, to then be set up at restore time"? All the other temporary mappings are set up statically in the temp_level4_pgt, why not that one? I suspect whatever corruption happens boils down to the same issue that made people do that odd decision in the first place. And regardless, those games are too ugly to live. So I would suggest that that original commit should just be considered broken, and reverted (or just removed - I'm not sure if it's in a stable branch or not) and the fix be rethought so that the code mapping can be done once and for all and *without* the extra games. Linus
Re: ktime_get_ts64() splat during resume
On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki wrote: > > Overall, we seem to be heading towards the "really weird" territory here. So the whole commit that Boris bisected down to is weird to me. Why isn't the temporary text mapping just set up unconditionally in the temp_level4_pgt? Why does it have that insane "let's leave the temp_level4_pgt alone until we actually switch to it, and save away restore_pgd_addr and the restore_pgd, to then be set up at restore time"? All the other temporary mappings are set up statically in the temp_level4_pgt, why not that one? I suspect whatever corruption happens boils down to the same issue that made people do that odd decision in the first place. And regardless, those games are too ugly to live. So I would suggest that that original commit should just be considered broken, and reverted (or just removed - I'm not sure if it's in a stable branch or not) and the fix be rethought so that the code mapping can be done once and for all and *without* the extra games. Linus
Re: ktime_get_ts64() splat during resume
On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote: > On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkovwrote: > > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: > >> A couple of questions: > >> - I guess this is reproducible 100% of the time? > > > > Yap. > > > > I took latest Linus + tip/master which has your commit. > > > >> - If you do "echo disk > /sys/power/state" instead of using s2disk, > >> does it still crash in the same way? > > > > My suspend to disk script does: > > > > echo 3 > /proc/sys/vm/drop_caches > > echo "shutdown" > /sys/power/disk > > echo "disk" > /sys/power/state > > > > I don't use anything else for years now. > > > >> - Are both the image and boot kernels the same binary? > > > > Yep. > > OK, we need to find out what's wrong, then. Boris is traveling this week, so the investigation will continue when he's back, but we spent quite some time on this during the weekend, so below is a summary of what we found plus some comments. Overall, we seem to be heading towards the "really weird" territory here. > First, please revert the changes in hibernate_asm_64.S alone and see > if that makes any difference. > > Hibernation should still work then most of the time, but the bug fixed > by this commit will be back. First of all, reverting the changes in hibernate_asm_64.S doesn't make the breakage go away, which means that the most essential changes in "x86/power/64: Fix crash whan the hibernation code passes control to the image kernel" don't make a difference. Moreover, after some back-and-forth we narrowed down the memory corruption to a single line in the (new) prepare_temporary_text_mapping() function and it turned out that it is triggered by scribbling on a page returned by get_safe_page(), immediately after obtaining it. Memory is corrupted no matter how the page is written to. In particular, filling that page with all ones triggers memory corruption later, for example. So appended is the last tested patch sufficient to trigger the breakage on the Boris' system (modulo some comments). Quite evidently, the breakage is triggered by the memset() line in prepare_temporary_text_mapping(). In addition to the above, there are some interesting observations from the investigation so far. First, what is corrupted is the image kernel memory. This means that the line in question scribbles on image data (which are stored in memory at that point) somewhere. Nevertheless, the image kernel is evidently able to continue after it has received control and carry out the full device resume up to the point in which user space is thawed and then it crashes as a result of page tables corruption. How those page tables are corrupted depends on what is written to the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course, that write happens in the "boot" or "restore" kernel, before jumping to the image one). The above is what we know and now conclusions that it seems to lead to. Generally, I see two possible scenarios here. One is that the changes not directly related to prepare_temporary_text_mapping() in the patch below, eg. the changes in arch_hibernation_header_save/restore() (and related), make a difference, but that doesn't look likely to me. In any case, that will be the first thing to try next, as it is relatively easy. The second scenario is fundamentally less attractive, because it means that the address we end up with in pmd in prepare_temporary_text_mapping() is bogus. My first reaction to that option would be "Well, what if get_safe_page() is broken somehow?", but that's not so simple. Namely, for the observed corruption to happen, get_safe_page() would need to return a page that (1) had already been allocated once before and (2) such that image data had been written to it already. Still, the hibernation core only writes image data to pages that have been explicitly allocated by it and it never frees them individually (in particular, they are never freed before jumping to the image kernel at all). It sets two bits in two different bitmaps for every page allocated by it and checks those two bits every time before writing image data to any page. One may think "Well, maybe the bitmaps don't work correctly then and the bits are set when they shouldn't be sometimes?", but the problem with that line of reasoning is that get_safe_page() checks one of those very bits before returning a page and the page is only returned if that bit is clear. Thus, even if the bits were set when they shouldn't be sometimes, get_safe_page() would still not return such a page to the caller. The next question to ask might be "What if, horror shock, get_zeroed_page() called by get_safe_page() returns a page that has been allocated already?", but then the bitmap check mentioned above would still prevent get_safe_page() from returning that page (the bit in question would be set for it). Hence, the only way get_safe_page() might return a
Re: ktime_get_ts64() splat during resume
On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote: > On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov wrote: > > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: > >> A couple of questions: > >> - I guess this is reproducible 100% of the time? > > > > Yap. > > > > I took latest Linus + tip/master which has your commit. > > > >> - If you do "echo disk > /sys/power/state" instead of using s2disk, > >> does it still crash in the same way? > > > > My suspend to disk script does: > > > > echo 3 > /proc/sys/vm/drop_caches > > echo "shutdown" > /sys/power/disk > > echo "disk" > /sys/power/state > > > > I don't use anything else for years now. > > > >> - Are both the image and boot kernels the same binary? > > > > Yep. > > OK, we need to find out what's wrong, then. Boris is traveling this week, so the investigation will continue when he's back, but we spent quite some time on this during the weekend, so below is a summary of what we found plus some comments. Overall, we seem to be heading towards the "really weird" territory here. > First, please revert the changes in hibernate_asm_64.S alone and see > if that makes any difference. > > Hibernation should still work then most of the time, but the bug fixed > by this commit will be back. First of all, reverting the changes in hibernate_asm_64.S doesn't make the breakage go away, which means that the most essential changes in "x86/power/64: Fix crash whan the hibernation code passes control to the image kernel" don't make a difference. Moreover, after some back-and-forth we narrowed down the memory corruption to a single line in the (new) prepare_temporary_text_mapping() function and it turned out that it is triggered by scribbling on a page returned by get_safe_page(), immediately after obtaining it. Memory is corrupted no matter how the page is written to. In particular, filling that page with all ones triggers memory corruption later, for example. So appended is the last tested patch sufficient to trigger the breakage on the Boris' system (modulo some comments). Quite evidently, the breakage is triggered by the memset() line in prepare_temporary_text_mapping(). In addition to the above, there are some interesting observations from the investigation so far. First, what is corrupted is the image kernel memory. This means that the line in question scribbles on image data (which are stored in memory at that point) somewhere. Nevertheless, the image kernel is evidently able to continue after it has received control and carry out the full device resume up to the point in which user space is thawed and then it crashes as a result of page tables corruption. How those page tables are corrupted depends on what is written to the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course, that write happens in the "boot" or "restore" kernel, before jumping to the image one). The above is what we know and now conclusions that it seems to lead to. Generally, I see two possible scenarios here. One is that the changes not directly related to prepare_temporary_text_mapping() in the patch below, eg. the changes in arch_hibernation_header_save/restore() (and related), make a difference, but that doesn't look likely to me. In any case, that will be the first thing to try next, as it is relatively easy. The second scenario is fundamentally less attractive, because it means that the address we end up with in pmd in prepare_temporary_text_mapping() is bogus. My first reaction to that option would be "Well, what if get_safe_page() is broken somehow?", but that's not so simple. Namely, for the observed corruption to happen, get_safe_page() would need to return a page that (1) had already been allocated once before and (2) such that image data had been written to it already. Still, the hibernation core only writes image data to pages that have been explicitly allocated by it and it never frees them individually (in particular, they are never freed before jumping to the image kernel at all). It sets two bits in two different bitmaps for every page allocated by it and checks those two bits every time before writing image data to any page. One may think "Well, maybe the bitmaps don't work correctly then and the bits are set when they shouldn't be sometimes?", but the problem with that line of reasoning is that get_safe_page() checks one of those very bits before returning a page and the page is only returned if that bit is clear. Thus, even if the bits were set when they shouldn't be sometimes, get_safe_page() would still not return such a page to the caller. The next question to ask might be "What if, horror shock, get_zeroed_page() called by get_safe_page() returns a page that has been allocated already?", but then the bitmap check mentioned above would still prevent get_safe_page() from returning that page (the bit in question would be set for it). Hence, the only way get_safe_page() might return a wrong page in
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 04:17:13 PM chenyu wrote: > On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysockiwrote: > > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov wrote: > >> Ok, > >> > >> bisect is done, full log below. > >> > >> Rafael, that fix > >> > >> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes > >> control to the image kernel") > >> > >> breaks s2disk here. It explodes during resume and a statically allocated > >> struct's member is NULL. See > >> > >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic > >> > >> for the splat and some debugging attempts. > >> > >> Reverting 70595b479ce1 fixes the issue here. > > > > Quite evidently, memory is corrupted in the image kernel, but this > > particular commit only affects the boot kernel, so it can't really > > corrupt anything in the image one. > > > In previous patch, > before we jump to the new kernel entry, we add the > text mapping to temp_level4_pgt, > > /* switch over to the temporary kernel text mapping */ > movq%r8, (%r9) > If I understand correctly, r9 contains the virtual address > of restore_pgd_addr, since the page table for restore_pgd_addr might be > incoherent across hibernation(as NX patch changes the kernel text mapping > to dynamically mapping), so we might write pmd entry to an incorrect place in > temp_level4_pgt? I'm not sure what you mean. r9 contains the address previously stored in restore_pgd_addr. This is the address of the cell in (the page pointed to by temp_level4_pgt) to write the new pgd entry to. Why do you think it may be incorrect? In any case, the corruption reported by Boris also happens if all of the assembly changes in commit 70595b479ce1 are reverted, in which case the code should work as before that commit modulo some additional actions that shouldn't matter. Those additional actions turn out to matter, though. I'll write about that in detail shortly. Thanks, Rafael
Re: ktime_get_ts64() splat during resume
On Monday, June 20, 2016 04:17:13 PM chenyu wrote: > On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki wrote: > > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov wrote: > >> Ok, > >> > >> bisect is done, full log below. > >> > >> Rafael, that fix > >> > >> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes > >> control to the image kernel") > >> > >> breaks s2disk here. It explodes during resume and a statically allocated > >> struct's member is NULL. See > >> > >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic > >> > >> for the splat and some debugging attempts. > >> > >> Reverting 70595b479ce1 fixes the issue here. > > > > Quite evidently, memory is corrupted in the image kernel, but this > > particular commit only affects the boot kernel, so it can't really > > corrupt anything in the image one. > > > In previous patch, > before we jump to the new kernel entry, we add the > text mapping to temp_level4_pgt, > > /* switch over to the temporary kernel text mapping */ > movq%r8, (%r9) > If I understand correctly, r9 contains the virtual address > of restore_pgd_addr, since the page table for restore_pgd_addr might be > incoherent across hibernation(as NX patch changes the kernel text mapping > to dynamically mapping), so we might write pmd entry to an incorrect place in > temp_level4_pgt? I'm not sure what you mean. r9 contains the address previously stored in restore_pgd_addr. This is the address of the cell in (the page pointed to by temp_level4_pgt) to write the new pgd entry to. Why do you think it may be incorrect? In any case, the corruption reported by Boris also happens if all of the assembly changes in commit 70595b479ce1 are reverted, in which case the code should work as before that commit modulo some additional actions that shouldn't matter. Those additional actions turn out to matter, though. I'll write about that in detail shortly. Thanks, Rafael
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysockiwrote: > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov wrote: >> Ok, >> >> bisect is done, full log below. >> >> Rafael, that fix >> >> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes >> control to the image kernel") >> >> breaks s2disk here. It explodes during resume and a statically allocated >> struct's member is NULL. See >> >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic >> >> for the splat and some debugging attempts. >> >> Reverting 70595b479ce1 fixes the issue here. > > Quite evidently, memory is corrupted in the image kernel, but this > particular commit only affects the boot kernel, so it can't really > corrupt anything in the image one. > In previous patch, before we jump to the new kernel entry, we add the text mapping to temp_level4_pgt, /* switch over to the temporary kernel text mapping */ movq%r8, (%r9) If I understand correctly, r9 contains the virtual address of restore_pgd_addr, since the page table for restore_pgd_addr might be incoherent across hibernation(as NX patch changes the kernel text mapping to dynamically mapping), so we might write pmd entry to an incorrect place in temp_level4_pgt? Yu
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki wrote: > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov wrote: >> Ok, >> >> bisect is done, full log below. >> >> Rafael, that fix >> >> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes >> control to the image kernel") >> >> breaks s2disk here. It explodes during resume and a statically allocated >> struct's member is NULL. See >> >> https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic >> >> for the splat and some debugging attempts. >> >> Reverting 70595b479ce1 fixes the issue here. > > Quite evidently, memory is corrupted in the image kernel, but this > particular commit only affects the boot kernel, so it can't really > corrupt anything in the image one. > In previous patch, before we jump to the new kernel entry, we add the text mapping to temp_level4_pgt, /* switch over to the temporary kernel text mapping */ movq%r8, (%r9) If I understand correctly, r9 contains the virtual address of restore_pgd_addr, since the page table for restore_pgd_addr might be incoherent across hibernation(as NX patch changes the kernel text mapping to dynamically mapping), so we might write pmd entry to an incorrect place in temp_level4_pgt? Yu
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysockiwrote: > On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov wrote: >> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: >>> A couple of questions: >>> - I guess this is reproducible 100% of the time? >> >> Yap. >> >> I took latest Linus + tip/master which has your commit. >> >>> - If you do "echo disk > /sys/power/state" instead of using s2disk, >>> does it still crash in the same way? >> >> My suspend to disk script does: >> >> echo 3 > /proc/sys/vm/drop_caches >> echo "shutdown" > /sys/power/disk >> echo "disk" > /sys/power/state >> >> I don't use anything else for years now. >> >>> - Are both the image and boot kernels the same binary? >> >> Yep. > > OK, we need to find out what's wrong, then. > > First, please revert the changes in hibernate_asm_64.S alone and see > if that makes any difference. > > Hibernation should still work then most of the time, but the bug fixed > by this commit will be back. Due to the nature of the memory corruption you are seeing (the same address appears to be corrupted every time in the same way) with 100% reproducibility and due to the fact that new code added by the commit in question only writes to dynamically allocated memory (and you've already verified that https://patchwork.kernel.org/patch/9185165/ doesn't help), it is quite unlikely that the memory corruption comes from that commit itself. However, I see a couple of ways in which that commit might uncover a latent bug. First, it changed the layout of the kernel text by adding the PAGE_SIZE alignment of restore_registers(). That likely pushed stuff behind it to new offsets, possibly including the static struct field that is now corrupted. Now, say that the memory corruption has always happened at the same memory location, but previously there was nothing in there or whatever was in there, wasn't vital. In that case the memory corruption might have gone unnoticed until the commit in question caused things to move to new locations and the corrupted location contains a vital piece of data now. This is my current theory. Second, it added two invocations of get_safe_page() that in theory might push things a bit too far towards the limit and they started to break there. I don't see how that can happen ATM, but I'm not excluding this possibility yet. It seems, though, that in that case the corruption would be more random and I certainly wouldn't expect it to happen at the same location every time. One more indicator is that multiple people reported success with that commit and in many hibernation runs, so the problem appears to be very specific to your system and/or kernel configuration. It also is interesting that the memory corruption only becomes visible during the thawing of tasks and given the piece of data that is corrupted, it should become visible much earlier if the memory was corrupted during image restoration by the boot kernel. In any case, reverting the changes in hibernate_asm_64.S alone should show us the direction, but if it makes things work again, I would try to change the restore_registers() alignment to something smaller, like 64 (which should be safe) and see what happens then.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki wrote: > On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov wrote: >> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: >>> A couple of questions: >>> - I guess this is reproducible 100% of the time? >> >> Yap. >> >> I took latest Linus + tip/master which has your commit. >> >>> - If you do "echo disk > /sys/power/state" instead of using s2disk, >>> does it still crash in the same way? >> >> My suspend to disk script does: >> >> echo 3 > /proc/sys/vm/drop_caches >> echo "shutdown" > /sys/power/disk >> echo "disk" > /sys/power/state >> >> I don't use anything else for years now. >> >>> - Are both the image and boot kernels the same binary? >> >> Yep. > > OK, we need to find out what's wrong, then. > > First, please revert the changes in hibernate_asm_64.S alone and see > if that makes any difference. > > Hibernation should still work then most of the time, but the bug fixed > by this commit will be back. Due to the nature of the memory corruption you are seeing (the same address appears to be corrupted every time in the same way) with 100% reproducibility and due to the fact that new code added by the commit in question only writes to dynamically allocated memory (and you've already verified that https://patchwork.kernel.org/patch/9185165/ doesn't help), it is quite unlikely that the memory corruption comes from that commit itself. However, I see a couple of ways in which that commit might uncover a latent bug. First, it changed the layout of the kernel text by adding the PAGE_SIZE alignment of restore_registers(). That likely pushed stuff behind it to new offsets, possibly including the static struct field that is now corrupted. Now, say that the memory corruption has always happened at the same memory location, but previously there was nothing in there or whatever was in there, wasn't vital. In that case the memory corruption might have gone unnoticed until the commit in question caused things to move to new locations and the corrupted location contains a vital piece of data now. This is my current theory. Second, it added two invocations of get_safe_page() that in theory might push things a bit too far towards the limit and they started to break there. I don't see how that can happen ATM, but I'm not excluding this possibility yet. It seems, though, that in that case the corruption would be more random and I certainly wouldn't expect it to happen at the same location every time. One more indicator is that multiple people reported success with that commit and in many hibernation runs, so the problem appears to be very specific to your system and/or kernel configuration. It also is interesting that the memory corruption only becomes visible during the thawing of tasks and given the piece of data that is corrupted, it should become visible much earlier if the memory was corrupted during image restoration by the boot kernel. In any case, reverting the changes in hibernate_asm_64.S alone should show us the direction, but if it makes things work again, I would try to change the restore_registers() alignment to something smaller, like 64 (which should be safe) and see what happens then.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkovwrote: > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: >> A couple of questions: >> - I guess this is reproducible 100% of the time? > > Yap. > > I took latest Linus + tip/master which has your commit. > >> - If you do "echo disk > /sys/power/state" instead of using s2disk, >> does it still crash in the same way? > > My suspend to disk script does: > > echo 3 > /proc/sys/vm/drop_caches > echo "shutdown" > /sys/power/disk > echo "disk" > /sys/power/state > > I don't use anything else for years now. > >> - Are both the image and boot kernels the same binary? > > Yep. OK, we need to find out what's wrong, then. First, please revert the changes in hibernate_asm_64.S alone and see if that makes any difference. Hibernation should still work then most of the time, but the bug fixed by this commit will be back.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov wrote: > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: >> A couple of questions: >> - I guess this is reproducible 100% of the time? > > Yap. > > I took latest Linus + tip/master which has your commit. > >> - If you do "echo disk > /sys/power/state" instead of using s2disk, >> does it still crash in the same way? > > My suspend to disk script does: > > echo 3 > /proc/sys/vm/drop_caches > echo "shutdown" > /sys/power/disk > echo "disk" > /sys/power/state > > I don't use anything else for years now. > >> - Are both the image and boot kernels the same binary? > > Yep. OK, we need to find out what's wrong, then. First, please revert the changes in hibernate_asm_64.S alone and see if that makes any difference. Hibernation should still work then most of the time, but the bug fixed by this commit will be back.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: > A couple of questions: > - I guess this is reproducible 100% of the time? Yap. I took latest Linus + tip/master which has your commit. > - If you do "echo disk > /sys/power/state" instead of using s2disk, > does it still crash in the same way? My suspend to disk script does: echo 3 > /proc/sys/vm/drop_caches echo "shutdown" > /sys/power/disk echo "disk" > /sys/power/state I don't use anything else for years now. > - Are both the image and boot kernels the same binary? Yep. > Sure. I thought Tony would pick them up. Oh ok, next time I'll pester him. :-) But seriously, should we route the RAS-relevant stuff touching drivers/acpi/apei/ through our tree instead? Provided you're fine with them? This way we'll unload some of the burden off you... Thanks! -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote: > A couple of questions: > - I guess this is reproducible 100% of the time? Yap. I took latest Linus + tip/master which has your commit. > - If you do "echo disk > /sys/power/state" instead of using s2disk, > does it still crash in the same way? My suspend to disk script does: echo 3 > /proc/sys/vm/drop_caches echo "shutdown" > /sys/power/disk echo "disk" > /sys/power/state I don't use anything else for years now. > - Are both the image and boot kernels the same binary? Yep. > Sure. I thought Tony would pick them up. Oh ok, next time I'll pester him. :-) But seriously, should we route the RAS-relevant stuff touching drivers/acpi/apei/ through our tree instead? Provided you're fine with them? This way we'll unload some of the burden off you... Thanks! -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkovwrote: > Ok, > > bisect is done, full log below. > > Rafael, that fix > > 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes > control to the image kernel") > > breaks s2disk here. It explodes during resume and a statically allocated > struct's member is NULL. See > > https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic > > for the splat and some debugging attempts. > > Reverting 70595b479ce1 fixes the issue here. Quite evidently, memory is corrupted in the image kernel, but this particular commit only affects the boot kernel, so it can't really corrupt anything in the image one. A couple of questions: - I guess this is reproducible 100% of the time? - If you do "echo disk > /sys/power/state" instead of using s2disk, does it still crash in the same way? - Are both the image and boot kernels the same binary? > @stable folk: You might want to hold off on picking up that one yet... > > > Now, I have serial connected to that box so if you want me to dump stuff > and try patches, let me know. > > Btw, while I have your ear, can you pick up my einj fixes please? > > :-) > > -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de Sure. I thought Tony would pick them up.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov wrote: > Ok, > > bisect is done, full log below. > > Rafael, that fix > > 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes > control to the image kernel") > > breaks s2disk here. It explodes during resume and a statically allocated > struct's member is NULL. See > > https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic > > for the splat and some debugging attempts. > > Reverting 70595b479ce1 fixes the issue here. Quite evidently, memory is corrupted in the image kernel, but this particular commit only affects the boot kernel, so it can't really corrupt anything in the image one. A couple of questions: - I guess this is reproducible 100% of the time? - If you do "echo disk > /sys/power/state" instead of using s2disk, does it still crash in the same way? - Are both the image and boot kernels the same binary? > @stable folk: You might want to hold off on picking up that one yet... > > > Now, I have serial connected to that box so if you want me to dump stuff > and try patches, let me know. > > Btw, while I have your ear, can you pick up my einj fixes please? > > :-) > > -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de Sure. I thought Tony would pick them up.
Re: ktime_get_ts64() splat during resume
Ok, bisect is done, full log below. Rafael, that fix 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel") breaks s2disk here. It explodes during resume and a statically allocated struct's member is NULL. See https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic for the splat and some debugging attempts. Reverting 70595b479ce1 fixes the issue here. @stable folk: You might want to hold off on picking up that one yet... Now, I have serial connected to that box so if you want me to dump stuff and try patches, let me know. Btw, while I have your ear, can you pick up my einj fixes please? :-) -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de Thanks! git bisect start # bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 'tip/master' into tip-timers-splat git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb # good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm git bisect good bb967271c0317de8eed56b20aae978d31507b033 # good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core' git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f # bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm' git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698 # bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot' git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17 # good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time checks to the siginfo compat code git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8 # good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core' git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1 # good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm' git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b # bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c # good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the Calgary IOMMU entry git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582 # first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: ktime_get_ts64() splat during resume
Ok, bisect is done, full log below. Rafael, that fix 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel") breaks s2disk here. It explodes during resume and a statically allocated struct's member is NULL. See https://lkml.kernel.org/r/20160617105435.gb15...@pd.tnic for the splat and some debugging attempts. Reverting 70595b479ce1 fixes the issue here. @stable folk: You might want to hold off on picking up that one yet... Now, I have serial connected to that box so if you want me to dump stuff and try patches, let me know. Btw, while I have your ear, can you pick up my einj fixes please? :-) -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email...@alien8.de Thanks! git bisect start # bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 'tip/master' into tip-timers-splat git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb # good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm git bisect good bb967271c0317de8eed56b20aae978d31507b033 # good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core' git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f # bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm' git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698 # bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot' git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17 # good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time checks to the siginfo compat code git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8 # good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core' git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1 # good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm' git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b # bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c # good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the Calgary IOMMU entry git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582 # first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 01:53:53PM +0200, Thomas Gleixner wrote: > It must be initialized otherwise you won't reach suspend. I have no idea how > that can happen. Btw, there's one other thing I'm seeing in the boot kernel, while it suspends. It hardly is related though: [ 42.046585] kvm: exiting hardware virtualization [ 42.062446] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x20001000 fl] [ 42.255531] sd 3:0:0:0: [sdd] Synchronizing SCSI cache [ 42.263159] sd 3:0:0:0: [sdd] Stopping disk [ 43.090022] irq 16: nobody cared (try booting with the "irqpoll" option) [ 43.099122] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.7.0-rc3+ #4 [ 43.107784] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/23 [ 43.120143] 88043dd83e68 812aacf3 88003790e400 [ 43.130145] 88003790e49c 88043dd83e90 810a5c3e 88003790e400 [ 43.140118] [ 43.140201] sd 2:0:0:0: [sdc] Synchronizing SCSI cache [ 43.140592] sd 2:0:0:0: [sdc] Stopping disk [ 43.157217] 0010 88043dd83ec8 810a5fe7 [ 43.168302] Call Trace: [ 43.173210][] dump_stack+0x67/0x94 [ 43.181481] [] __report_bad_irq+0x3e/0xd0 [ 43.189626] [] note_interrupt+0x257/0x2a0 [ 43.197797] [] handle_irq_event_percpu+0xa9/0x1f0 [ 43.206655] [] handle_irq_event+0x40/0x70 [ 43.214776] [] handle_fasteoi_irq+0xb6/0x170 [ 43.223166] [] handle_irq+0x1a/0x30 [ 43.230748] [] do_IRQ+0x66/0x100 [ 43.238045] [] common_interrupt+0x86/0x86 [ 43.246135][] ? acpi_safe_halt+0x24/0x2c [ 43.254884] [] ? acpi_safe_halt+0x22/0x2c [ 43.262987] [] acpi_idle_do_entry+0x20/0x30 [ 43.271298] [] acpi_idle_enter+0x1dc/0x1fc [ 43.279462] [] cpuidle_enter_state+0x10e/0x300 [ 43.287980] [] cpuidle_enter+0x17/0x20 [ 43.295799] [] call_cpuidle+0x2a/0x40 [ 43.303528] [] cpu_startup_entry+0x2dd/0x3c0 [ 43.304279] sd 1:0:0:0: [sdb] Synchronizing SCSI cache [ 43.306109] sd 1:0:0:0: [sdb] Stopping disk [ 43.325777] [] start_secondary+0x13a/0x150 [ 43.333868] handlers: [ 43.338471] [] azx_interrupt [ 43.345298] Disabling IRQ #16 [ 43.848595] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 43.858068] sd 0:0:0:0: [sda] Stopping disk [ 44.052892] pcieport :00:04.0: System wakeup enabled by ACPI [ 44.075715] ACPI: Preparing to enter system sleep state S5 [ 44.083818] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query honored via cmdline [ 44.092258] reboot: Power down [ 44.096326] acpi_power_off called -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: ktime_get_ts64() splat during resume
On Fri, Jun 17, 2016 at 01:53:53PM +0200, Thomas Gleixner wrote: > It must be initialized otherwise you won't reach suspend. I have no idea how > that can happen. Btw, there's one other thing I'm seeing in the boot kernel, while it suspends. It hardly is related though: [ 42.046585] kvm: exiting hardware virtualization [ 42.062446] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x20001000 fl] [ 42.255531] sd 3:0:0:0: [sdd] Synchronizing SCSI cache [ 42.263159] sd 3:0:0:0: [sdd] Stopping disk [ 43.090022] irq 16: nobody cared (try booting with the "irqpoll" option) [ 43.099122] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.7.0-rc3+ #4 [ 43.107784] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/23 [ 43.120143] 88043dd83e68 812aacf3 88003790e400 [ 43.130145] 88003790e49c 88043dd83e90 810a5c3e 88003790e400 [ 43.140118] [ 43.140201] sd 2:0:0:0: [sdc] Synchronizing SCSI cache [ 43.140592] sd 2:0:0:0: [sdc] Stopping disk [ 43.157217] 0010 88043dd83ec8 810a5fe7 [ 43.168302] Call Trace: [ 43.173210][] dump_stack+0x67/0x94 [ 43.181481] [] __report_bad_irq+0x3e/0xd0 [ 43.189626] [] note_interrupt+0x257/0x2a0 [ 43.197797] [] handle_irq_event_percpu+0xa9/0x1f0 [ 43.206655] [] handle_irq_event+0x40/0x70 [ 43.214776] [] handle_fasteoi_irq+0xb6/0x170 [ 43.223166] [] handle_irq+0x1a/0x30 [ 43.230748] [] do_IRQ+0x66/0x100 [ 43.238045] [] common_interrupt+0x86/0x86 [ 43.246135][] ? acpi_safe_halt+0x24/0x2c [ 43.254884] [] ? acpi_safe_halt+0x22/0x2c [ 43.262987] [] acpi_idle_do_entry+0x20/0x30 [ 43.271298] [] acpi_idle_enter+0x1dc/0x1fc [ 43.279462] [] cpuidle_enter_state+0x10e/0x300 [ 43.287980] [] cpuidle_enter+0x17/0x20 [ 43.295799] [] call_cpuidle+0x2a/0x40 [ 43.303528] [] cpu_startup_entry+0x2dd/0x3c0 [ 43.304279] sd 1:0:0:0: [sdb] Synchronizing SCSI cache [ 43.306109] sd 1:0:0:0: [sdb] Stopping disk [ 43.325777] [] start_secondary+0x13a/0x150 [ 43.333868] handlers: [ 43.338471] [] azx_interrupt [ 43.345298] Disabling IRQ #16 [ 43.848595] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 43.858068] sd 0:0:0:0: [sda] Stopping disk [ 44.052892] pcieport :00:04.0: System wakeup enabled by ACPI [ 44.075715] ACPI: Preparing to enter system sleep state S5 [ 44.083818] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query honored via cmdline [ 44.092258] reboot: Power down [ 44.096326] acpi_power_off called -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: ktime_get_ts64() splat during resume
On Fri, 17 Jun 2016, Borislav Petkov wrote: > look what I've found this morning during resume: > > [ 45.746236] BUG: unable to handle kernel done. > [ 45.752542] NULL pointer dereference at 0001 > [ 45.752544] IP: [<0001>] 0x1 > static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr) > { > cycle_t cycle_now, delta; > > /* read clocksource */ > cycle_now = tkr->read(tkr->clock); > ^^ > > Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or > it being uninitialized after suspend? It must be initialized otherwise you won't reach suspend. I have no idea how that can happen. Thanks, tglx
Re: ktime_get_ts64() splat during resume
On Fri, 17 Jun 2016, Borislav Petkov wrote: > look what I've found this morning during resume: > > [ 45.746236] BUG: unable to handle kernel done. > [ 45.752542] NULL pointer dereference at 0001 > [ 45.752544] IP: [<0001>] 0x1 > static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr) > { > cycle_t cycle_now, delta; > > /* read clocksource */ > cycle_now = tkr->read(tkr->clock); > ^^ > > Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or > it being uninitialized after suspend? It must be initialized otherwise you won't reach suspend. I have no idea how that can happen. Thanks, tglx
ktime_get_ts64() splat during resume
Hi guys, look what I've found this morning during resume: [ 45.732934] PM: Image restored successfully. [ 45.738064] PM: Basic memory bitmaps freed [ 45.742914] Restarting tasks ... [ 45.746236] BUG: unable to handle kernel done. [ 45.752542] NULL pointer dereference at 0001 [ 45.752544] IP: [<0001>] 0x1 [ 45.752547] PGD 37922067 PUD 3791a067 PMD 0 [ 45.752548] Oops: 0010 [#1] PREEMPT SMP [ 45.752557] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd amd_iommu_v2 irqbypass crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq [ 45.752559] CPU: 2 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1 [ 45.752560] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013 [ 45.752560] task: 88042b958000 ti: 88042b954000 task.ti: 88042b954000 [ 45.752562] RIP: 0010:[<0001>] [<0001>] 0x1 [ 45.752562] RSP: 0018:88042b957e50 EFLAGS: 00010246 [ 45.752563] RAX: RBX: 81181e1e RCX: fdfe [ 45.752564] RDX: 88042b958000 RSI: 88042b954000 RDI: 8168aeac [ 45.752564] RBP: 0430 R08: R09: 0002 [ 45.752565] R10: R11: 0001 R12: ff9c0002 [ 45.752565] R13: 88042b09a300 R14: 811782bf R15: 0011 [ 45.752566] FS: 7f9d470ab800() GS:88043dc8() knlGS: [ 45.752567] CS: 0010 DS: ES: CR0: 80050033 [ 45.752567] CR2: 0001 CR3: 37923000 CR4: 000406e0 [ 45.752568] Stack: [ 45.752569] 1180 88042b957e88 810bb33a [ 45.752570] fdfe 88042b957f00 7ffcfa91b1f0 88042b957ee8 [ 45.752571] 81185fc7 0003 21b8e1cd 0003 [ 45.752571] Call Trace: [ 45.752576] [] ? ktime_get_ts64+0x4a/0xf0 [ 45.752578] [] ? poll_select_copy_remaining+0xe7/0x130 [ 45.752581] [] ? exit_to_usermode_loop+0x8a/0xb0 [ 45.752582] [] ? syscall_return_slowpath+0x5b/0x70 [ 45.752584] [] ? entry_SYSCALL_64_fastpath+0xa5/0xa7 [ 45.752587] Code: Bad RIP value. [ 45.752588] RIP [<0001>] 0x1 [ 45.752589] RSP [ 45.752589] CR2: 0001 [ 45.752597] ---[ end trace 5334fe9eec2bfca9 ]--- [ 45.752737] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009 [ 45.752737] [ 45.758773] Kernel Offset: disabled rIP is gone. That's rc3+tip/master from Wed: commit 043ae67b6fdd9db7fb54564124d2f2fa833c5ee6 (refs/remotes/tip/master) Merge: 0bbd6fbba8c7 00688272157d Author: Ingo MolnarDate: Wed Jun 15 12:53:02 2016 +0200 Merge branch 'x86/platform' The top function in the stacktrace is: [ 45.752576] [] ? ktime_get_ts64+0x4a/0xf0 and that address is: 810bb2f0 : 810bb2f0: e8 6b 21 5d 00 callq 8168d460 <__fentry__> 810bb2f5: 55 push %rbp 810bb2f6: 8b 05 14 52 bf 00 mov0xbf5214(%rip),%eax # 81cb0510 810bb2fc: 48 89 e5mov%rsp,%rbp 810bb2ff: 41 55 push %r13 810bb301: 85 c0 test %eax,%eax 810bb303: 41 54 push %r12 810bb305: 53 push %rbx 810bb306: 48 89 fbmov%rdi,%rbx 810bb309: 0f 85 b6 00 00 00 jne810bb3c5 810bb30f: 45 31 e4xor%r12d,%r12d 810bb312: 44 8b 2d a7 a9 f5 00mov0xf5a9a7(%rip),%r13d # 82015cc0 810bb319: 41 f6 c5 01 test $0x1,%r13b 810bb31d: 0f 85 9b 00 00 00 jne810bb3be 810bb323: 48 8b 05 0e aa f5 00mov0xf5aa0e(%rip),%rax # 82015d38 810bb32a: 48 89 03mov%rax,(%rbx) 810bb32d: 48 8b 3d 94 a9 f5 00mov0xf5a994(%rip),%rdi # 82015cc8 810bb334: ff 15 96 a9 f5 00 callq *0xf5a996(%rip)# 82015cd0 810bb33a: 48 2b 05 9f a9 f5 00sub0xf5a99f(%rip),%rax # 82015ce0 <--- HERE 810bb341: 48 8b 15 90 a9 f5 00mov0xf5a990(%rip),%rdx # 82015cd8 810bb348: 44 8b 05 99 a9 f5 00mov0xf5a999(%rip),%r8d # 82015ce8 810bb34f: 48 8b 35 9a a9 f5 00mov0xf5a99a(%rip),%rsi # 82015cf0
ktime_get_ts64() splat during resume
Hi guys, look what I've found this morning during resume: [ 45.732934] PM: Image restored successfully. [ 45.738064] PM: Basic memory bitmaps freed [ 45.742914] Restarting tasks ... [ 45.746236] BUG: unable to handle kernel done. [ 45.752542] NULL pointer dereference at 0001 [ 45.752544] IP: [<0001>] 0x1 [ 45.752547] PGD 37922067 PUD 3791a067 PMD 0 [ 45.752548] Oops: 0010 [#1] PREEMPT SMP [ 45.752557] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd amd_iommu_v2 irqbypass crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq [ 45.752559] CPU: 2 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1 [ 45.752560] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013 [ 45.752560] task: 88042b958000 ti: 88042b954000 task.ti: 88042b954000 [ 45.752562] RIP: 0010:[<0001>] [<0001>] 0x1 [ 45.752562] RSP: 0018:88042b957e50 EFLAGS: 00010246 [ 45.752563] RAX: RBX: 81181e1e RCX: fdfe [ 45.752564] RDX: 88042b958000 RSI: 88042b954000 RDI: 8168aeac [ 45.752564] RBP: 0430 R08: R09: 0002 [ 45.752565] R10: R11: 0001 R12: ff9c0002 [ 45.752565] R13: 88042b09a300 R14: 811782bf R15: 0011 [ 45.752566] FS: 7f9d470ab800() GS:88043dc8() knlGS: [ 45.752567] CS: 0010 DS: ES: CR0: 80050033 [ 45.752567] CR2: 0001 CR3: 37923000 CR4: 000406e0 [ 45.752568] Stack: [ 45.752569] 1180 88042b957e88 810bb33a [ 45.752570] fdfe 88042b957f00 7ffcfa91b1f0 88042b957ee8 [ 45.752571] 81185fc7 0003 21b8e1cd 0003 [ 45.752571] Call Trace: [ 45.752576] [] ? ktime_get_ts64+0x4a/0xf0 [ 45.752578] [] ? poll_select_copy_remaining+0xe7/0x130 [ 45.752581] [] ? exit_to_usermode_loop+0x8a/0xb0 [ 45.752582] [] ? syscall_return_slowpath+0x5b/0x70 [ 45.752584] [] ? entry_SYSCALL_64_fastpath+0xa5/0xa7 [ 45.752587] Code: Bad RIP value. [ 45.752588] RIP [<0001>] 0x1 [ 45.752589] RSP [ 45.752589] CR2: 0001 [ 45.752597] ---[ end trace 5334fe9eec2bfca9 ]--- [ 45.752737] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009 [ 45.752737] [ 45.758773] Kernel Offset: disabled rIP is gone. That's rc3+tip/master from Wed: commit 043ae67b6fdd9db7fb54564124d2f2fa833c5ee6 (refs/remotes/tip/master) Merge: 0bbd6fbba8c7 00688272157d Author: Ingo Molnar Date: Wed Jun 15 12:53:02 2016 +0200 Merge branch 'x86/platform' The top function in the stacktrace is: [ 45.752576] [] ? ktime_get_ts64+0x4a/0xf0 and that address is: 810bb2f0 : 810bb2f0: e8 6b 21 5d 00 callq 8168d460 <__fentry__> 810bb2f5: 55 push %rbp 810bb2f6: 8b 05 14 52 bf 00 mov0xbf5214(%rip),%eax # 81cb0510 810bb2fc: 48 89 e5mov%rsp,%rbp 810bb2ff: 41 55 push %r13 810bb301: 85 c0 test %eax,%eax 810bb303: 41 54 push %r12 810bb305: 53 push %rbx 810bb306: 48 89 fbmov%rdi,%rbx 810bb309: 0f 85 b6 00 00 00 jne810bb3c5 810bb30f: 45 31 e4xor%r12d,%r12d 810bb312: 44 8b 2d a7 a9 f5 00mov0xf5a9a7(%rip),%r13d # 82015cc0 810bb319: 41 f6 c5 01 test $0x1,%r13b 810bb31d: 0f 85 9b 00 00 00 jne810bb3be 810bb323: 48 8b 05 0e aa f5 00mov0xf5aa0e(%rip),%rax # 82015d38 810bb32a: 48 89 03mov%rax,(%rbx) 810bb32d: 48 8b 3d 94 a9 f5 00mov0xf5a994(%rip),%rdi # 82015cc8 810bb334: ff 15 96 a9 f5 00 callq *0xf5a996(%rip)# 82015cd0 810bb33a: 48 2b 05 9f a9 f5 00sub0xf5a99f(%rip),%rax # 82015ce0 <--- HERE 810bb341: 48 8b 15 90 a9 f5 00mov0xf5a990(%rip),%rdx # 82015cd8 810bb348: 44 8b 05 99 a9 f5 00mov0xf5a999(%rip),%r8d # 82015ce8 810bb34f: 48 8b 35 9a a9 f5 00mov0xf5a99a(%rip),%rsi # 82015cf0 810bb356: 8b 0d 90 a9 f5 00 mov0xf5a990(%rip),%ecx # 82015cec 810bb35c: 48 8b 3d e5 a9 f5 00