Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, Oct 23, 2019 at 03:47:57PM +0200, Thomas Gleixner wrote: > > So the fix is trivial. Works like a charm. Tested-by: Cyrill Gorcunov
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, Oct 23, 2019 at 03:47:57PM +0200, Thomas Gleixner wrote: > > And you are right this happens because cea_exception_stacks is not yet > initialized which makes begin = 0 and therefore point into nirvana. > > So the fix is trivial. Great! Thanks a lot for sych detailed analysis! I'll test the patch.
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, 23 Oct 2019, Thomas Gleixner wrote: > On Tue, 22 Oct 2019, Cyrill Gorcunov wrote: > Ergo ep must be a valid pointer pointing to the statically allocated and > statically initialized estack_pages array. > > /* Guard page? */ > if (!ep->size) > > How on earth can dereferencing ep crash the machine? > > return false; > > That does not make any sense. > > Surely, we should not even try to decode exception stack when > cea_exception_stacks is not yet initialized, but that does not explain > anything what you are observing. So looking at your actual crash: [0.027246] BUG: unable to handle page fault for address: 1ff0 So this derefences the stack pointer address. [0.082275] stk 0x1010 k 1 begin 0x0 end 0xd000 estack_pages 0x82014880 ep 0x82014888 ep is pointing correctly to estack_pages[1] which is bogus because 0x1010 is not a valid stack value, but dereferencing ep does not make it crash. The crash farther down: end = begin + (unsigned long)ep->size; ==> end = 0x2000 regs = (struct pt_regs *)end - 1; ==> regs = 0x2000 - sizeof(struct pt_regs *) = 0x1ff0 info->type = ep->type; info->begin = (unsigned long *)begin; info->end = (unsigned long *)end; > info->next_sp = (unsigned long *)regs->sp; This is the crashing instruction trying to access 0x1ff0 And you are right this happens because cea_exception_stacks is not yet initialized which makes begin = 0 and therefore point into nirvana. So the fix is trivial. Thanks, tglx 8< --- a/arch/x86/kernel/dumpstack_64.c +++ b/arch/x86/kernel/dumpstack_64.c @@ -94,6 +94,13 @@ static bool in_exception_stack(unsigned BUILD_BUG_ON(N_EXCEPTION_STACKS != 6); begin = (unsigned long)__this_cpu_read(cea_exception_stacks); + /* +* Handle the case where stack trace is collected _before_ +* cea_exception_stacks had been initialized. +*/ + if (!begin) + return false; + end = begin + sizeof(struct cea_exception_stacks); /* Bail if @stack is outside the exception stack area. */ if (stk < begin || stk >= end)
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, Oct 23, 2019 at 03:38:40PM +0200, Thomas Gleixner wrote: > > > > > > [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all > > > > Wait, Thomas, I might be wrong, but per-cpu is initialized to the pointer, > > the memory for this estack_pages has not yet been allocated, no? > > static const > struct estack_pages estack_pages[CEA_ESTACK_PAGES] cacheline_aligned = { > EPAGERANGE(DF), > EPAGERANGE(NMI), > EPAGERANGE(DB1), > EPAGERANGE(DB), > EPAGERANGE(MCE), > }; > > It's statically allocated. So it's available from the very beginning. Indeed, thanks! I happened to overlooked this moment. ... > And as I explained to you properly decoded the values _ARE_ correct and > make sense. Yes, just posted the diff itself to be sure. Thanks a huge for explanation, Thomas!
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, 23 Oct 2019, Cyrill Gorcunov wrote: > On Wed, Oct 23, 2019 at 03:21:05PM +0200, Thomas Gleixner wrote: > > Errm. estack_pages is statically initialized and it's an array of:. > > > > struct estack_pages { > > u32 offs; > > u16 size; > > u16 type; > > }; > > > > [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all > > Wait, Thomas, I might be wrong, but per-cpu is initialized to the pointer, > the memory for this estack_pages has not yet been allocated, no? static const struct estack_pages estack_pages[CEA_ESTACK_PAGES] cacheline_aligned = { EPAGERANGE(DF), EPAGERANGE(NMI), EPAGERANGE(DB1), EPAGERANGE(DB), EPAGERANGE(MCE), }; It's statically allocated. So it's available from the very beginning. > The diff I made to fetch the values are > > diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c > index 753b8cfe8b8a..bf0d755b6079 100644 > --- a/arch/x86/kernel/dumpstack_64.c > +++ b/arch/x86/kernel/dumpstack_64.c > @@ -101,8 +101,18 @@ static bool in_exception_stack(unsigned long *stack, > struct stack_info *info) > > /* Calc page offset from start of exception stacks */ > k = (stk - begin) >> PAGE_SHIFT; > + > /* Lookup the page descriptor */ > ep = _pages[k]; > + > + printk("stk 0x%lx k %u begin 0x%lx end 0x%lx estack_pages 0x%lx ep > 0x%lx\n", > +stk, k, begin, end, (long)(void *)_pages[0], (long)(void > *)ep); > + > + for (k = 0; k < CEA_ESTACK_PAGES; k++) { > + long v = *(long *)(void *)_pages[k]; > + printk("estack_pages[%d] = 0x%lx\n", k, v); And as I explained to you properly decoded the values _ARE_ correct and make sense. > + } > + > /* Guard page? */ > if (!ep->size) > return false; > > > > > > e.g. 0x51000 1000 > > > > bit 0-31: 1000Offset 0x1000: 1 Page > > bit 32-47: 1000Size 0x1000: 1 Page > > bit 48-63: 5 Type 5: STACK_TYPE_EXCEPTION + ESTACK_DF > > > > So, no. This is NOT the problem. > > I drop the left of your reply. True, I agreed with anything you said. > > You know I didn't manage to dive more into this problem yesterday > but if time permits I'll continue today. It is easily triggering > under kvm (the kernel I'm building is almost without modules so > I simply upload bzImage into the guest). FWIW, the config I'm > using is https://gist.github.com/cyrillos/7cd5d2510a99af8ea872f07ac6f9095b That's helpful because I enabled kmemleak and the kernel comes up just fine. Thanks, tglx
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Tue, 22 Oct 2019, Cyrill Gorcunov wrote: > On Tue, Oct 22, 2019 at 05:23:25PM +0300, Cyrill Gorcunov wrote: > > > > I presume the kmemleak tries to save stack trace too early when > > estack_pages are not > > yet filled. > > Indeed, at this stage of boot the percpu_setup_exception_stacks has not been > called > yet and estack_pages full of crap > > [0.157502] stk 0x1008 k 1 begin 0x0 end 0xd000 estack_pages > 0x82014880 ep 0x82014888 > [0.159395] estack_pages[0] = 0x0 > [0.160046] estack_pages[1] = 0x510001000 > [0.160881] estack_pages[2] = 0x0 > [0.161530] estack_pages[3] = 0x610003000 > [0.162343] estack_pages[4] = 0x0 > [0.162962] estack_pages[5] = 0x0 > [0.163523] estack_pages[6] = 0x0 > [0.164065] estack_pages[7] = 0x810007000 > [0.164978] estack_pages[8] = 0x0 > [0.165624] estack_pages[9] = 0x910009000 > [0.166448] estack_pages[10] = 0x0 > [0.167064] estack_pages[11] = 0xa1000b000 > [0.168055] estack_pages[12] = 0x0 Errm. estack_pages is statically initialized and it's an array of:. struct estack_pages { u32 offs; u16 size; u16 type; }; [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all The rest looks completely valid if you actually decode it proper. e.g. 0x51000 1000 bit 0-31: 1000Offset 0x1000: 1 Page bit 32-47: 1000Size 0x1000: 1 Page bit 48-63: 5 Type 5: STACK_TYPE_EXCEPTION + ESTACK_DF So, no. This is NOT the problem. But yes, you are right that percpu_setup_exception_stacks() has not yet been called simply because the percpu entry area has not been mapped yet. So lets look at the full context: begin = (unsigned long)__this_cpu_read(cea_exception_stacks); When percpu_setup_exception_stacks() has not been called yet, then begin should be 0. end = begin + sizeof(struct cea_exception_stacks); end should be 0 + sizeof(struct cea_exception_stacks); /* Bail if @stack is outside the exception stack area. */ if (stk < begin || stk >= end) return false; So 'begin <= stk < end' must be true to get to the below: /* Calc page offset from start of exception stacks */ k = (stk - begin) >> PAGE_SHIFT; which gives a valid 'k' no matter what 'begin' is. And obviously 'k' cannot be outside of the array size of estack_pages. /* Lookup the page descriptor */ ep = _pages[k]; Ergo ep must be a valid pointer pointing to the statically allocated and statically initialized estack_pages array. /* Guard page? */ if (!ep->size) How on earth can dereferencing ep crash the machine? return false; That does not make any sense. Surely, we should not even try to decode exception stack when cea_exception_stacks is not yet initialized, but that does not explain anything what you are observing. Thanks, tglx
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Wed, Oct 23, 2019 at 03:21:05PM +0200, Thomas Gleixner wrote: > On Tue, 22 Oct 2019, Cyrill Gorcunov wrote: > > On Tue, Oct 22, 2019 at 05:23:25PM +0300, Cyrill Gorcunov wrote: > > > > > > I presume the kmemleak tries to save stack trace too early when > > > estack_pages are not > > > yet filled. > > > > Indeed, at this stage of boot the percpu_setup_exception_stacks has not > > been called > > yet and estack_pages full of crap > > > > [0.157502] stk 0x1008 k 1 begin 0x0 end 0xd000 estack_pages > > 0x82014880 ep 0x82014888 > > [0.159395] estack_pages[0] = 0x0 > > [0.160046] estack_pages[1] = 0x510001000 > > [0.160881] estack_pages[2] = 0x0 > > [0.161530] estack_pages[3] = 0x610003000 > > [0.162343] estack_pages[4] = 0x0 > > [0.162962] estack_pages[5] = 0x0 > > [0.163523] estack_pages[6] = 0x0 > > [0.164065] estack_pages[7] = 0x810007000 > > [0.164978] estack_pages[8] = 0x0 > > [0.165624] estack_pages[9] = 0x910009000 > > [0.166448] estack_pages[10] = 0x0 > > [0.167064] estack_pages[11] = 0xa1000b000 > > [0.168055] estack_pages[12] = 0x0 > > Errm. estack_pages is statically initialized and it's an array of:. > > struct estack_pages { > u32 offs; > u16 size; > u16 type; > }; > > [0,2,4,5,6,8,10,12] are guard pages so 0 is not that crappy at all Wait, Thomas, I might be wrong, but per-cpu is initialized to the pointer, the memory for this estack_pages has not yet been allocated, no? > The rest looks completely valid if you actually decode it proper. The diff I made to fetch the values are diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c index 753b8cfe8b8a..bf0d755b6079 100644 --- a/arch/x86/kernel/dumpstack_64.c +++ b/arch/x86/kernel/dumpstack_64.c @@ -101,8 +101,18 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info) /* Calc page offset from start of exception stacks */ k = (stk - begin) >> PAGE_SHIFT; + /* Lookup the page descriptor */ ep = _pages[k]; + + printk("stk 0x%lx k %u begin 0x%lx end 0x%lx estack_pages 0x%lx ep 0x%lx\n", + stk, k, begin, end, (long)(void *)_pages[0], (long)(void *)ep); + + for (k = 0; k < CEA_ESTACK_PAGES; k++) { + long v = *(long *)(void *)_pages[k]; + printk("estack_pages[%d] = 0x%lx\n", k, v); + } + /* Guard page? */ if (!ep->size) return false; > > e.g. 0x51000 1000 > > bit 0-31: 1000 Offset 0x1000: 1 Page > bit 32-47: 1000 Size 0x1000: 1 Page > bit 48-63: 5 Type 5: STACK_TYPE_EXCEPTION + ESTACK_DF > > So, no. This is NOT the problem. I drop the left of your reply. True, I agreed with anything you said. You know I didn't manage to dive more into this problem yesterday but if time permits I'll continue today. It is easily triggering under kvm (the kernel I'm building is almost without modules so I simply upload bzImage into the guest). FWIW, the config I'm using is https://gist.github.com/cyrillos/7cd5d2510a99af8ea872f07ac6f9095b
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Tue, Oct 22, 2019 at 05:23:25PM +0300, Cyrill Gorcunov wrote: > > I presume the kmemleak tries to save stack trace too early when estack_pages > are not > yet filled. Indeed, at this stage of boot the percpu_setup_exception_stacks has not been called yet and estack_pages full of crap [0.157502] stk 0x1008 k 1 begin 0x0 end 0xd000 estack_pages 0x82014880 ep 0x82014888 [0.159395] estack_pages[0] = 0x0 [0.160046] estack_pages[1] = 0x510001000 [0.160881] estack_pages[2] = 0x0 [0.161530] estack_pages[3] = 0x610003000 [0.162343] estack_pages[4] = 0x0 [0.162962] estack_pages[5] = 0x0 [0.163523] estack_pages[6] = 0x0 [0.164065] estack_pages[7] = 0x810007000 [0.164978] estack_pages[8] = 0x0 [0.165624] estack_pages[9] = 0x910009000 [0.166448] estack_pages[10] = 0x0 [0.167064] estack_pages[11] = 0xa1000b000 [0.168055] estack_pages[12] = 0x0 [0.168891] BUG: unable to handle page fault for address: 1ff0
Re: [BUG -tip] kmemleak and stacktrace cause page faul
On Sat, Oct 19, 2019 at 02:44:21PM +0300, Cyrill Gorcunov wrote: ... > > I nailed it down to the following kmemleak code > > create_object > ... > object->trace_len = __save_stack_trace(object->trace); > > if I drop this line out it boots fine. Just wanted to share the observation, > probably it is known issue already. > > Sidenote: The last -tip kernel which I've been working with is dated Sep 18 > so the changes which cause the problem should be introduced last month. I've just tried to boot with fresh -tip commit c2e50c28eeb90c0f3309d43c2ce0bd292a6751b3 (HEAD -> master, origin/master, origin/HEAD) Merge: aec1ea9d4f48 27a0a90d6301 Author: Ingo Molnar Date: Tue Oct 22 01:16:59 2019 +0200 Merge branch 'perf/core' Conflicts: tools/perf/check-headers.sh but got the same issue. So I tried to go deeper, and here is a result: we're failing in arch/x86/kernel/dumpstack_64.c:in_exception_stack routine, precisely at line ep = _pages[k]; /* Guard page? */ if (!ep->size) return false; so I added a logline here [0.082275] stk 0x1010 k 1 begin 0x0 end 0xd000 estack_pages 0x82014880 ep 0x82014888 [0.084951] BUG: unable to handle page fault for address: 1ff0 [0.086724] #PF: supervisor read access in kernel mode [0.088506] #PF: error_code(0x) - not-present page [0.090265] PGD 0 P4D 0 [0.090846] Oops: [#2] PREEMPT SMP PTI [0.091734] CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-rc4-00258-gc2e50c28eeb9-dirty #114 [0.093514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [0.096993] RIP: 0010:get_stack_info+0xdc/0x173 [0.097994] Code: 84 48 01 82 66 85 c0 74 27 42 8b 14 f5 80 48 01 82 49 01 d5 42 0f b7 14 f5 86 48 01 82 4c 01 e8 4c 89 6b 08 89 13 48 89 43 10 <48> 8b 40 f0 eb 2b 65 48 8b 05 16 f4 f9 7e 48 8d 90 00 c0 ff ff 48 I presume the kmemleak tries to save stack trace too early when estack_pages are not yet filled.
[BUG -tip] kmemleak and stacktrace cause page faul
Hi! I'm not sure if I've CC'ed proper persons, so please sorry if I did. Anyway, today's -tip (07b4dbf1d830) refused to boot [0.024793] No NUMA configuration found [0.025406] Faking a node at [mem 0x-0x7ffdefff] [0.026462] NODE_DATA(0) allocated [mem 0x7ffdb000-0x7ffdefff] [0.027246] BUG: unable to handle page fault for address: 1ff0 [0.028160] #PF: supervisor read access in kernel mode [0.028992] #PF: error_code(0x) - not-present page [0.029820] PGD 0 P4D 0 [0.030226] Oops: [#1] PREEMPT SMP PTI [0.031069] CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-rc3-00258-g07b4dbf1d830 #93 [0.032317] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [0.034163] RIP: 0010:get_stack_info+0xb3/0x148 [0.034903] Code: 04 d5 84 48 01 82 66 85 c0 74 25 8b 0c d5 80 48 01 82 0f b7 14 d5 86 48 01 82 48 01 f1 89 13 48 01 c8 48 89 4b 08 48 89 43 10 <48> 8b 40 f0 eb 2b 65 48 8b 05 1f f4 f9 7e 48 8d 90 00 c0 ff ff 48 [0.037579] RSP: :82603be0 EFLAGS: 00010006 I nailed it down to the following kmemleak code create_object ... object->trace_len = __save_stack_trace(object->trace); if I drop this line out it boots fine. Just wanted to share the observation, probably it is known issue already. Sidenote: The last -tip kernel which I've been working with is dated Sep 18 so the changes which cause the problem should be introduced last month.