Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-28 Thread Maciej W. Rozycki
On Wed, 18 Mar 2015, Andy Lutomirski wrote:

> > I posted the same problem to the opensuse kernel list shortly before turning
> > to LKML. There, Michal Kubecek noted:
> >
> > "I encountered a similar problem recently. The thing is, x86
> > specification says that on a double fault, RIP and RSP registers are
> > undefined, i.e. you not only can't expect them to contain values
> > corresponding to the first or second fault but you can't even expect
> > them to have any usable values at all. Unfortunately the kernel double
> > fault handler doesn't take this into account and does try to display
> > usual crash related information so that it itself does usually crash
> > when trying to show stack content (that's the show_stack_log_lvl()
> > crash).
> 
> I think that's not entirely true.  RIP is reliable for many classes of
> double faults, and we rely on that for espfix64.  The fact that hpa
> was willing to write that code strongly suggests that Intel chips at
> least really do work that way.

 A #DF won't deliberately clobber the instruction or the stack pointer.  
It's only that it may happen at a stage where either or both original 
pointers have been lost and replaced with new values already, possibly 
making them inconsistent with the corresponding segment selectors too (as 
they are not written at the same time).

 This will only happen in certain degenerate corner cases, such as a
problem with TSS (#TS) in the processing of a task gate used for taking
the original exception, where a part of the new context has already been
loaded before #DF resulted.  Another case is a stack segment limit
violation (#SS), where the stack has been switched in the processing of a trap
or interrupt gate, preventing return information and error code from being
pushed for the original exception.  These are not conditions we'd normally
observe in Linux.

 In other cases both the original instruction and the original stack 
pointer will have been retained.
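Whether a second exception raised while delivering the first escalates to #DF at all depends on exception classes, per the double-fault conditions table in the Intel SDM. The #TS and #SS cases above are both "contributory". The C sketch below is an illustrative model of that rule, not kernel code. The function names are hypothetical, and only the classic vector assignments are listed (later SDM revisions add further vectors to these classes).

```c
#include <assert.h>
#include <stdbool.h>

/* Exception classes used by the x86 double-fault rule. */
enum exc_class { BENIGN, CONTRIBUTORY, PAGE_FAULT };

/* Classic vector assignments only; later SDM revisions add vectors. */
static enum exc_class classify(int vector)
{
	/* #DF itself (vector 8) is outside this model: a further fault
	 * while delivering #DF is a triple fault. */
	assert(vector >= 0 && vector < 32 && vector != 8);

	switch (vector) {
	case 0:		/* #DE */
	case 10:	/* #TS */
	case 11:	/* #NP */
	case 12:	/* #SS */
	case 13:	/* #GP */
		return CONTRIBUTORY;
	case 14:	/* #PF */
		return PAGE_FAULT;
	default:
		return BENIGN;
	}
}

/* Would a second exception, raised while delivering the first, become #DF? */
static bool escalates_to_df(int first, int second)
{
	enum exc_class a = classify(first);
	enum exc_class b = classify(second);

	if (a == CONTRIBUTORY)
		return b == CONTRIBUTORY;	/* contributory then #PF: handled serially */
	if (a == PAGE_FAULT)
		return b == CONTRIBUTORY || b == PAGE_FAULT;
	return false;			/* benign first exception: handled serially */
}
```

For example, a #TS while delivering a #GP escalates, while a #PF while delivering a #GP is handled serially.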

  Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 12:07 PM, Denys Vlasenko  wrote:
> On 03/23/2015 07:38 PM, Andy Lutomirski wrote:
>>> cmpq $__NR_syscall_max,%rax
>>> ja ret_from_sys_call
>>> movq %r10,%rcx
>>> call *sys_call_table(,%rax,8)  # XXX:rip relative
>>> movq %rax,RAX-ARGOFFSET(%rsp)
>>> ret_from_sys_call:
>>> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>>> 
>>> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>>> LOCKDEP_SYS_EXIT
>>> DISABLE_INTERRUPTS(CLBR_NONE)
>>> TRACE_IRQS_OFF
>>> ...
>>> ...
>>> int_ret_from_sys_call_fixup:
>>> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>>> jmp int_ret_from_sys_call
>>> ...
>>> ...
>>> GLOBAL(int_ret_from_sys_call)
>>> DISABLE_INTERRUPTS(CLBR_NONE)
>>> TRACE_IRQS_OFF
>>>
>>> You reverted that by moving this insn to be after first 
>>> DISABLE_INTERRUPTS(CLBR_NONE).
>>>
>>> I also don't see how moving that check (even if it is wrong in a more
>>> benign way) can have such a drastic effect.
>>
>> I bet I see it.  I have the advantage of having stared at KVM code and
>> cursed at it more recently than you, I suspect.  KVM does awful, awful
>> things to CPU state, and, as an optimization, it allows kernel code to
>> run with CPU state that would be totally invalid in user mode.  This
>> happens through a bunch of hooks, including this bit in __switch_to:
>>
>> /*
>>  * Now maybe reload the debug registers and handle I/O bitmaps
>>  */
>> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>>  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
>> __switch_to_xtra(prev_p, next_p, tss);
>>
>> IOW, we *change* tif during context switches.
>>
>>
>> The race looks like this:
>>
>> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
>> jnz int_ret_from_sys_call_fixup/* Go the the slow path */
>>
>> --- preempted here, switch to KVM guest ---
>>
>> KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
>> happen to be a *32-bit* KVM guest, perhaps?
>>
>> Now KVM schedules, calling __switch_to.  __switch_to sets
>> _TIF_USER_RETURN_NOTIFY.
>
> Clear up to now...
>
>> We IRET back to the syscall exit code,
>
> So we end up being just after the "testl", right?
> We go into "int_ret_from_sys_call_fixup".

Nope, other way around.  We saw no work bits set in testl, but one or
more of those bits got set while we were preempted, so when we return
we *don't* go to int_ret_from_sys_call_fixup.  I don't think that the
resulting sysret itself is harmful, but I think we're now running user
code with some MSRs programmed wrong.  The next syscall could do bad
things, such as failing to clear IF.
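Background on the MSR involved: SYSCALL computes the kernel-entry RFLAGS as the user RFLAGS ANDed with the complement of IA32_FMASK (MSR_SYSCALL_MASK). If a guest's value were still loaded when the host returns to user space, and that value lacked IF, the next host syscall would enter the kernel with interrupts enabled. That would leave a window, before the entry code switches to the kernel stack, in which an interrupt could be delivered on the user stack. This is a hedged reading of the scenario, not something the thread confirms. The minimal C model below assumes an illustrative subset of the host mask (TF|DF|IF; the real host value contains more bits) and a hypothetical guest value of 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural RFLAGS bit values. */
#define X86_EFLAGS_TF	0x00000100ULL
#define X86_EFLAGS_IF	0x00000200ULL
#define X86_EFLAGS_DF	0x00000400ULL

/* Illustrative subset of the host's MSR_SYSCALL_MASK; the real value has more bits. */
#define HOST_FMASK_SUBSET	(X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_IF)

/* Hypothetical guest value, e.g. from a guest that never programs the MSR. */
#define GUEST_FMASK_EXAMPLE	0x0ULL

static_assert((HOST_FMASK_SUBSET & X86_EFLAGS_IF) != 0,
	      "the host mask is expected to clear IF on SYSCALL");

/* SYSCALL: kernel-entry RFLAGS = user RFLAGS & ~IA32_FMASK. */
static uint64_t rflags_after_syscall(uint64_t user_rflags, uint64_t fmask)
{
	return user_rflags & ~fmask;
}

/* True if interrupts are still enabled at the first kernel instruction. */
static bool irqs_on_at_entry(uint64_t user_rflags, uint64_t fmask)
{
	return (rflags_after_syscall(user_rflags, fmask) & X86_EFLAGS_IF) != 0;
}
```

With a typical user-mode RFLAGS of 0x202 (IF plus the always-set bit 1), the host mask leaves 0x2 on entry, while the example guest mask leaves IF set.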

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 07:38 PM, Andy Lutomirski wrote:
>> cmpq $__NR_syscall_max,%rax
>> ja ret_from_sys_call
>> movq %r10,%rcx
>> call *sys_call_table(,%rax,8)  # XXX:rip relative
>> movq %rax,RAX-ARGOFFSET(%rsp)
>> ret_from_sys_call:
>> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> 
>> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> LOCKDEP_SYS_EXIT
>> DISABLE_INTERRUPTS(CLBR_NONE)
>> TRACE_IRQS_OFF
>> ...
>> ...
>> int_ret_from_sys_call_fixup:
>> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>> jmp int_ret_from_sys_call
>> ...
>> ...
>> GLOBAL(int_ret_from_sys_call)
>> DISABLE_INTERRUPTS(CLBR_NONE)
>> TRACE_IRQS_OFF
>>
>> You reverted that by moving this insn to be after first 
>> DISABLE_INTERRUPTS(CLBR_NONE).
>>
>> I also don't see how moving that check (even if it is wrong in a more
>> benign way) can have such a drastic effect.
> 
> I bet I see it.  I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect.  KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode.  This
> happens through a bunch of hooks, including this bit in __switch_to:
> 
> /*
>  * Now maybe reload the debug registers and handle I/O bitmaps
>  */
> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
> 
> IOW, we *change* tif during context switches.
> 
> 
> The race looks like this:
> 
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> jnz int_ret_from_sys_call_fixup/* Go the the slow path */
> 
> --- preempted here, switch to KVM guest ---
> 
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?
> 
> Now KVM schedules, calling __switch_to.  __switch_to sets
> _TIF_USER_RETURN_NOTIFY.

Clear up to now...

> We IRET back to the syscall exit code,

So we end up being just after the "testl", right?
We go into "int_ret_from_sys_call_fixup".
We FIXUP_TOP_OF_STACK, so now the iret frame contains correct values.
Then we jump to "int_ret_from_sys_call".

> turn off interrupts, and do sysret.  We are now screwed.

I don't understand. Where exactly would it go wrong?

On sysret, rsp would be restored from PER_CPU(old_rsp), right?
We'd end up in *userspace* with userspace rsp.

Moreover, since we FIXUPed the iret frame, it does not even matter
how we exit to userspace. Either sysret or iret would work.
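Denys's argument can be modeled in a few lines of C: once the fixup has copied the per-CPU saved user RSP into the frame, SYSRET (which restores RSP from old_rsp) and IRET (which pops RSP from the frame) agree. This is a hypothetical simplification: struct pt_regs_model, old_rsp as a plain global, and the helper names are illustrative, and only the RSP part of FIXUP_TOP_OF_STACK is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the syscall frame. */
struct pt_regs_model {
	uint64_t ip;
	uint64_t sp;	/* not written by the fast path on entry */
};

/* Stand-in for PER_CPU_VAR(old_rsp): user RSP saved at SYSCALL entry. */
static uint64_t old_rsp;

/* Only the RSP part of FIXUP_TOP_OF_STACK:
 *   movq PER_CPU_VAR(old_rsp),\tmp
 *   movq \tmp,RSP+\offset(%rsp)
 */
static void fixup_top_of_stack(struct pt_regs_model *regs)
{
	assert(regs != NULL);
	regs->sp = old_rsp;
}

/* SYSRET fast path: RSP comes from old_rsp, not from the frame. */
static uint64_t user_rsp_after_sysret(const struct pt_regs_model *regs)
{
	(void)regs;
	return old_rsp;
}

/* IRET: RSP is popped from the frame. */
static uint64_t user_rsp_after_iret(const struct pt_regs_model *regs)
{
	return regs->sp;
}
```

Before the fixup the two exit paths can disagree; after it they return the same user RSP.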


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 11:48:42 -0700,
Andy Lutomirski wrote:
> 
> On Mon, Mar 23, 2015 at 11:38 AM, Andy Lutomirski  wrote:
> > On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko  wrote:
> >> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> >>> At Mon, 23 Mar 2015 10:35:41 +0100,
> >>> Takashi Iwai wrote:
> 
>  At Mon, 23 Mar 2015 10:02:52 +0100,
>  Takashi Iwai wrote:
> >
> > At Fri, 20 Mar 2015 19:16:53 +0100,
> > Denys Vlasenko wrote:
> >
> >>> I'm really puzzled now.  We have a few pieces of information:
> >>>
> >>> - git bisection pointed the commit 96b6352c1271:
> >>> x86_64, entry: Remove the syscall exit audit and schedule 
> >>> optimizations
> >>>   and reverting this "fixes" the problem indeed.  Even just moving two
> >>>   lines
> >>> LOCKDEP_SYS_EXIT
> >>> DISABLE_INTERRUPTS(CLBR_NONE)
> >>>   at the beginning of ret_from_sys_call already fixes.  (Of course I
> >>>   can't prove the fix but it stabilizes for a day without crash while
> >>>   usually I hit the bug in 10 minutes in full test running.)
> >>
> >> The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> >> interrupt-disabled region to interrupt-enabled:
> >>
> >> cmpq $__NR_syscall_max,%rax
> >> ja ret_from_sys_call
> >> movq %r10,%rcx
> >> call *sys_call_table(,%rax,8)  # XXX:rip relative
> >> movq %rax,RAX-ARGOFFSET(%rsp)
> >> ret_from_sys_call:
> >> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> >> 
> >> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> >> LOCKDEP_SYS_EXIT
> >> DISABLE_INTERRUPTS(CLBR_NONE)
> >> TRACE_IRQS_OFF
> >> ...
> >> ...
> >> int_ret_from_sys_call_fixup:
> >> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> >> jmp int_ret_from_sys_call
> >> ...
> >> ...
> >> GLOBAL(int_ret_from_sys_call)
> >> DISABLE_INTERRUPTS(CLBR_NONE)
> >> TRACE_IRQS_OFF
> >>
> >> You reverted that by moving this insn to be after first 
> >> DISABLE_INTERRUPTS(CLBR_NONE).
> >>
> >> I also don't see how moving that check (even if it is wrong in a more
> >> benign way) can have such a drastic effect.
> >
> > I bet I see it.  I have the advantage of having stared at KVM code and
> > cursed at it more recently than you, I suspect.  KVM does awful, awful
> > things to CPU state, and, as an optimization, it allows kernel code to
> > run with CPU state that would be totally invalid in user mode.  This
> > happens through a bunch of hooks, including this bit in __switch_to:
> >
> > /*
> >  * Now maybe reload the debug registers and handle I/O bitmaps
> >  */
> > if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
> >  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> > __switch_to_xtra(prev_p, next_p, tss);
> >
> > IOW, we *change* tif during context switches.
> >
> >
> > The race looks like this:
> >
> > testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> > jnz int_ret_from_sys_call_fixup/* Go the the slow path */
> >
> > --- preempted here, switch to KVM guest ---
> >
> > KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
> > happen to be a *32-bit* KVM guest, perhaps?
> >
> > Now KVM schedules, calling __switch_to.  __switch_to sets
> > _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
> > off interrupts, and do sysret.  We are now screwed.
> >
> > I don't know why this manifests in this particular failure, but any
> > number of terrible things could happen now.
> >
> > FWIW, this will affect things other than KVM.  For example, SIGKILL
> > sent while a process is sleeping in that two-instruction window won't
> > work.
> >
> > Takashi, can you re-send your patch so we can review it for real in
> > light of this race?
> 
> Never mind, I'm testing a slightly fancier patch.

OK, I'll wait for your test patch.


thanks,

Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 11:38:30 -0700,
Andy Lutomirski wrote:
> 
> On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko  wrote:
> > On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> >> At Mon, 23 Mar 2015 10:35:41 +0100,
> >> Takashi Iwai wrote:
> >>>
> >>> At Mon, 23 Mar 2015 10:02:52 +0100,
> >>> Takashi Iwai wrote:
> 
>  At Fri, 20 Mar 2015 19:16:53 +0100,
>  Denys Vlasenko wrote:
> 
> >> I'm really puzzled now.  We have a few pieces of information:
> >>
> >> - git bisection pointed the commit 96b6352c1271:
> >> x86_64, entry: Remove the syscall exit audit and schedule optimizations
> >>   and reverting this "fixes" the problem indeed.  Even just moving two
> >>   lines
> >> LOCKDEP_SYS_EXIT
> >> DISABLE_INTERRUPTS(CLBR_NONE)
> >>   at the beginning of ret_from_sys_call already fixes.  (Of course I
> >>   can't prove the fix but it stabilizes for a day without crash while
> >>   usually I hit the bug in 10 minutes in full test running.)
> >
> > The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> > interrupt-disabled region to interrupt-enabled:
> >
> > cmpq $__NR_syscall_max,%rax
> > ja ret_from_sys_call
> > movq %r10,%rcx
> > call *sys_call_table(,%rax,8)  # XXX:rip relative
> > movq %rax,RAX-ARGOFFSET(%rsp)
> > ret_from_sys_call:
> > testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> > 
> > jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> > LOCKDEP_SYS_EXIT
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> > ...
> > ...
> > int_ret_from_sys_call_fixup:
> > FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> > jmp int_ret_from_sys_call
> > ...
> > ...
> > GLOBAL(int_ret_from_sys_call)
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> >
> > You reverted that by moving this insn to be after first 
> > DISABLE_INTERRUPTS(CLBR_NONE).
> >
> > I also don't see how moving that check (even if it is wrong in a more
> > benign way) can have such a drastic effect.
> 
> I bet I see it.  I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect.  KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode.  This
> happens through a bunch of hooks, including this bit in __switch_to:
> 
> /*
>  * Now maybe reload the debug registers and handle I/O bitmaps
>  */
> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
> 
> IOW, we *change* tif during context switches.
> 
> 
> The race looks like this:
> 
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> jnz int_ret_from_sys_call_fixup/* Go the the slow path */
> 
> --- preempted here, switch to KVM guest ---
> 
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?
> 
> Now KVM schedules, calling __switch_to.  __switch_to sets
> _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
> off interrupts, and do sysret.  We are now screwed.

Thanks for the enlightenment!  That looks like a feasible scenario.
(I tested only with a 64-bit KVM guest, BTW.)

> I don't know why this manifests in this particular failure, but any
> number of terrible things could happen now.
> 
> FWIW, this will affect things other than KVM.  For example, SIGKILL
> sent while a process is sleeping in that two-instruction window won't
> work.
> 
> Takashi, can you re-send your patch so we can review it for real in
> light of this race?

The patch below worked.  I'll double-check tomorrow whether it
really cures the problem reliably.


thanks,

Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
-
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
+
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts:


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Stefan Seyfried
Am 23.03.2015 um 19:38 schrieb Andy Lutomirski:
> I bet I see it.  I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect.  KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode.  This
> happens through a bunch of hooks, including this bit in __switch_to:
> 
> /*
>  * Now maybe reload the debug registers and handle I/O bitmaps
>  */
> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
> 
> IOW, we *change* tif during context switches.
> 
> 
> The race looks like this:
> 
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> jnz int_ret_from_sys_call_fixup/* Go the the slow path */
> 
> --- preempted here, switch to KVM guest ---
> 
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?

Not in my case (Penryn CPU); there it was 64-bit guests.

> Now KVM schedules, calling __switch_to.  __switch_to sets
> _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
> off interrupts, and do sysret.  We are now screwed.
> 
> I don't know why this manifests in this particular failure, but any
> number of terrible things could happen now.
> 
> FWIW, this will affect things other than KVM.  For example, SIGKILL
> sent while a process is sleeping in that two-instruction window won't
> work.
> 
> Takashi, can you re-send your patch so we can review it for real in
> light of this race?
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 11:38 AM, Andy Lutomirski  wrote:
> On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko  wrote:
>> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
>>> At Mon, 23 Mar 2015 10:35:41 +0100,
>>> Takashi Iwai wrote:

 At Mon, 23 Mar 2015 10:02:52 +0100,
 Takashi Iwai wrote:
>
> At Fri, 20 Mar 2015 19:16:53 +0100,
> Denys Vlasenko wrote:
>
>>> I'm really puzzled now.  We have a few pieces of information:
>>>
>>> - git bisection pointed the commit 96b6352c1271:
>>> x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>>   and reverting this "fixes" the problem indeed.  Even just moving two
>>>   lines
>>> LOCKDEP_SYS_EXIT
>>> DISABLE_INTERRUPTS(CLBR_NONE)
>>>   at the beginning of ret_from_sys_call already fixes.  (Of course I
>>>   can't prove the fix but it stabilizes for a day without crash while
>>>   usually I hit the bug in 10 minutes in full test running.)
>>
>> The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
>> interrupt-disabled region to interrupt-enabled:
>>
>> cmpq $__NR_syscall_max,%rax
>> ja ret_from_sys_call
>> movq %r10,%rcx
>> call *sys_call_table(,%rax,8)  # XXX:rip relative
>> movq %rax,RAX-ARGOFFSET(%rsp)
>> ret_from_sys_call:
>> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> 
>> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> LOCKDEP_SYS_EXIT
>> DISABLE_INTERRUPTS(CLBR_NONE)
>> TRACE_IRQS_OFF
>> ...
>> ...
>> int_ret_from_sys_call_fixup:
>> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>> jmp int_ret_from_sys_call
>> ...
>> ...
>> GLOBAL(int_ret_from_sys_call)
>> DISABLE_INTERRUPTS(CLBR_NONE)
>> TRACE_IRQS_OFF
>>
>> You reverted that by moving this insn to be after first 
>> DISABLE_INTERRUPTS(CLBR_NONE).
>>
>> I also don't see how moving that check (even if it is wrong in a more
>> benign way) can have such a drastic effect.
>
> I bet I see it.  I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect.  KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode.  This
> happens through a bunch of hooks, including this bit in __switch_to:
>
> /*
>  * Now maybe reload the debug registers and handle I/O bitmaps
>  */
> if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
>
> IOW, we *change* tif during context switches.
>
>
> The race looks like this:
>
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> jnz int_ret_from_sys_call_fixup/* Go the the slow path */
>
> --- preempted here, switch to KVM guest ---
>
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?
>
> Now KVM schedules, calling __switch_to.  __switch_to sets
> _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
> off interrupts, and do sysret.  We are now screwed.
>
> I don't know why this manifests in this particular failure, but any
> number of terrible things could happen now.
>
> FWIW, this will affect things other than KVM.  For example, SIGKILL
> sent while a process is sleeping in that two-instruction window won't
> work.
>
> Takashi, can you re-send your patch so we can review it for real in
> light of this race?

Never mind, I'm testing a slightly fancier patch.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 18:46:45 +0100,
Denys Vlasenko wrote:
> 
> On 03/23/2015 06:18 PM, Takashi Iwai wrote:
> > At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
>  I pulled tip tree on top of 4.0-rc5, built with your patch and now
>  succeeded to get a better message:
> 
>   kvm: zapping shadow pages for mmio generation wraparound
>   kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
>   Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
>   RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
>   PANIC: double fault, error_code: 0x0
>   CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
>   Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>   task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
>   RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
>   RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
>   RAX:  RBX: 0005 RCX: c101
>   RDX:  RSI: 0001 RDI: 7ffd22c23ef0
> 
> >> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
> >>
> >>    0:   0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
> >>    5:   55                   push   %rbp
> >>    6:   48 89 e5             mov    %rsp,%rbp
> >>    9:   41 56                push   %r14
> >>    b:   41 55                push   %r13
> >>    d:   49 89 d5             mov    %rdx,%r13
> >>   10:   41 54                push   %r12
> >>   12:   49 89 f4             mov    %rsi,%r12
> >>   15:   53                   push   %rbx
> >>   16:   48 89 fb             mov    %rdi,%rbx
> >>   19:   48 83 ec 30          sub    $0x30,%rsp
> >>   1d:   8b 87 68 01 00 00    mov    0x168(%rdi),%eax
> >>                              ^
> >>   23:   39 87 9c 01 00 00    cmp    %eax,0x19c(%rdi)
> >>   29:   7c 25                jl     50 <_start+0x50>
> >>   2b:   48 8b 87 88 04 00 00 mov    0x488(%rdi),%rax
> >>
> >> The ^ instruction is the one which faults. Since you said it
> >> consistently happens here, this should be a page fault, not an external
> >> hardware interrupt.
> >>
> >> The code corresponds to the comparison in if():
> >>
> >> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
> >>   long *timeo, struct sock *ssk)
> >> {
> >> struct netlink_sock *nlk;
> >>
> >> nlk = nlk_sk(sk);
> >>
> >> if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> 
> >>> - Another piece is that the bug happens only when a KVM is running.
> >>>   The kernel ran without problem over days with similar tasks
> >>>   (compiling kernel, etc) when no KVM was used.
> >>
> >> Conceivably virtualization support in CPUs can have nasty erratas.
> >> However, you and other reporter have different CPUs - yours
> >> is Ivy Bridge, his CPU is a Penryn.
> >>
> >> I don't see the path how KVM helps to trigger this.
> >>
> >>> - And now I get the trace as above, pointing netlink_attachskb().
> >>>
> >>> I have a difficulty to imagine how all these pieces fit into a single
> >>> picture.  Is something already screwed up before that?
> >>
> >> Well, a tiny bit more info will be seen if you'd change %rdi
> >> to, say, %r15 in these two lines in my patch:
> >>
> >>	/* Save bogus RSP value */
> >>	movq	%rsp,%rdi
> >> ...
> >>	push	%rdi		/* pt_regs->sp */
> >>
> >> Then original %rdi will be visible in the crash message.
> > 
> > OK, here we go.
> > 
> >  kvm: zapping shadow pages for mmio generation wraparound
> >  kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
> >  Exception on user stack 7fff1d7e5ec0: RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
> >  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
> >  PANIC: double fault, error_code: 0x0
> >  CPU: 5 PID: 14285 Comm: fixdep Tainted: GW   4.0.0-rc5-debug1+ #3
> >  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> >  task: 88020ba1c690 ti: 880206ba4000 task.ti: 880206ba4000
> >  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
> >  RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
> >  RAX:  RBX:  RCX: c101
> >  RDX:  RSI: 1ebb RDI: 
> 
> Thanks for your testing. So the %rdi was NULL... not very informative.
> 
> Notice that your every crash is preceded by
> 
> kvm: zapping shadow pages for mmio generation wraparound
> kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
> 
> This hints that kvm _is_ somehow responsible.

It's likely irrelevant, as this appears when a VM starts, not at
crash time.  I get this message all the time.  Sorry for the
confusion.


Takashi

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko  wrote:
> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
>> At Mon, 23 Mar 2015 10:35:41 +0100,
>> Takashi Iwai wrote:
>>>
>>> At Mon, 23 Mar 2015 10:02:52 +0100,
>>> Takashi Iwai wrote:

 At Fri, 20 Mar 2015 19:16:53 +0100,
 Denys Vlasenko wrote:

>> I'm really puzzled now.  We have a few pieces of information:
>>
>> - git bisection pointed the commit 96b6352c1271:
>> x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>   and reverting this "fixes" the problem indeed.  Even just moving two
>>   lines
>> LOCKDEP_SYS_EXIT
>> DISABLE_INTERRUPTS(CLBR_NONE)
>>   at the beginning of ret_from_sys_call already fixes.  (Of course I
>>   can't prove the fix but it stabilizes for a day without crash while
>>   usually I hit the bug in 10 minutes in full test running.)
>
> The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> interrupt-disabled region to interrupt-enabled:
>
> cmpq $__NR_syscall_max,%rax
> ja ret_from_sys_call
> movq %r10,%rcx
> call *sys_call_table(,%rax,8)  # XXX:rip relative
> movq %rax,RAX-ARGOFFSET(%rsp)
> ret_from_sys_call:
> testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> 
> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> LOCKDEP_SYS_EXIT
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> ...
> ...
> int_ret_from_sys_call_fixup:
> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> jmp int_ret_from_sys_call
> ...
> ...
> GLOBAL(int_ret_from_sys_call)
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
>
> You reverted that by moving this insn to be after first 
> DISABLE_INTERRUPTS(CLBR_NONE).
>
> I also don't see how moving that check (even if it is wrong in a more
> benign way) can have such a drastic effect.

I bet I see it.  I have the advantage of having stared at KVM code and
cursed at it more recently than you, I suspect.  KVM does awful, awful
things to CPU state, and, as an optimization, it allows kernel code to
run with CPU state that would be totally invalid in user mode.  This
happens through a bunch of hooks, including this bit in __switch_to:

	/*
	 * Now maybe reload the debug registers and handle I/O bitmaps
	 */
	if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
		     task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
		__switch_to_xtra(prev_p, next_p, tss);

IOW, we *change* TIF flags during context switches.


The race looks like this:

testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
jnz int_ret_from_sys_call_fixup /* Go to the slow path */

--- preempted here, switch to KVM guest ---

KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
happen to be a *32-bit* KVM guest, perhaps?

Now KVM schedules, calling __switch_to.  __switch_to sets
_TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
off interrupts, and do sysret.  We are now screwed.

I don't know why this manifests in this particular failure, but any
number of terrible things could happen now.

FWIW, this will affect things other than KVM.  For example, SIGKILL
sent while a process is sleeping in that two-instruction window won't
work.

Takashi, can you re-send your patch so we can review it for real in
light of this race?

>
>
> Shot-in-the-dark idea. At this code revision we did not yet
> store user's %rsp in pt_regs->sp, we used a fixup to populate it:
>
> .macro FIXUP_TOP_OF_STACK tmp offset=0
> movq PER_CPU_VAR(old_rsp),\tmp
> movq \tmp,RSP+\offset(%rsp)
>
> (There are pending patches to fix this mess).
>
> If an interrupt interrupting *kernel code* would go into a code path
> which does FIXUP_TOP_OF_STACK, it'd overwrite the correct saved %rsp
> with a user's one. The iret from interrupt would work,
> but the resulting CPU state would be inconsistent. But I don't see
> such a code path from interrupts to FIXUP_TOP_OF_STACK...

I don't buy it.  Anything that does that is so completely broken that
I'd hope we'd have found it long ago.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 06:18 PM, Takashi Iwai wrote:
> At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
 I pulled the tip tree on top of 4.0-rc5, built it with your patch, and
 now succeeded in getting a better message:

  kvm: zapping shadow pages for mmio generation wraparound
  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
  PANIC: double fault, error_code: 0x0
  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
  RAX:  RBX: 0005 RCX: c101
  RDX:  RSI: 0001 RDI: 7ffd22c23ef0

>> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
>>
>>   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>   5:   55                      push   %rbp
>>   6:   48 89 e5                mov    %rsp,%rbp
>>   9:   41 56                   push   %r14
>>   b:   41 55                   push   %r13
>>   d:   49 89 d5                mov    %rdx,%r13
>>  10:   41 54                   push   %r12
>>  12:   49 89 f4                mov    %rsi,%r12
>>  15:   53                      push   %rbx
>>  16:   48 89 fb                mov    %rdi,%rbx
>>  19:   48 83 ec 30             sub    $0x30,%rsp
>>  1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
>>                                ^
>>  23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
>>  29:   7c 25                   jl     50 <_start+0x50>
>>  2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
>>
>> The ^ instruction is the one which faults. Since you said it
>> consistently happens here, this should be a page fault, not an external
>> hardware interrupt.
>>
>> The code corresponds to the comparison in if():
>>
>> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
>>		      long *timeo, struct sock *ssk)
>> {
>>	struct netlink_sock *nlk;
>>
>>	nlk = nlk_sk(sk);
>>
>>	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

>>> - Another piece is that the bug happens only when a KVM guest is running.
>>>   The kernel ran without problems for days with similar tasks
>>>   (compiling the kernel, etc.) when no KVM was used.
>>
>> Conceivably virtualization support in CPUs can have nasty errata.
>> However, you and the other reporter have different CPUs - yours
>> is Ivy Bridge, his CPU is a Penryn.
>>
>> I don't see how KVM helps to trigger this.
>>
>>> - And now I get the trace above, pointing to netlink_attachskb().
>>>
>>> I have difficulty imagining how all these pieces fit into a single
>>> picture.  Is something already screwed up before that?
>>
>> Well, a tiny bit more info will be visible if you change %rdi
>> to, say, %r15 in these two lines of my patch:
>>
>>    /* Save bogus RSP value */
>>    movq %rsp,%rdi
>> ...
>>    push %rdi    /* pt_regs->sp */
>>
>> Then the original %rdi will be visible in the crash message.
> 
> OK, here we go.
> 
>  kvm: zapping shadow pages for mmio generation wraparound
>  kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
>  Exception on user stack 7fff1d7e5ec0: RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
>  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
>  PANIC: double fault, error_code: 0x0
>  CPU: 5 PID: 14285 Comm: fixdep Tainted: GW   4.0.0-rc5-debug1+ #3
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>  task: 88020ba1c690 ti: 880206ba4000 task.ti: 880206ba4000
>  RIP: 0010:[]  [] netlink_attachskb+0x1d/0x1d0
>  RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
>  RAX:  RBX:  RCX: c101
>  RDX:  RSI: 1ebb RDI: 

Thanks for your testing. So the %rdi was NULL... not very informative.

Notice that your every crash is preceded by

kvm: zapping shadow pages for mmio generation wraparound
kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x

This hints that kvm _is_ somehow responsible.
I'm no expert on kvm; I need to take a look around that code...


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 17:07:15 +0100,
Denys Vlasenko wrote:
> 
> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> > At Mon, 23 Mar 2015 10:35:41 +0100,
> > Takashi Iwai wrote:
> >>
> >> At Mon, 23 Mar 2015 10:02:52 +0100,
> >> Takashi Iwai wrote:
> >>>
> >>> At Fri, 20 Mar 2015 19:16:53 +0100,
> >>> Denys Vlasenko wrote:
>  Takashi, are you willing to reproduce the panic one more time,
>  with this patch? I would like to see whether oops messages
>  are more informative with it.
> >>>
> >>> It can't be applied to 4.0-rc5, unfortunately.
> >>>
> >>> arch/x86/kernel/entry_64.S: Assembler messages:
> >>> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
> >>> `alloc_pt_gpregs_on_stack'
> >>> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
> >>> sections) for `+'
> >>> scripts/Makefile.build:294: recipe for target 
> >>> 'arch/x86/kernel/entry_64.o' failed
> >>
> >> I pulled the tip tree on top of 4.0-rc5, built it with your patch,
> >> and now succeeded in getting a better message:
> >>
> >>  kvm: zapping shadow pages for mmio generation wraparound
> >>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
> >>  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
> >> EFLAGS: 00010006
> >>  RIP: 0010:[]  [] 
> >> netlink_attachskb+0x1d/0x1d0
> >>  PANIC: double fault, error_code: 0x0
> >>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
> >>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> >>  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
> >>  RIP: 0010:[]  [] 
> >> netlink_attachskb+0x1d/0x1d0
> >>  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
> >>  RAX:  RBX: 0005 RCX: c101
> >>  RDX:  RSI: 0001 RDI: 7ffd22c23ef0
> >>  RBP: 0ea7 R08: 1ea7 R09: 
> >>  R10: 0309dbf8 R11: 0246 R12: 0001
> >>  R13:  R14: 03026e40 R15: 0309cd50
> >>  FS:  7f89c83c2800() GS:88021d24() 
> >> knlGS:
> >>  CS:  0010 DS:  ES:  CR0: 80050033
> >>  CR2: 016d CR3: d90a CR4: 001427e0
> >>  Stack:
> >>   0ea7  03099c10 0ea7
> >>   0ea7 0001 03099c10 0ea7
> >>   00c84696 03099c88 7f0122c23fb8 0302f610
> >>  Call Trace:
> >>
> >>  Code: 
> >>  10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 
> >> 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 
> >> 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
> >>  Kernel panic - not syncing: Machine halted.
> >>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
> >>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> >>    8800d1b33e28 816f80d2 
> >>   81a22f81 8800d1b33ea8 816f2358 58d7
> >>   0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
> >>  Call Trace:
> >>   [] dump_stack+0x4c/0x6e
> >>   [] panic+0xc0/0x1f3
> >>   [] df_debug+0x35/0x40
> >>   [] do_double_fault+0x87/0x100
> >>   [] do_userpsace_rsp_in_kernel+0x107/0x140
> >>   [] ? netlink_attachskb+0x1d/0x1d0
> >>   [] userpsace_rsp_in_kernel+0x36/0x40
> >>   [] ? netlink_attachskb+0x1d/0x1d0
> >>
> >>
> >> So, it seems to hit in netlink_attachskb().
> >> I'd need to check whether this consistently hits there or just at
> >> random.
> > 
> > I managed to reproduce the bug two more times, and all three show the
> > very same stack trace as above.  So, it's well reproducible.
> 
> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
> 
>   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   5:   55                      push   %rbp
>   6:   48 89 e5                mov    %rsp,%rbp
>   9:   41 56                   push   %r14
>   b:   41 55                   push   %r13
>   d:   49 89 d5                mov    %rdx,%r13
>  10:   41 54                   push   %r12
>  12:   49 89 f4                mov    %rsi,%r12
>  15:   53                      push   %rbx
>  16:   48 89 fb                mov    %rdi,%rbx
>  19:   48 83 ec 30             sub    $0x30,%rsp
>  1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
>                                ^
>  23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
>  29:   7c 25                   jl     50 <_start+0x50>
>  2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
> 
> The ^ instruction is the one which faults. Since you said it
> consistently happens here, this should be a page fault, not an external
> hardware interrupt.
> 
> The code corresponds to the comparison in if():
> 
> int netlink_attachskb(struct sock *sk, struct 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> At Mon, 23 Mar 2015 10:35:41 +0100,
> Takashi Iwai wrote:
>>
>> At Mon, 23 Mar 2015 10:02:52 +0100,
>> Takashi Iwai wrote:
>>>
>>> At Fri, 20 Mar 2015 19:16:53 +0100,
>>> Denys Vlasenko wrote:
 Takashi, are you willing to reproduce the panic one more time,
 with this patch? I would like to see whether oops messages
 are more informative with it.
>>>
>>> It can't be applied to 4.0-rc5, unfortunately.
>>>
>>> arch/x86/kernel/entry_64.S: Assembler messages:
>>> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
>>> `alloc_pt_gpregs_on_stack'
>>> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
>>> sections) for `+'
>>> scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
>>> failed
>>
>> I pulled the tip tree on top of 4.0-rc5, built it with your patch,
>> and now succeeded in getting a better message:
>>
>>  kvm: zapping shadow pages for mmio generation wraparound
>>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
>>  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
>> EFLAGS: 00010006
>>  RIP: 0010:[]  [] 
>> netlink_attachskb+0x1d/0x1d0
>>  PANIC: double fault, error_code: 0x0
>>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
>>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
>>  RIP: 0010:[]  [] 
>> netlink_attachskb+0x1d/0x1d0
>>  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
>>  RAX:  RBX: 0005 RCX: c101
>>  RDX:  RSI: 0001 RDI: 7ffd22c23ef0
>>  RBP: 0ea7 R08: 1ea7 R09: 
>>  R10: 0309dbf8 R11: 0246 R12: 0001
>>  R13:  R14: 03026e40 R15: 0309cd50
>>  FS:  7f89c83c2800() GS:88021d24() knlGS:
>>  CS:  0010 DS:  ES:  CR0: 80050033
>>  CR2: 016d CR3: d90a CR4: 001427e0
>>  Stack:
>>   0ea7  03099c10 0ea7
>>   0ea7 0001 03099c10 0ea7
>>   00c84696 03099c88 7f0122c23fb8 0302f610
>>  Call Trace:
>>
>>  Code: 
>>  10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 
>> 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 
>> 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
>>  Kernel panic - not syncing: Machine halted.
>>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
>>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>    8800d1b33e28 816f80d2 
>>   81a22f81 8800d1b33ea8 816f2358 58d7
>>   0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
>>  Call Trace:
>>   [] dump_stack+0x4c/0x6e
>>   [] panic+0xc0/0x1f3
>>   [] df_debug+0x35/0x40
>>   [] do_double_fault+0x87/0x100
>>   [] do_userpsace_rsp_in_kernel+0x107/0x140
>>   [] ? netlink_attachskb+0x1d/0x1d0
>>   [] userpsace_rsp_in_kernel+0x36/0x40
>>   [] ? netlink_attachskb+0x1d/0x1d0
>>
>>
>> So, it seems to hit in netlink_attachskb().
>> I'd need to check whether this consistently hits there or just at
>> random.
> 
> I managed to reproduce the bug two more times, and all three show the
> very same stack trace as above.  So, it's well reproducible.

FYI: the disassembly of netlink_attachskb (from "Code:" line) is:

   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   5:   55                      push   %rbp
   6:   48 89 e5                mov    %rsp,%rbp
   9:   41 56                   push   %r14
   b:   41 55                   push   %r13
   d:   49 89 d5                mov    %rdx,%r13
  10:   41 54                   push   %r12
  12:   49 89 f4                mov    %rsi,%r12
  15:   53                      push   %rbx
  16:   48 89 fb                mov    %rdi,%rbx
  19:   48 83 ec 30             sub    $0x30,%rsp
  1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
                                ^
  23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
  29:   7c 25                   jl     50 <_start+0x50>
  2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax

The ^ instruction is the one which faults. Since you said it
consistently happens here, this should be a page fault, not an external
hardware interrupt.

The code corresponds to the comparison in if():

int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
		      long *timeo, struct sock *ssk)
{
	struct netlink_sock *nlk;

	nlk = nlk_sk(sk);

	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

%rdi (which is the 1st parameter, "struct sock *sk") is 7ffd22c23ef0
(a userspace address), but it's 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 10:35:41 +0100,
Takashi Iwai wrote:
> 
> At Mon, 23 Mar 2015 10:02:52 +0100,
> Takashi Iwai wrote:
> > 
> > At Fri, 20 Mar 2015 19:16:53 +0100,
> > Denys Vlasenko wrote:
> > > Takashi, are you willing to reproduce the panic one more time,
> > > with this patch? I would like to see whether oops messages
> > > are more informative with it.
> > 
> > It can't be applied to 4.0-rc5, unfortunately.
> > 
> > arch/x86/kernel/entry_64.S: Assembler messages:
> > arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
> > `alloc_pt_gpregs_on_stack'
> > arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
> > sections) for `+'
> > scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
> > failed
> 
> I pulled the tip tree on top of 4.0-rc5, built it with your patch,
> and now succeeded in getting a better message:
> 
>  kvm: zapping shadow pages for mmio generation wraparound
>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
>  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
> EFLAGS: 00010006
>  RIP: 0010:[]  [] 
> netlink_attachskb+0x1d/0x1d0
>  PANIC: double fault, error_code: 0x0
>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
>  RIP: 0010:[]  [] 
> netlink_attachskb+0x1d/0x1d0
>  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
>  RAX:  RBX: 0005 RCX: c101
>  RDX:  RSI: 0001 RDI: 7ffd22c23ef0
>  RBP: 0ea7 R08: 1ea7 R09: 
>  R10: 0309dbf8 R11: 0246 R12: 0001
>  R13:  R14: 03026e40 R15: 0309cd50
>  FS:  7f89c83c2800() GS:88021d24() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 016d CR3: d90a CR4: 001427e0
>  Stack:
>   0ea7  03099c10 0ea7
>   0ea7 0001 03099c10 0ea7
>   00c84696 03099c88 7f0122c23fb8 0302f610
>  Call Trace:
>
>  Code: 
>  10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 
> 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 
> 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
>  Kernel panic - not syncing: Machine halted.
>  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>    8800d1b33e28 816f80d2 
>   81a22f81 8800d1b33ea8 816f2358 58d7
>   0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
>  Call Trace:
>   [] dump_stack+0x4c/0x6e
>   [] panic+0xc0/0x1f3
>   [] df_debug+0x35/0x40
>   [] do_double_fault+0x87/0x100
>   [] do_userpsace_rsp_in_kernel+0x107/0x140
>   [] ? netlink_attachskb+0x1d/0x1d0
>   [] userpsace_rsp_in_kernel+0x36/0x40
>   [] ? netlink_attachskb+0x1d/0x1d0
> 
> 
> So, it seems to hit in netlink_attachskb().
> I'd need to check whether this consistently hits there or just at
> random.

I managed to reproduce the bug two more times, and all three show the
very same stack trace as above.  So, it's well reproducible.

I'm really puzzled now.  We have a few pieces of information:

- git bisection pointed to commit 96b6352c1271:
    x86_64, entry: Remove the syscall exit audit and schedule optimizations
  and reverting this "fixes" the problem indeed.  Even just moving the two
  lines
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_NONE)
  to the beginning of ret_from_sys_call already fixes it.  (Of course I
  can't prove it's the fix, but the system stays stable for a day without
  a crash, while I usually hit the bug within 10 minutes of running the
  full test.)

- Another piece is that the bug happens only when a KVM guest is running.
  The kernel ran without problems for days with similar tasks
  (compiling the kernel, etc.) when no KVM was used.

- And now I get the trace above, pointing to netlink_attachskb().

I have difficulty imagining how all these pieces fit into a single
picture.  Is something already screwed up before that?


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 10:02:52 +0100,
Takashi Iwai wrote:
> 
> At Fri, 20 Mar 2015 19:16:53 +0100,
> Denys Vlasenko wrote:
> > Takashi, are you willing to reproduce the panic one more time,
> > with this patch? I would like to see whether oops messages
> > are more informative with it.
> 
> It can't be applied to 4.0-rc5, unfortunately.
> 
> arch/x86/kernel/entry_64.S: Assembler messages:
> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
> `alloc_pt_gpregs_on_stack'
> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
> sections) for `+'
> scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
> failed

I pulled the tip tree on top of 4.0-rc5, built it with your patch, and
now succeeded in getting a better message:

 kvm: zapping shadow pages for mmio generation wraparound
 kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
 Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
 RIP: 0010:[8162681d]  [8162681d] netlink_attachskb+0x1d/0x1d0
 PANIC: double fault, error_code: 0x0
 CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
 RIP: 0010:[8162681d]  [8162681d] netlink_attachskb+0x1d/0x1d0
 RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
 RAX:  RBX: 0005 RCX: c101
 RDX:  RSI: 0001 RDI: 7ffd22c23ef0
 RBP: 0ea7 R08: 1ea7 R09: 
 R10: 0309dbf8 R11: 0246 R12: 0001
 R13:  R14: 03026e40 R15: 0309cd50
 FS:  7f89c83c2800() GS:88021d24() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 016d CR3: d90a CR4: 001427e0
 Stack:
  0ea7  03099c10 0ea7
  0ea7 0001 03099c10 0ea7
  00c84696 03099c88 7f0122c23fb8 0302f610
 Call Trace:
  <UNK> 
 Code: 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
 Kernel panic - not syncing: Machine halted.
 CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
   8800d1b33e28 816f80d2 
  81a22f81 8800d1b33ea8 816f2358 58d7
  0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
 Call Trace:
  [816f80d2] dump_stack+0x4c/0x6e
  [816f2358] panic+0xc0/0x1f3
  [81046e65] df_debug+0x35/0x40
  [81003fe7] do_double_fault+0x87/0x100
  [81004167] do_userpsace_rsp_in_kernel+0x107/0x140
  [8162681d] ? netlink_attachskb+0x1d/0x1d0
  [81703ca6] userpsace_rsp_in_kernel+0x36/0x40
  [8162681d] ? netlink_attachskb+0x1d/0x1d0


So, it seems to hit in netlink_attachskb().
I'd need to check whether this consistently hits there or just at
random.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Fri, 20 Mar 2015 19:16:53 +0100,
Denys Vlasenko wrote:
> 
> Hi,
> 
> This particular crash was hard to diagnose for two reasons:
> 
> * The CPU would happily use a userspace RSP in kernel mode.
>   The crash comes only later, when we run off the stack,
>   so we lose the information about when it started.
> 
> * The kernel's error handling code is ill-prepared for RSP pointing
>   to the user stack, so we take another page fault trying
>   to dump the stack.
> 
> I prepared a patch which helps with both problems.
> 
> For testing, I inserted an invalid instruction right before SYSRET
> to induce a similar bug, and booted resulting kernel in qemu.
> 
> Before my patch, double fault output starts like this:
> 
> [0.715216] PANIC: double fault, error_code: 0x0
> [0.716033] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #7
> [0.716033] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [0.716033] task: 880007588000 ti: 88000759 task.ti: 88000759
> [0.716033] RIP: 0010:[]  [] do_error_trap+0x47/0x120
> [0.716033] RSP: 0018:7ffd89e7ffb8  EFLAGS: 00010006
> 
> The key here is that it doesn't show at which RIP we took the first
> "bad" exception. The only useful detail visible here is the bad RSP;
> "do_error_trap+0x47" is useless.
> 
> After the patch, the very moment of the "bad" exception is caught:
> 
> [0.666758] Exception on user stack 7ffc1fd0c388: RSP: 0018:7ffc1fd0c3b0  EFLAGS: 00010006
> [0.667285] RIP: 0010:[]  [] ret_from_sys_call+0x5f/0x67
> [0.667285] PANIC: double fault, error_code: 0x
> [0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
> [0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [0.667285] task: 880007588000 ti: 88000759 task.ti: 88000759
> [0.667285] RIP: 0010:[]  [] ret_from_sys_call+0x5f/0x67
> [0.667285] RSP: 0018:7ffc1fd0c3b0  EFLAGS: 00010006
> 
> The exception happened at "ret_from_sys_call+0x5f".
> We also no longer take another page fault;
> the output proceeds like this:
> 
> ...
> [0.667285] RAX: 07a0 RBX: 7ffc1fd0c4e0 RCX: c101
> [0.667285] RDX: 8800 RSI: 5401 RDI: 7ffc1fd0c388
> [0.667285] RBP: 7ffc1fd0c570 R08: 0010 R09: 
> [0.667285] R10: 7ffc1fd0c650 R11: 0202 R12: 0120
> [0.667285] R13: 005f7b78 R14:  R15: 004c9d44
> [0.667285] FS:  () GS:880007a0() knlGS:
> [0.667285] CS:  0010 DS:  ES:  CR0: 8005003b
> [0.667285] CR2: 004ad1e4 CR3: 00101000 CR4: 07f0
> [0.667285] Stack:
> [0.667285]  0018 7ffc1fd0c490 7ffc1fd0c3d0 
> [0.667285]    7ffc1fd0c490 
> [0.667285]     
> [0.667285] Call Trace:
> [0.667285]  
> [0.667285] Code: 8b 44 24 50 48 8b 54 24 60 48 8b 74 24 68 48 8b 7c 24 70 48 8b 8c 24 80 00 00 00 4c 8b 9c 24 90 00 00 00 48 8b a4 24 98 00 00 00 <0f> 0b 0f 01 f8 48 0f 07 48 c7 84 24 a0 00 00 00 2b 00 00 00 48
> [0.667285] Kernel panic - not syncing: Machine halted.
> [0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
> [0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [0.667285]   880007593e28 81789625 880007588000
> [0.667285]  81a3b181 880007593ea8 817840aa 88000759
> [0.667285]  0008 880007593eb8 880007593e58 0001
> [0.667285] Call Trace:
> [0.667285]  [] dump_stack+0x4c/0x65
> [0.667285]  [] panic+0xc6/0x1ff
> [0.667285]  [] df_debug+0x35/0x40
> [0.667285]  [] do_double_fault+0x87/0x100
> [0.667285]  [] do_userpsace_rsp_in_kernel+0x107/0x140
> [0.667285]  [] ? ret_from_sys_call+0x5f/0x67
> [0.667285]  [] userpsace_rsp_in_kernel+0x39/0x40
> [0.667285]  [] ? ret_from_sys_call+0x5f/0x67
> [0.667285] Kernel Offset: disabled
> [0.667285] Rebooting in 1 seconds..
> 
> Takashi, are you willing to reproduce the panic one more time,
> with this patch? I would like to see whether oops messages
> are more informative with it.

It can't be applied to 4.0-rc5, unfortunately.

arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1725: Error: no such instruction: `alloc_pt_gpregs_on_stack'
arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* sections) for `+'
scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed


Takashi

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 10:02:52 +0100,
Takashi Iwai wrote:
 
 At Fri, 20 Mar 2015 19:16:53 +0100,
 Denys Vlasenko wrote:
  Takashi, are you willing to reproduce the panic one more time,
  with this patch? I would like to see whether oops messages
  are more informative with it.
 
 It can't be applied to 4.0-rc5, unfortunately.
 
 arch/x86/kernel/entry_64.S: Assembler messages:
 arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
 `alloc_pt_gpregs_on_stack'
 arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
 sections) for `+'
 scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
 failed

I pulled tip tree on top of 4.0-rc5, built with your patch and now
succeeded to get a better message:

 kvm: zapping shadow pages for mmio generation wraparound
 kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
 Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  EFLAGS: 
00010006
 RIP: 0010:[8162681d]  [8162681d] 
netlink_attachskb+0x1d/0x1d0
 PANIC: double fault, error_code: 0x0
 CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
 RIP: 0010:[8162681d]  [8162681d] 
netlink_attachskb+0x1d/0x1d0
 RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
 RAX:  RBX: 0005 RCX: c101
 RDX:  RSI: 0001 RDI: 7ffd22c23ef0
 RBP: 0ea7 R08: 1ea7 R09: 
 R10: 0309dbf8 R11: 0246 R12: 0001
 R13:  R14: 03026e40 R15: 0309cd50
 FS:  7f89c83c2800() GS:88021d24() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 016d CR3: d90a CR4: 001427e0
 Stack:
  0ea7  03099c10 0ea7
  0ea7 0001 03099c10 0ea7
  00c84696 03099c88 7f0122c23fb8 0302f610
 Call Trace:
  UNK 
 Code: 
 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 8b 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00
 Kernel panic - not syncing: Machine halted.
 CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
   8800d1b33e28 816f80d2 
  81a22f81 8800d1b33ea8 816f2358 58d7
  0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
 Call Trace:
  [816f80d2] dump_stack+0x4c/0x6e
  [816f2358] panic+0xc0/0x1f3
  [81046e65] df_debug+0x35/0x40
  [81003fe7] do_double_fault+0x87/0x100
  [81004167] do_userpsace_rsp_in_kernel+0x107/0x140
  [8162681d] ? netlink_attachskb+0x1d/0x1d0
  [81703ca6] userpsace_rsp_in_kernel+0x36/0x40
  [8162681d] ? netlink_attachskb+0x1d/0x1d0


So, it seems to hit in netlink_attachskb().
I need to check whether it consistently hits there or just at
random.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 02:22 PM, Takashi Iwai wrote:
 At Mon, 23 Mar 2015 10:35:41 +0100,
 Takashi Iwai wrote:

 At Mon, 23 Mar 2015 10:02:52 +0100,
 Takashi Iwai wrote:

 At Fri, 20 Mar 2015 19:16:53 +0100,
 Denys Vlasenko wrote:
 Takashi, are you willing to reproduce the panic one more time,
 with this patch? I would like to see whether oops messages
 are more informative with it.

 It can't be applied to 4.0-rc5, unfortunately.

 arch/x86/kernel/entry_64.S: Assembler messages:
 arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
 `alloc_pt_gpregs_on_stack'
 arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
 sections) for `+'
 scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' 
 failed

 I pulled tip tree on top of 4.0-rc5, built with your patch and now
 succeeded to get a better message:

  kvm: zapping shadow pages for mmio generation wraparound
  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
 EFLAGS: 00010006
  RIP: 0010:[8162681d]  [8162681d] 
 netlink_attachskb+0x1d/0x1d0
  PANIC: double fault, error_code: 0x0
  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
  RIP: 0010:[8162681d]  [8162681d] 
 netlink_attachskb+0x1d/0x1d0
  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
  RAX:  RBX: 0005 RCX: c101
  RDX:  RSI: 0001 RDI: 7ffd22c23ef0
  RBP: 0ea7 R08: 1ea7 R09: 
  R10: 0309dbf8 R11: 0246 R12: 0001
  R13:  R14: 03026e40 R15: 0309cd50
  FS:  7f89c83c2800() GS:88021d24() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 016d CR3: d90a CR4: 001427e0
  Stack:
   0ea7  03099c10 0ea7
   0ea7 0001 03099c10 0ea7
   00c84696 03099c88 7f0122c23fb8 0302f610
  Call Trace:
   UNK 
  Code: 
  10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 
 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 8b 87 68 01 00 00 39 
 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
  Kernel panic - not syncing: Machine halted.
  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
    8800d1b33e28 816f80d2 
   81a22f81 8800d1b33ea8 816f2358 58d7
   0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
  Call Trace:
   [816f80d2] dump_stack+0x4c/0x6e
   [816f2358] panic+0xc0/0x1f3
   [81046e65] df_debug+0x35/0x40
   [81003fe7] do_double_fault+0x87/0x100
   [81004167] do_userpsace_rsp_in_kernel+0x107/0x140
   [8162681d] ? netlink_attachskb+0x1d/0x1d0
   [81703ca6] userpsace_rsp_in_kernel+0x36/0x40
   [8162681d] ? netlink_attachskb+0x1d/0x1d0


 So, it seems hitting in netlink_attachskb().
 I'd need to check whether this consistently hits there or just at
 random.
 
 I managed to reproduce the bug two more times, and all three show the
 very same stack trace like the above.  So, it's well reproducible.

FYI: the disassembly of netlink_attachskb (from the Code: line) is:

   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   5:   55                      push   %rbp
   6:   48 89 e5                mov    %rsp,%rbp
   9:   41 56                   push   %r14
   b:   41 55                   push   %r13
   d:   49 89 d5                mov    %rdx,%r13
  10:   41 54                   push   %r12
  12:   49 89 f4                mov    %rsi,%r12
  15:   53                      push   %rbx
  16:   48 89 fb                mov    %rdi,%rbx
  19:   48 83 ec 30             sub    $0x30,%rsp
  1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
                                ^
  23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
  29:   7c 25                   jl     50 <_start+0x50>
  2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax

The ^ instruction is the one which faults. Since you said it
consistently happens here, this should be a page fault, not an external
hardware interrupt.

The code corresponds to the comparison in if():

int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
  long *timeo, struct sock *ssk)
{
struct netlink_sock *nlk;

nlk = nlk_sk(sk);

if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

%rdi (which is 1st param, struct sock *sk) is 7ffd22c23ef0
(userspace 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 07:38 PM, Andy Lutomirski wrote:
 cmpq $__NR_syscall_max,%rax
 ja ret_from_sys_call
 movq %r10,%rcx
 call *sys_call_table(,%rax,8)  # XXX:rip relative
 movq %rax,RAX-ARGOFFSET(%rsp)
 ret_from_sys_call:
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 
 jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 ...
 ...
 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call
 ...
 ...
 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF

 You reverted that by moving this insn to be after first 
 DISABLE_INTERRUPTS(CLBR_NONE).

 I also don't see how moving that check (even if it is wrong in a more
 benign way) can have such a drastic effect.
 
 I bet I see it.  I have the advantage of having stared at KVM code and
 cursed at it more recently than you, I suspect.  KVM does awful, awful
 things to CPU state, and, as an optimization, it allows kernel code to
 run with CPU state that would be totally invalid in user mode.  This
 happens through a bunch of hooks, including this bit in __switch_to:
 
 /*
  * Now maybe reload the debug registers and handle I/O bitmaps
  */
 if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
 __switch_to_xtra(prev_p, next_p, tss);
 
 IOW, we *change* tif during context switches.
 
 
 The race looks like this:
 
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 jnz int_ret_from_sys_call_fixup/* Go the the slow path */
 
 --- preempted here, switch to KVM guest ---
 
 KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
 happen to be a *32-bit* KVM guest, perhaps?
 
 Now KVM schedules, calling __switch_to.  __switch_to sets
 _TIF_USER_RETURN_NOTIFY.

Clear up to now...

 We IRET back to the syscall exit code,

So we end up being just after the testl, right?
We go into int_ret_from_sys_call_fixup.
We FIXUP_TOP_OF_STACK - now iret frame contains correct values.
Then we jump to int_ret_from_sys_call.

 turn off interrupts, and do sysret.  We are now screwed.

I don't understand. Where exactly would it go wrong?

On sysret, rsp would be restored from PER_CPU(old_rsp), right?
We'd end up in *userspace* with userspace rsp.

Moreover, since we FIXUPed the iret frame, it does not even matter
how we exit to userspace. Either sysret or iret would work.


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 12:07 PM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 03/23/2015 07:38 PM, Andy Lutomirski wrote:
 cmpq $__NR_syscall_max,%rax
 ja ret_from_sys_call
 movq %r10,%rcx
 call *sys_call_table(,%rax,8)  # XXX:rip relative
 movq %rax,RAX-ARGOFFSET(%rsp)
 ret_from_sys_call:
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 
 jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 ...
 ...
 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call
 ...
 ...
 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF

 You reverted that by moving this insn to be after first 
 DISABLE_INTERRUPTS(CLBR_NONE).

 I also don't see how moving that check (even if it is wrong in a more
 benign way) can have such a drastic effect.

 I bet I see it.  I have the advantage of having stared at KVM code and
 cursed at it more recently than you, I suspect.  KVM does awful, awful
 things to CPU state, and, as an optimization, it allows kernel code to
 run with CPU state that would be totally invalid in user mode.  This
 happens through a bunch of hooks, including this bit in __switch_to:

 /*
  * Now maybe reload the debug registers and handle I/O bitmaps
  */
 if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
 __switch_to_xtra(prev_p, next_p, tss);

 IOW, we *change* tif during context switches.


 The race looks like this:

 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 jnz int_ret_from_sys_call_fixup/* Go the the slow path */

 --- preempted here, switch to KVM guest ---

 KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
 happen to be a *32-bit* KVM guest, perhaps?

 Now KVM schedules, calling __switch_to.  __switch_to sets
 _TIF_USER_RETURN_NOTIFY.

 Clear up to now...

 We IRET back to the syscall exit code,

 So we end up being just after the testl, right?
 We go into int_ret_from_sys_call_fixup.

Nope, other way around.  We saw no work bits set in testl, but one or
more of those bits got set while we were preempted, so when we return
we *don't* go to int_ret_from_sys_call_fixup.  I don't think that the
resulting sysret itself is harmful, but I think we're now running user
code with some MSRs programmed wrong.  The next syscall could do bad
things, such as failing to clear IF.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Fri, 20 Mar 2015 19:16:53 +0100,
Denys Vlasenko wrote:
 
 Hi,
 
 This particular crash was hard to diagnose because of two reasons:
 
 * CPU would happily use userspace RSP in kernel mode.
   Crash comes only later, when we run off the stack.
   We lose information when it started.
 
 * Kernel's error handling code is ill prepared for RSP pointing
   to user stack. So we take another page fault trying
   to dump stack.
 
 I prepared a patch which helps with both problems.
 
 For testing, I inserted an invalid instruction right before SYSRET
 to induce a similar bug, and booted resulting kernel in qemu.
 
 Before my patch, double fault output starts like this:
 
 [0.715216] PANIC: double fault, error_code: 0x0
 [0.716033] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #7
 [0.716033] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [0.716033] task: 880007588000 ti: 88000759 task.ti: 
 88000759
 [0.716033] RIP: 0010:[81017057]  [81017057] 
 do_error_trap+0x47/0x120
 [0.716033] RSP: 0018:7ffd89e7ffb8  EFLAGS: 00010006
 
 The key here is that it doesn't show at which RIP we took the first
 bad exception. The only useful detail visible here is bad RSP.
 do_error_trap+0x47 is useless.
 
 After the patch, the very moment of bad exception is caught:
 
 [0.666758] Exception on user stack 7ffc1fd0c388: RSP: 
 0018:7ffc1fd0c3b0  EFLAGS: 00010006
 [0.667285] RIP: 0010:[81793688]  [81793688] 
 ret_from_sys_call+0x5f/0x67
 [0.667285] PANIC: double fault, error_code: 0x
 [0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
 [0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [0.667285] task: 880007588000 ti: 88000759 task.ti: 
 88000759
 [0.667285] RIP: 0010:[81793688]  [81793688] 
 ret_from_sys_call+0x5f/0x67
 [0.667285] RSP: 0018:7ffc1fd0c3b0  EFLAGS: 00010006
 
 The exception happened at ret_from_sys_call+0x5f.
 We also won't take another page fault any more,
 output proceeds like this:
 
 ...
 [0.667285] RAX: 07a0 RBX: 7ffc1fd0c4e0 RCX: 
 c101
 [0.667285] RDX: 8800 RSI: 5401 RDI: 
 7ffc1fd0c388
 [0.667285] RBP: 7ffc1fd0c570 R08: 0010 R09: 
 
 [0.667285] R10: 7ffc1fd0c650 R11: 0202 R12: 
 0120
 [0.667285] R13: 005f7b78 R14:  R15: 
 004c9d44
 [0.667285] FS:  () GS:880007a0() 
 knlGS:
 [0.667285] CS:  0010 DS:  ES:  CR0: 8005003b
 [0.667285] CR2: 004ad1e4 CR3: 00101000 CR4: 
 07f0
 [0.667285] Stack:
 [0.667285]  0018 7ffc1fd0c490 7ffc1fd0c3d0 
 
 [0.667285]    7ffc1fd0c490 
 
 [0.667285]     
 
 [0.667285] Call Trace:
 [0.667285]  UNK
 [0.667285] Code: 8b 44 24 50 48 8b 54 24 60 48 8b 74 24 68 48 8b 7c 24 70 
 48 8b 8c 24 80 00 00 00 4c 8b 9c 24 90 00 00 00 48 8b a4 24 98 00 00 00 0f 
 0b 0f 01 f8 48 0f 07 48 c7 84 24 a0 00 00 00 2b 00 00 00 48
 [0.667285] Kernel panic - not syncing: Machine halted.
 [0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
 [0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [0.667285]   880007593e28 81789625 
 880007588000
 [0.667285]  81a3b181 880007593ea8 817840aa 
 88000759
 [0.667285]  0008 880007593eb8 880007593e58 
 0001
 [0.667285] Call Trace:
 [0.667285]  [81789625] dump_stack+0x4c/0x65
 [0.667285]  [817840aa] panic+0xc6/0x1ff
 [0.667285]  [81059ee5] df_debug+0x35/0x40
 [0.667285]  [81017e37] do_double_fault+0x87/0x100
 [0.667285]  [81017fb7] do_userpsace_rsp_in_kernel+0x107/0x140
 [0.667285]  [81793688] ? ret_from_sys_call+0x5f/0x67
 [0.667285]  [81795b49] userpsace_rsp_in_kernel+0x39/0x40
 [0.667285]  [81793688] ? ret_from_sys_call+0x5f/0x67
 [0.667285] Kernel Offset: disabled
 [0.667285] Rebooting in 1 seconds..
 
 Takashi, are you willing to reproduce the panic one more time,
 with this patch? I would like to see whether oops messages
 are more informative with it.

It can't be applied to 4.0-rc5, unfortunately.

arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1725: Error: no such instruction: `alloc_pt_gpregs_on_stack'
arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* sections) for `+'
scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed


Takashi

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 17:07:15 +0100,
Denys Vlasenko wrote:
 
 On 03/23/2015 02:22 PM, Takashi Iwai wrote:
  At Mon, 23 Mar 2015 10:35:41 +0100,
  Takashi Iwai wrote:
 
  At Mon, 23 Mar 2015 10:02:52 +0100,
  Takashi Iwai wrote:
 
  At Fri, 20 Mar 2015 19:16:53 +0100,
  Denys Vlasenko wrote:
  Takashi, are you willing to reproduce the panic one more time,
  with this patch? I would like to see whether oops messages
  are more informative with it.
 
  It can't be applied to 4.0-rc5, unfortunately.
 
  arch/x86/kernel/entry_64.S: Assembler messages:
  arch/x86/kernel/entry_64.S:1725: Error: no such instruction: 
  `alloc_pt_gpregs_on_stack'
  arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* 
  sections) for `+'
  scripts/Makefile.build:294: recipe for target 
  'arch/x86/kernel/entry_64.o' failed
 
  I pulled tip tree on top of 4.0-rc5, built with your patch and now
  succeeded to get a better message:
 
   kvm: zapping shadow pages for mmio generation wraparound
   kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
   Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
  EFLAGS: 00010006
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   PANIC: double fault, error_code: 0x0
   CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
   Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
   task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
   RAX:  RBX: 0005 RCX: c101
   RDX:  RSI: 0001 RDI: 7ffd22c23ef0
   RBP: 0ea7 R08: 1ea7 R09: 
   R10: 0309dbf8 R11: 0246 R12: 0001
   R13:  R14: 03026e40 R15: 0309cd50
   FS:  7f89c83c2800() GS:88021d24() 
  knlGS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 016d CR3: d90a CR4: 001427e0
   Stack:
0ea7  03099c10 0ea7
0ea7 0001 03099c10 0ea7
00c84696 03099c88 7f0122c23fb8 0302f610
   Call Trace:
UNK 
   Code: 
   10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 
  56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 8b 87 68 01 00 
  00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
   Kernel panic - not syncing: Machine halted.
   CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
   Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 8800d1b33e28 816f80d2 
81a22f81 8800d1b33ea8 816f2358 58d7
0008 8800d1b33eb8 8800d1b33e58 8800d1b33ea8
   Call Trace:
[816f80d2] dump_stack+0x4c/0x6e
[816f2358] panic+0xc0/0x1f3
[81046e65] df_debug+0x35/0x40
[81003fe7] do_double_fault+0x87/0x100
[81004167] do_userpsace_rsp_in_kernel+0x107/0x140
[8162681d] ? netlink_attachskb+0x1d/0x1d0
[81703ca6] userpsace_rsp_in_kernel+0x36/0x40
[8162681d] ? netlink_attachskb+0x1d/0x1d0
 
 
  So, it seems hitting in netlink_attachskb().
  I'd need to check whether this consistently hits there or just at
  random.
  
  I managed to reproduce the bug two more times, and all three show the
  very same stack trace like the above.  So, it's well reproducible.
 
 FYI: the disassembly of netlink_attachskb (from Code: line) is:
 
    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    5:   55                      push   %rbp
    6:   48 89 e5                mov    %rsp,%rbp
    9:   41 56                   push   %r14
    b:   41 55                   push   %r13
    d:   49 89 d5                mov    %rdx,%r13
   10:   41 54                   push   %r12
   12:   49 89 f4                mov    %rsi,%r12
   15:   53                      push   %rbx
   16:   48 89 fb                mov    %rdi,%rbx
   19:   48 83 ec 30             sub    $0x30,%rsp
   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
                                 ^
   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
   29:   7c 25                   jl     50 <_start+0x50>
   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
 
 The ^ instruction is the one which faults. Since you said it
 consistently happens here, this should be a page fault, not an external
 hardware interrupt.
 
 The code corresponds to the comparison in if():
 
 int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
   long *timeo, struct sock *ssk)
 {
 struct netlink_sock 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Stefan Seyfried
Am 23.03.2015 um 19:38 schrieb Andy Lutomirski:
 I bet I see it.  I have the advantage of having stared at KVM code and
 cursed at it more recently than you, I suspect.  KVM does awful, awful
 things to CPU state, and, as an optimization, it allows kernel code to
 run with CPU state that would be totally invalid in user mode.  This
 happens through a bunch of hooks, including this bit in __switch_to:
 
 /*
  * Now maybe reload the debug registers and handle I/O bitmaps
  */
 if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
  task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
 __switch_to_xtra(prev_p, next_p, tss);
 
 IOW, we *change* tif during context switches.
 
 
 The race looks like this:
 
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 jnz int_ret_from_sys_call_fixup/* Go the the slow path */
 
 --- preempted here, switch to KVM guest ---
 
 KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
 happen to be a *32-bit* KVM guest, perhaps?

Not in my case (Penryn CPU); there it was 64-bit guests.

 Now KVM schedules, calling __switch_to.  __switch_to sets
 _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
 off interrupts, and do sysret.  We are now screwed.
 
 I don't know why this manifests in this particular failure, but any
 number of terrible things could happen now.
 
 FWIW, this will affect things other than KVM.  For example, SIGKILL
 sent while a process is sleeping in that two-instruction window won't
 work.
 
 Takashi, can you re-send your patch so we can review it for real in
 light of this race?
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 11:48:42 -0700,
Andy Lutomirski wrote:
 
 On Mon, Mar 23, 2015 at 11:38 AM, Andy Lutomirski l...@amacapital.net wrote:
  On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko dvlas...@redhat.com wrote:
  On 03/23/2015 02:22 PM, Takashi Iwai wrote:
  At Mon, 23 Mar 2015 10:35:41 +0100,
  Takashi Iwai wrote:
 
  At Mon, 23 Mar 2015 10:02:52 +0100,
  Takashi Iwai wrote:
 
  At Fri, 20 Mar 2015 19:16:53 +0100,
  Denys Vlasenko wrote:
 
  I'm really puzzled now.  We have a few pieces of information:
 
  - git bisection pointed the commit 96b6352c1271:
  x86_64, entry: Remove the syscall exit audit and schedule 
  optimizations
and reverting this fixes the problem indeed.  Even just moving two
lines
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
at the beginning of ret_from_sys_call already fixes.  (Of course I
can't prove the fix but it stabilizes for a day without crash while
usually I hit the bug in 10 minutes in full test running.)
 
  The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
  interrupt-disabled region to interrupt-enabled:
 
  cmpq $__NR_syscall_max,%rax
  ja ret_from_sys_call
  movq %r10,%rcx
  call *sys_call_table(,%rax,8)  # XXX:rip relative
  movq %rax,RAX-ARGOFFSET(%rsp)
  ret_from_sys_call:
  testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  
  jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
  ...
  ...
  int_ret_from_sys_call_fixup:
  FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
  jmp int_ret_from_sys_call
  ...
  ...
  GLOBAL(int_ret_from_sys_call)
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
 
  You reverted that by moving this insn to be after first 
  DISABLE_INTERRUPTS(CLBR_NONE).
 
  I also don't see how moving that check (even if it is wrong in a more
  benign way) can have such a drastic effect.
 
  I bet I see it.  I have the advantage of having stared at KVM code and
  cursed at it more recently than you, I suspect.  KVM does awful, awful
  things to CPU state, and, as an optimization, it allows kernel code to
  run with CPU state that would be totally invalid in user mode.  This
  happens through a bunch of hooks, including this bit in __switch_to:
 
  /*
   * Now maybe reload the debug registers and handle I/O bitmaps
   */
  if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
   task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
  __switch_to_xtra(prev_p, next_p, tss);
 
  IOW, we *change* tif during context switches.
 
 
  The race looks like this:
 
  testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
  jnz int_ret_from_sys_call_fixup/* Go the the slow path */
 
  --- preempted here, switch to KVM guest ---
 
  KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
  happen to be a *32-bit* KVM guest, perhaps?
 
  Now KVM schedules, calling __switch_to.  __switch_to sets
  _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
  off interrupts, and do sysret.  We are now screwed.
 
  I don't know why this manifests in this particular failure, but any
  number of terrible things could happen now.
 
  FWIW, this will affect things other than KVM.  For example, SIGKILL
  sent while a process is sleeping in that two-instruction window won't
  work.
 
  Takashi, can you re-send your patch so we can review it for real in
  light of this race?
 
 Never mind, I'm testing a slightly fancier patch.

OK, I'll wait for your test patch.


thanks,

Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 18:46:45 +0100,
Denys Vlasenko wrote:
 
 On 03/23/2015 06:18 PM, Takashi Iwai wrote:
  At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
  I pulled tip tree on top of 4.0-rc5, built with your patch and now
  succeeded to get a better message:
 
   kvm: zapping shadow pages for mmio generation wraparound
   kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
   Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
  EFLAGS: 00010006
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   PANIC: double fault, error_code: 0x0
   CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ 
  #2
   Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
   task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
   RAX:  RBX: 0005 RCX: c101
   RDX:  RSI: 0001 RDI: 7ffd22c23ef0
 
  FYI: the disassembly of netlink_attachskb (from Code: line) is:
 
    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    5:   55                      push   %rbp
    6:   48 89 e5                mov    %rsp,%rbp
    9:   41 56                   push   %r14
    b:   41 55                   push   %r13
    d:   49 89 d5                mov    %rdx,%r13
   10:   41 54                   push   %r12
   12:   49 89 f4                mov    %rsi,%r12
   15:   53                      push   %rbx
   16:   48 89 fb                mov    %rdi,%rbx
   19:   48 83 ec 30             sub    $0x30,%rsp
   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
                                 ^
   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
   29:   7c 25                   jl     50 <_start+0x50>
   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
 
  The ^ instruction is the one which faults. Since you said it
  consistently happens here, this should be a page fault, not an external
  hardware interrupt.
 
  The code corresponds to the comparison in if():
 
  int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
long *timeo, struct sock *ssk)
  {
  struct netlink_sock *nlk;
 
  nlk = nlk_sk(sk);
 
  if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
 
  - Another piece is that the bug happens only when a KVM is running.
The kernel ran without problem over days with similar tasks
(compiling kernel, etc) when no KVM was used.
 
  Conceivably virtualization support in CPUs can have nasty erratas.
  However, you and other reporter have different CPUs - yours
  is Ivy Bridge, his CPU is a Penryn.
 
  I don't see the path how KVM helps to trigger this.
 
  - And now I get the trace as above, pointing netlink_attachskb().
 
  I have a difficulty to imagine how all these pieces fit into a single
  picture.  Is something already screwed up before that?
 
  Well, a tiny bit more info will be seen if you'd change %rdi
  to, say, %r15 in these two lines in my patch:
 
 /* Save bogus RSP value */
  movq    %rsp,%rdi
   ...
  push    %rdi    /* pt_regs->sp */
 
  Then original %rdi will be visible in the crash message.
  
  OK, here we go.
  
   kvm: zapping shadow pages for mmio generation wraparound
   kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
   Exception on user stack 7fff1d7e5ec0: RSP: 0018:7fff1d7e5ef8  
  EFLAGS: 00010002
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   PANIC: double fault, error_code: 0x0
   CPU: 5 PID: 14285 Comm: fixdep Tainted: GW   4.0.0-rc5-debug1+ 
  #3
   Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
   task: 88020ba1c690 ti: 880206ba4000 task.ti: 880206ba4000
   RIP: 0010:[8162681d]  [8162681d] 
  netlink_attachskb+0x1d/0x1d0
   RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
   RAX:  RBX:  RCX: c101
   RDX:  RSI: 1ebb RDI: 
 
 Thanks for your testing. So the %rdi was NULL... not very informative.
 
 Notice that your every crash is preceded by
 
 kvm: zapping shadow pages for mmio generation wraparound
 kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
 
 This hints that kvm _is_ somehow responsible.

It's likely irrelevant, as this appears when a VM starts, not at
crash time.  I get this message all the time.  Sorry for the
confusion.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Denys Vlasenko
On 03/23/2015 06:18 PM, Takashi Iwai wrote:
 At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
 I pulled tip tree on top of 4.0-rc5, built with your patch and now
 succeeded to get a better message:

  kvm: zapping shadow pages for mmio generation wraparound
  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
  Exception on user stack 7ffd22c23ef0: RSP: 0018:7ffd22c23f28  
 EFLAGS: 00010006
  RIP: 0010:[8162681d]  [8162681d] 
 netlink_attachskb+0x1d/0x1d0
  PANIC: double fault, error_code: 0x0
  CPU: 1 PID: 10819 Comm: cc1 Tainted: GW   4.0.0-rc5-debug1+ #2
  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  task: 8800d1b34b10 ti: 8800d1b3 task.ti: 8800d1b3
  RIP: 0010:[8162681d]  [8162681d] 
 netlink_attachskb+0x1d/0x1d0
  RSP: 0018:7ffd22c23f28  EFLAGS: 00010006
  RAX:  RBX: 0005 RCX: c101
  RDX:  RSI: 0001 RDI: 7ffd22c23ef0

 FYI: the disassembly of netlink_attachskb (from Code: line) is:

    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    5:   55                      push   %rbp
    6:   48 89 e5                mov    %rsp,%rbp
    9:   41 56                   push   %r14
    b:   41 55                   push   %r13
    d:   49 89 d5                mov    %rdx,%r13
   10:   41 54                   push   %r12
   12:   49 89 f4                mov    %rsi,%r12
   15:   53                      push   %rbx
   16:   48 89 fb                mov    %rdi,%rbx
   19:   48 83 ec 30             sub    $0x30,%rsp
   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
                                 ^
   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
   29:   7c 25                   jl     50 <_start+0x50>
   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax

 The ^ instruction is the one which faults. Since you said it
 consistently happens here, this should be a page fault, not an external
 hardware interrupt.

 The code corresponds to the comparison in if():

 int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
   long *timeo, struct sock *ssk)
 {
 struct netlink_sock *nlk;

 nlk = nlk_sk(sk);

 if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

 - Another piece is that the bug happens only when a KVM is running.
   The kernel ran without problem over days with similar tasks
   (compiling kernel, etc) when no KVM was used.

 Conceivably virtualization support in CPUs can have nasty erratas.
 However, you and other reporter have different CPUs - yours
 is Ivy Bridge, his CPU is a Penryn.

 I don't see the path how KVM helps to trigger this.

 - And now I get the trace as above, pointing netlink_attachskb().

 I have difficulty imagining how all these pieces fit into a single
 picture.  Is something already screwed up before that?

 Well, a little more info will be visible if you change %rdi
 to, say, %r15 in these two lines of my patch:

        /* Save bogus RSP value */
        movq    %rsp,%rdi
 ...
        push    %rdi            /* pt_regs->sp */

 Then the original %rdi will be visible in the crash message.
 
 OK, here we go.
 
  kvm: zapping shadow pages for mmio generation wraparound
  kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
  Exception on user stack 7fff1d7e5ec0: RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
  RIP: 0010:[8162681d]  [8162681d] netlink_attachskb+0x1d/0x1d0
  PANIC: double fault, error_code: 0x0
  CPU: 5 PID: 14285 Comm: fixdep Tainted: GW   4.0.0-rc5-debug1+ #3
  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  task: 88020ba1c690 ti: 880206ba4000 task.ti: 880206ba4000
  RIP: 0010:[8162681d]  [8162681d] netlink_attachskb+0x1d/0x1d0
  RSP: 0018:7fff1d7e5ef8  EFLAGS: 00010002
  RAX:  RBX:  RCX: c101
  RDX:  RSI: 1ebb RDI: 

Thanks for testing. So %rdi was NULL... not very informative.

Notice that every one of your crashes is preceded by

kvm: zapping shadow pages for mmio generation wraparound
kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x

This hints that KVM _is_ somehow responsible.
I'm no KVM expert; I need to take a look around that code...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 03/23/2015 02:22 PM, Takashi Iwai wrote:
 At Mon, 23 Mar 2015 10:35:41 +0100,
 Takashi Iwai wrote:

 At Mon, 23 Mar 2015 10:02:52 +0100,
 Takashi Iwai wrote:

 At Fri, 20 Mar 2015 19:16:53 +0100,
 Denys Vlasenko wrote:

 I'm really puzzled now.  We have a few pieces of information:

 - git bisection pointed the commit 96b6352c1271:
 x86_64, entry: Remove the syscall exit audit and schedule optimizations
   and reverting this fixes the problem indeed.  Even just moving two
   lines
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
   at the beginning of ret_from_sys_call already fixes.  (Of course I
   can't prove the fix but it stabilizes for a day without crash while
   usually I hit the bug in 10 minutes in full test running.)

 The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
 interrupt-disabled region to interrupt-enabled:

 cmpq $__NR_syscall_max,%rax
 ja ret_from_sys_call
 movq %r10,%rcx
 call *sys_call_table(,%rax,8)  # XXX:rip relative
 movq %rax,RAX-ARGOFFSET(%rsp)
 ret_from_sys_call:
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 
 jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 ...
 ...
 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call
 ...
 ...
 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF

 You reverted that by moving this insn to be after first 
 DISABLE_INTERRUPTS(CLBR_NONE).

 I also don't see how moving that check (even if it is wrong in a more
 benign way) can have such a drastic effect.

I bet I see it.  I have the advantage of having stared at KVM code and
cursed at it more recently than you, I suspect.  KVM does awful, awful
things to CPU state, and, as an optimization, it allows kernel code to
run with CPU state that would be totally invalid in user mode.  This
happens through a bunch of hooks, including this bit in __switch_to:

/*
 * Now maybe reload the debug registers and handle I/O bitmaps
 */
if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
             task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
        __switch_to_xtra(prev_p, next_p, tss);

IOW, we *change* tif during context switches.
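To make the point concrete, here is a minimal userspace C model of a TIF flag moving between tasks on a context switch. It is a hypothetical sketch, not kernel code: it assumes that __switch_to_xtra() hands a pending user-return-notifier flag from the outgoing task to the incoming one, which is what the kernel's propagate_user_return_notify() helper does as far as I understand it. struct model_task and MODEL_TIF_USER_RETURN_NOTIFY are made-up stand-ins for the task's thread_info flags and _TIF_USER_RETURN_NOTIFY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value; the real _TIF_* constants are defined in
 * arch/x86/include/asm/thread_info.h. */
#define MODEL_TIF_USER_RETURN_NOTIFY (1u << 0)

/* Hypothetical stand-in for a task's thread_info flags. */
struct model_task {
    unsigned int flags;
};

static bool model_has_flag(const struct model_task *t, unsigned int flag)
{
    return (t->flags & flag) != 0;
}

/* Models the hand-over on a context switch: if the outgoing task (e.g. a
 * KVM vCPU thread) has a user-return notifier pending, the flag is cleared
 * on it and set on the incoming task, since the notifier list is per-CPU
 * and must run before any return to user space on this CPU. */
static void model_switch_to(struct model_task *prev, struct model_task *next)
{
    assert(prev != next);
    if (model_has_flag(prev, MODEL_TIF_USER_RETURN_NOTIFY)) {
        prev->flags &= ~MODEL_TIF_USER_RETURN_NOTIFY;
        next->flags |= MODEL_TIF_USER_RETURN_NOTIFY;
    }
}
```

In the race that follows, the task being switched back to has already tested its flags, so it returns to user space through the fast path without ever acting on the newly set flag.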


The race looks like this:

testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
jnz int_ret_from_sys_call_fixup/* Go the the slow path */

--- preempted here, switch to KVM guest ---

KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
happen to be a *32-bit* KVM guest, perhaps?

Now KVM schedules, calling __switch_to.  __switch_to sets
_TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
off interrupts, and do sysret.  We are now screwed.

I don't know why this manifests in this particular failure, but any
number of terrible things could happen now.

FWIW, this will affect things other than KVM.  For example, SIGKILL
sent while a process is sleeping in that two-instruction window won't
work.
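The window is a classic check-then-act race: the work flags are tested while interrupts are still enabled, so a flag that gets set between the testl and DISABLE_INTERRUPTS is never looked at before SYSRET. The toy C model below is hypothetical and single-CPU (remote CPUs are ignored). It runs the three exit steps in a given order and reports whether an interrupt arriving before a given step leaves work unhandled. Moving the test after DISABLE_INTERRUPTS, as Takashi's patch does, closes the window in this model.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code) of the v4.0 syscall exit fast
 * path in entry_64.S. The only way a work flag gets set is a local
 * interrupt/preemption, which can be taken only while IRQs are enabled. */
enum exit_step { STEP_TEST_FLAGS, STEP_DISABLE_IRQS, STEP_SYSRET };

/* Runs the three steps in `order`. If `irq_at` is the index of a step, an
 * interrupt arrives just before that step; it is taken (setting a work flag)
 * only if IRQs are still enabled. An interrupt arriving with IRQs off stays
 * pending until after SYSRET, where the normal return-to-user path handles
 * it, so it is ignored here. Returns true if SYSRET is reached with a work
 * flag that was set but never acted on. */
static bool exit_misses_work(const enum exit_step order[3], int irq_at)
{
    bool irqs_on = true;
    bool work_pending = false;

    for (int i = 0; i < 3; i++) {
        if (i == irq_at && irqs_on)
            work_pending = true;   /* e.g. TIF flag set during __switch_to */

        switch (order[i]) {
        case STEP_TEST_FLAGS:
            if (work_pending)
                return false;      /* jnz int_ret_from_sys_call_fixup */
            break;
        case STEP_DISABLE_IRQS:
            irqs_on = false;       /* DISABLE_INTERRUPTS(CLBR_NONE) */
            break;
        case STEP_SYSRET:
            return work_pending;   /* flag set but never handled */
        }
    }
    return false;
}
```

With the order { test, disable, sysret }, an interrupt arriving between the first two steps is the only case that returns true; with { disable, test, sysret } no arrival point does.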

Takashi, can you re-send your patch so we can review it for real in
light of this race?



 Shot-in-the-dark idea. At this code revision we did not yet
 store the user's %rsp in pt_regs->sp; we used a fixup to populate it:

 .macro FIXUP_TOP_OF_STACK tmp offset=0
 movq PER_CPU_VAR(old_rsp),\tmp
 movq \tmp,RSP+\offset(%rsp)

 (There are pending patches to fix this mess).

 If an interrupt interrupting *kernel code* would go into a code path
 which does FIXUP_TOP_OF_STACK, it'd overwrite the correct saved %rsp
 with a user's one. The iret from interrupt would work,
 but the resulting CPU state would be inconsistent. But I don't see
 such a code path from interrupts to FIXUP_TOP_OF_STACK...
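For reference, the two quoted instructions of FIXUP_TOP_OF_STACK amount to the following C. This is a hypothetical model: struct frame_model stands in for the saved pt_regs frame, only its RSP slot is modeled, and the function name is made up.

```c
#include <stdint.h>

/* Hypothetical stand-in for the saved register frame (pt_regs);
 * only the RSP slot is modeled. */
struct frame_model {
    uint64_t sp;
};

/* Models the two quoted instructions of FIXUP_TOP_OF_STACK:
 *   movq PER_CPU_VAR(old_rsp),\tmp
 *   movq \tmp,RSP+\offset(%rsp)
 * i.e. the user %rsp stashed in a per-CPU variable at syscall entry is
 * copied into the frame's RSP slot, whatever was there before. */
static void model_fixup_top_of_stack(struct frame_model *regs,
                                     uint64_t percpu_old_rsp)
{
    regs->sp = percpu_old_rsp;
}
```

Run against a frame saved by an interrupt taken in kernel mode, this would replace the correct kernel stack pointer with the user's, which is exactly the overwrite hypothesized above.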

I don't buy it.  Anything that does that is so completely broken that
I'd hope we'd have found it long ago.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Andy Lutomirski
On Mon, Mar 23, 2015 at 11:38 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 03/23/2015 02:22 PM, Takashi Iwai wrote:
 At Mon, 23 Mar 2015 10:35:41 +0100,
 Takashi Iwai wrote:

 At Mon, 23 Mar 2015 10:02:52 +0100,
 Takashi Iwai wrote:

 At Fri, 20 Mar 2015 19:16:53 +0100,
 Denys Vlasenko wrote:

 I'm really puzzled now.  We have a few pieces of information:

 - git bisection pointed the commit 96b6352c1271:
 x86_64, entry: Remove the syscall exit audit and schedule optimizations
   and reverting this fixes the problem indeed.  Even just moving two
   lines
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
   at the beginning of ret_from_sys_call already fixes.  (Of course I
   can't prove the fix but it stabilizes for a day without crash while
   usually I hit the bug in 10 minutes in full test running.)

 The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
 interrupt-disabled region to interrupt-enabled:

 cmpq $__NR_syscall_max,%rax
 ja ret_from_sys_call
 movq %r10,%rcx
 call *sys_call_table(,%rax,8)  # XXX:rip relative
 movq %rax,RAX-ARGOFFSET(%rsp)
 ret_from_sys_call:
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 
 jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 ...
 ...
 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call
 ...
 ...
 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF

 You reverted that by moving this insn to be after first 
 DISABLE_INTERRUPTS(CLBR_NONE).

 I also don't see how moving that check (even if it is wrong in a more
 benign way) can have such a drastic effect.

 I bet I see it.  I have the advantage of having stared at KVM code and
 cursed at it more recently than you, I suspect.  KVM does awful, awful
 things to CPU state, and, as an optimization, it allows kernel code to
 run with CPU state that would be totally invalid in user mode.  This
 happens through a bunch of hooks, including this bit in __switch_to:

 /*
  * Now maybe reload the debug registers and handle I/O bitmaps
  */
 if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
              task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
         __switch_to_xtra(prev_p, next_p, tss);

 IOW, we *change* tif during context switches.


 The race looks like this:

 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 jnz int_ret_from_sys_call_fixup/* Go the the slow path */

 --- preempted here, switch to KVM guest ---

 KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
 happen to be a *32-bit* KVM guest, perhaps?

 Now KVM schedules, calling __switch_to.  __switch_to sets
 _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
 off interrupts, and do sysret.  We are now screwed.

 I don't know why this manifests in this particular failure, but any
 number of terrible things could happen now.

 FWIW, this will affect things other than KVM.  For example, SIGKILL
 sent while a process is sleeping in that two-instruction window won't
 work.

 Takashi, can you re-send your patch so we can review it for real in
 light of this race?

Never mind, I'm testing a slightly fancier patch.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-23 Thread Takashi Iwai
At Mon, 23 Mar 2015 11:38:30 -0700,
Andy Lutomirski wrote:
 
 On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko dvlas...@redhat.com wrote:
  On 03/23/2015 02:22 PM, Takashi Iwai wrote:
  At Mon, 23 Mar 2015 10:35:41 +0100,
  Takashi Iwai wrote:
 
  At Mon, 23 Mar 2015 10:02:52 +0100,
  Takashi Iwai wrote:
 
  At Fri, 20 Mar 2015 19:16:53 +0100,
  Denys Vlasenko wrote:
 
  I'm really puzzled now.  We have a few pieces of information:
 
  - git bisection pointed the commit 96b6352c1271:
  x86_64, entry: Remove the syscall exit audit and schedule optimizations
and reverting this fixes the problem indeed.  Even just moving two
lines
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
at the beginning of ret_from_sys_call already fixes.  (Of course I
can't prove the fix but it stabilizes for a day without crash while
usually I hit the bug in 10 minutes in full test running.)
 
  The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
  interrupt-disabled region to interrupt-enabled:
 
  cmpq $__NR_syscall_max,%rax
  ja ret_from_sys_call
  movq %r10,%rcx
  call *sys_call_table(,%rax,8)  # XXX:rip relative
  movq %rax,RAX-ARGOFFSET(%rsp)
  ret_from_sys_call:
  testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  
  jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
  ...
  ...
  int_ret_from_sys_call_fixup:
  FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
  jmp int_ret_from_sys_call
  ...
  ...
  GLOBAL(int_ret_from_sys_call)
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
 
  You reverted that by moving this insn to be after first 
  DISABLE_INTERRUPTS(CLBR_NONE).
 
  I also don't see how moving that check (even if it is wrong in a more
  benign way) can have such a drastic effect.
 
 I bet I see it.  I have the advantage of having stared at KVM code and
 cursed at it more recently than you, I suspect.  KVM does awful, awful
 things to CPU state, and, as an optimization, it allows kernel code to
 run with CPU state that would be totally invalid in user mode.  This
 happens through a bunch of hooks, including this bit in __switch_to:
 
 /*
  * Now maybe reload the debug registers and handle I/O bitmaps
  */
 if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
              task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
         __switch_to_xtra(prev_p, next_p, tss);
 
 IOW, we *change* tif during context switches.
 
 
 The race looks like this:
 
 testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 jnz int_ret_from_sys_call_fixup/* Go the the slow path */
 
 --- preempted here, switch to KVM guest ---
 
 KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
 happen to be a *32-bit* KVM guest, perhaps?
 
 Now KVM schedules, calling __switch_to.  __switch_to sets
 _TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
 off interrupts, and do sysret.  We are now screwed.

Thanks for enlightening me!  That looks like a feasible scenario.
(I tested only a 64bit KVM guest, BTW.)

 I don't know why this manifests in this particular failure, but any
 number of terrible things could happen now.
 
 FWIW, this will affect things other than KVM.  For example, SIGKILL
 sent while a process is sleeping in that two-instruction window won't
 work.
 
 Takashi, can you re-send your patch so we can review it for real in
 light of this race?

The patch below worked.  I'll double-check tomorrow whether it
really cures the problem reliably.


thanks,

Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
-
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
+
CFI_REMEMBER_STATE
/*
 * sysretq will re-enable interrupts:


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-20 Thread Takashi Iwai
At Fri, 20 Mar 2015 19:16:53 +0100,
Denys Vlasenko wrote:
> 
> Takashi, are you willing to reproduce the panic one more time,
> with this patch? I would like to see whether oops messages
> are more informative with it.

Sure, I'll do it, but you'll have to wait until next Monday, as the
bug is triggered only on a machine in my office.  I checked my local
laptop, but it doesn't show the problem.

Maybe someone else can test it beforehand...


thanks,

Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-20 Thread Denys Vlasenko
Hi,

This particular crash was hard to diagnose for two reasons:

* The CPU would happily use a userspace RSP in kernel mode.
  The crash comes only later, when we run off the stack,
  so we lose the information about when it started.

* The kernel's error handling code is ill-prepared for RSP pointing
  to the user stack, so we take another page fault trying
  to dump the stack.
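The first problem comes down to noticing that an RSP sampled in kernel mode points into user space. Below is a minimal sketch of that classification in C, assuming x86_64 with 4-level paging and 4 KiB pages, where the kernel's TASK_SIZE_MAX is (1UL << 47) - PAGE_SIZE. This is not the patch posted here (its body is cut off in this archive); the MODEL_ names and rsp_points_to_user() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86_64, 4-level paging, 4 KiB pages. Mirrors the kernel's
 * TASK_SIZE_MAX = (1UL << 47) - PAGE_SIZE; the MODEL_ names are hypothetical. */
#define MODEL_PAGE_SIZE     4096ULL
#define MODEL_TASK_SIZE_MAX ((1ULL << 47) - MODEL_PAGE_SIZE)

/* True if a stack pointer sampled while in kernel mode lies in the user
 * half of the address space, i.e. the kernel is running on a user stack. */
static bool rsp_points_to_user(uint64_t rsp)
{
    return rsp < MODEL_TASK_SIZE_MAX;
}
```

All the RSP values in the crash reports in this thread (7ffd22c23f28, 7fff1d7e5ef8, 7ffc1fd0c3b0) fall below that limit, while a kernel stack address such as ffff880007593e28 (assumed here, restoring the ffff prefix that the archive appears to drop) does not.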

I prepared a patch which helps with both problems.

For testing, I inserted an invalid instruction right before SYSRET
to induce a similar bug, and booted the resulting kernel in qemu.

Before my patch, double fault output starts like this:

[0.715216] PANIC: double fault, error_code: 0x0
[0.716033] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #7
[0.716033] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[0.716033] task: 880007588000 ti: 88000759 task.ti: 88000759
[0.716033] RIP: 0010:[81017057]  [81017057] do_error_trap+0x47/0x120
[0.716033] RSP: 0018:7ffd89e7ffb8  EFLAGS: 00010006

The key problem here is that it doesn't show the RIP at which we took
the first "bad" exception. The only useful detail visible here is the
bad RSP; "do_error_trap+0x47" is useless.

After the patch, the very moment of the "bad" exception is caught:

[0.666758] Exception on user stack 7ffc1fd0c388: RSP: 0018:7ffc1fd0c3b0  EFLAGS: 00010006
[0.667285] RIP: 0010:[81793688]  [81793688] ret_from_sys_call+0x5f/0x67
[0.667285] PANIC: double fault, error_code: 0x
[0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
[0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[0.667285] task: 880007588000 ti: 88000759 task.ti: 88000759
[0.667285] RIP: 0010:[81793688]  [81793688] ret_from_sys_call+0x5f/0x67
[0.667285] RSP: 0018:7ffc1fd0c3b0  EFLAGS: 00010006

The exception happened at "ret_from_sys_call+0x5f".
We also no longer take another page fault;
the output proceeds like this:

...
[0.667285] RAX: 07a0 RBX: 7ffc1fd0c4e0 RCX: c101
[0.667285] RDX: 8800 RSI: 5401 RDI: 7ffc1fd0c388
[0.667285] RBP: 7ffc1fd0c570 R08: 0010 R09: 
[0.667285] R10: 7ffc1fd0c650 R11: 0202 R12: 0120
[0.667285] R13: 005f7b78 R14:  R15: 004c9d44
[0.667285] FS:  () GS:880007a0() knlGS:
[0.667285] CS:  0010 DS:  ES:  CR0: 8005003b
[0.667285] CR2: 004ad1e4 CR3: 00101000 CR4: 07f0
[0.667285] Stack:
[0.667285]  0018 7ffc1fd0c490 7ffc1fd0c3d0
[0.667285]    7ffc1fd0c490
[0.667285]    
[0.667285] Call Trace:
[0.667285]  <UNK>
[0.667285] Code: 8b 44 24 50 48 8b 54 24 60 48 8b 74 24 68 48 8b 7c 24 70 48 8b 8c 24 80 00 00 00 4c 8b 9c 24 90 00 00 00 48 8b a4 24 98 00 00 00 <0f> 0b 0f 01 f8 48 0f 07 48 c7 84 24 a0 00 00 00 2b 00 00 00 48
[0.667285] Kernel panic - not syncing: Machine halted.
[0.667285] CPU: 0 PID: 1 Comm: init Not tainted 4.0.0-rc2+ #13
[0.667285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[0.667285]   880007593e28 81789625 880007588000
[0.667285]  81a3b181 880007593ea8 817840aa 88000759
[0.667285]  0008 880007593eb8 880007593e58 0001
[0.667285] Call Trace:
[0.667285]  [81789625] dump_stack+0x4c/0x65
[0.667285]  [817840aa] panic+0xc6/0x1ff
[0.667285]  [81059ee5] df_debug+0x35/0x40
[0.667285]  [81017e37] do_double_fault+0x87/0x100
[0.667285]  [81017fb7] do_userpsace_rsp_in_kernel+0x107/0x140
[0.667285]  [81793688] ? ret_from_sys_call+0x5f/0x67
[0.667285]  [81795b49] userpsace_rsp_in_kernel+0x39/0x40
[0.667285]  [81793688] ? ret_from_sys_call+0x5f/0x67
[0.667285] Kernel Offset: disabled
[0.667285] Rebooting in 1 seconds..

Takashi, are you willing to reproduce the panic one more time,
with this patch? I would like to see whether oops messages
are more informative with it.



diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 4e49d7d..92a35e6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -70,6 +70,7 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
+dotraplinkage void do_userpsace_rsp_in_kernel(struct pt_regs *regs);
 asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0c91256..fb85c26 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -958,6 +958,12 @@ ENTRY(\sym)
INTR_FRAME
 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2015 at 8:51 AM, Takashi Iwai  wrote:
> At Thu, 19 Mar 2015 08:41:57 -0700,
> Andy Lutomirski wrote:
>>
>> On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai  wrote:
>> > At Thu, 19 Mar 2015 15:55:26 +0100,
>> > Takashi Iwai wrote:
>> >>
>> >> At Thu, 19 Mar 2015 14:47:12 +0100,
>> >> Takashi Iwai wrote:
>> >> >
>> >> > At Thu, 19 Mar 2015 13:48:56 +0100,
>> >> > Denys Vlasenko wrote:
>> >> > >
>> >> > > Having no more ideas at the moment, here is a tarball of 13 patches
>> >> > > of commits touching entry_64.S up to 4.0.0-rc1.
>> >> > >
>> >> > > x0001.patch is the latest, x0015.patch is the oldest.
>> >> > >
>> >> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
>> >> > > and 0008 does some PCI fixup.
>> >> > >
>> >> > > If this breakage is recent, it ought to be one of these.
>> >> > > Most of them do some non-trivial surgery.
>> >> > >
>> >> > > Even though I did not spot anything suspicious in them,
>> >> > > entry.S is notorious for subtle breakage.
>> >> > >
>> >> > > Try reverting them in sequence starting from x0001.patch
>> >> > > and see which revert makes the crash disappear.
>> >> >
>> >> > OK, I'm going to check these git series.
>> >>
>> >> Reverting the commit
>> >> 96b6352c12711d5c0bb7157f49c92580248e8146
>> >> x86_64, entry: Remove the syscall exit audit and schedule optimizations
>> >>
>> >> seems enough.  After reverting this one, the machine runs stable with
>> >> the kvm stress test.
>> >>
>> >> (I'll keep test running for a while; at the previous bisection, I hit
>> >>  the bug right after posting the mail ;)
>> >
>> > It survived long enough, so this looks like the spot.
>> >
>> > Also, I checked the patch below instead of reverting the commit, and
>> > it seems to work, too.
>> >
>> >
>> > Takashi
>> >
>> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> > index 1d74d161687c..5340ac7f88a9 100644
>> > --- a/arch/x86/kernel/entry_64.S
>> > +++ b/arch/x86/kernel/entry_64.S
>> > @@ -364,12 +364,12 @@ system_call_fastpath:
>> >   * Has incomplete stack frame and undefined top of stack.
>> >   */
>> >  ret_from_sys_call:
>> > -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> > -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> > -
>> > LOCKDEP_SYS_EXIT
>> > DISABLE_INTERRUPTS(CLBR_NONE)
>> > TRACE_IRQS_OFF
>> > +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> > +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> > +
>> > CFI_REMEMBER_STATE
>> > /*
>> >  * sysretq will re-enable interrupts:
>>
>> The crash you're seeing could certainly be caused by an IRQ at the
>> wrong time.  However:
>>
>> int_ret_from_sys_call_fixup:
>> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>> jmp int_ret_from_sys_call
>>
>> and
>>
>> GLOBAL(int_ret_from_sys_call)
>> DISABLE_INTERRUPTS(CLBR_NONE)
>> TRACE_IRQS_OFF
>>
>> so with or without your little patch, we're turning off IRQs very
>> quickly.  retint_swapgs also turns off interrupts before doing
>> anything.  So I don't see how your patch would have any effect.
>
> What about LOCKDEP_SYS_EXIT?
>

There's a LOCKDEP_SYS_EXIT_IRQ a few lines down in
int_ret_from_sys_call, and the syscall slow path falls through
directly to int_ret_from_sys_call.

I'm going to try to write a diagnostic patch now.  I have four
separate contractors coming starting half an hour ago*, so it might
take a while.

* Yeah, right.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 08:41:57 -0700,
Andy Lutomirski wrote:
> 
> On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai  wrote:
> > At Thu, 19 Mar 2015 15:55:26 +0100,
> > Takashi Iwai wrote:
> >>
> >> At Thu, 19 Mar 2015 14:47:12 +0100,
> >> Takashi Iwai wrote:
> >> >
> >> > At Thu, 19 Mar 2015 13:48:56 +0100,
> >> > Denys Vlasenko wrote:
> >> > >
> >> > > Having no more ideas at the moment, here is a tarball of 13 patches
> >> > > of commits touching entry_64.S up to 4.0.0-rc1.
> >> > >
> >> > > x0001.patch is the latest, x0015.patch is the oldest.
> >> > >
> >> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
> >> > > and 0008 does some PCI fixup.
> >> > >
> >> > > If this breakage is recent, it ought to be one of these.
> >> > > Most of them do some non-trivial surgery.
> >> > >
> >> > > Even though I did not spot anything suspicious in them,
> >> > > entry.S is notorious for subtle breakage.
> >> > >
> >> > > Try reverting them in sequence starting from x0001.patch
> >> > > and see which revert makes the crash disappear.
> >> >
> >> > OK, I'm going to check these git series.
> >>
> >> Reverting the commit
> >> 96b6352c12711d5c0bb7157f49c92580248e8146
> >> x86_64, entry: Remove the syscall exit audit and schedule optimizations
> >>
> >> seems enough.  After reverting this one, the machine runs stable with
> >> the kvm stress test.
> >>
> >> (I'll keep test running for a while; at the previous bisection, I hit
> >>  the bug right after posting the mail ;)
> >
> > It survived long enough, so this looks like the spot.
> >
> > Also, I checked the patch below instead of reverting the commit, and
> > it seems to work, too.
> >
> >
> > Takashi
> >
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index 1d74d161687c..5340ac7f88a9 100644
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -364,12 +364,12 @@ system_call_fastpath:
> >   * Has incomplete stack frame and undefined top of stack.
> >   */
> >  ret_from_sys_call:
> > -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> > -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> > -
> > LOCKDEP_SYS_EXIT
> > DISABLE_INTERRUPTS(CLBR_NONE)
> > TRACE_IRQS_OFF
> > +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> > +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> > +
> > CFI_REMEMBER_STATE
> > /*
> >  * sysretq will re-enable interrupts:
> 
> The crash you're seeing could certainly be caused by an IRQ at the
> wrong time.  However:
> 
> int_ret_from_sys_call_fixup:
> FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> jmp int_ret_from_sys_call
> 
> and
> 
> GLOBAL(int_ret_from_sys_call)
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> 
> so with or without your little patch, we're turning off IRQs very
> quickly.  retint_swapgs also turns off interrupts before doing
> anything.  So I don't see how your patch would have any effect.

What about LOCKDEP_SYS_EXIT?


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai  wrote:
> At Thu, 19 Mar 2015 15:55:26 +0100,
> Takashi Iwai wrote:
>>
>> At Thu, 19 Mar 2015 14:47:12 +0100,
>> Takashi Iwai wrote:
>> >
>> > At Thu, 19 Mar 2015 13:48:56 +0100,
>> > Denys Vlasenko wrote:
>> > >
>> > > Having no more ideas at the moment, here is a tarball of 13 patches
>> > > of commits touching entry_64.S up to 4.0.0-rc1.
>> > >
>> > > x0001.patch is the latest, x0015.patch is the oldest.
>> > >
>> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
>> > > and 0008 does some PCI fixup.
>> > >
>> > > If this breakage is recent, it ought to be one of these.
>> > > Most of them do some non-trivial surgery.
>> > >
>> > > Even though I did not spot anything suspicious in them,
>> > > entry.S is notorious for subtle breakage.
>> > >
>> > > Try reverting them in sequence starting from x0001.patch
>> > > and see which revert makes the crash disappear.
>> >
>> > OK, I'm going to check these git series.
>>
>> Reverting the commit
>> 96b6352c12711d5c0bb7157f49c92580248e8146
>> x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>
>> seems enough.  After reverting this one, the machine runs stable with
>> the kvm stress test.
>>
>> (I'll keep the test running for a while; at the previous bisection, I hit
>>  the bug right after posting the mail ;)
>
> It survived long enough, so this looks like the spot.
>
> Also, I checked the patch below instead of reverting the commit, and
> this seems to work, too.
>
>
> Takashi
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1d74d161687c..5340ac7f88a9 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -364,12 +364,12 @@ system_call_fastpath:
>   * Has incomplete stack frame and undefined top of stack.
>   */
>  ret_from_sys_call:
> -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> -
> LOCKDEP_SYS_EXIT
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> +
> CFI_REMEMBER_STATE
> /*
>  * sysretq will re-enable interrupts:

The crash you're seeing could certainly be caused by an IRQ at the
wrong time.  However:

int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
jmp int_ret_from_sys_call

and

GLOBAL(int_ret_from_sys_call)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF

so with or without your little patch, we're turning off IRQs very
quickly.  retint_swapgs also turns off interrupts before doing
anything.  So I don't see how your patch would have any effect.

I'm starting to wonder if the problem has something to do with running
fire_user_return_notifiers with IRQs on.  We appear to do that, and it
seems rather questionable to me that it's safe, given the sneaky
things that KVM does in there.

If we end up in user mode with a bad MSR_SYSCALL_MASK, we could see
your crash, although I don't see how that would happen either.

I'll try to write a diagnostic patch later this morning.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 15:55:26 +0100,
Takashi Iwai wrote:
> 
> At Thu, 19 Mar 2015 14:47:12 +0100,
> Takashi Iwai wrote:
> > 
> > At Thu, 19 Mar 2015 13:48:56 +0100,
> > Denys Vlasenko wrote:
> > > 
> > > Having no more ideas at the moment, here is a tarball of 13 patches
> > > of commits touching entry_64.S up to 4.0.0-rc1.
> > > 
> > > x0001.patch is the latest, x0015.patch is the oldest.
> > > 
> > > Patches 0003 and 0008 are not there since 0003 is an empty merge patch
> > > and 0008 does some PCI fixup.
> > > 
> > > If this breakage is recent, it ought to be one of these.
> > > Most of them do some non-trivial surgery.
> > > 
> > > Even though I did not spot anything suspicious in them,
> > > entry.S is notorious for subtle breakage.
> > > 
> > > Try reverting them in sequence starting from x0001.patch
> > > and see which revert makes the crash disappear.
> > 
> > OK, I'm going to check these git series.
> 
> Reverting the commit
> 96b6352c12711d5c0bb7157f49c92580248e8146
> x86_64, entry: Remove the syscall exit audit and schedule optimizations
> 
> seems enough.  After reverting this one, the machine runs stable with
> the kvm stress test.
> 
> (I'll keep the test running for a while; at the previous bisection, I hit
>  the bug right after posting the mail ;)

It survived long enough, so this looks like the spot.

Also, I checked the patch below instead of reverting the commit, and
this seems to work, too.


Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
-
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
+
CFI_REMEMBER_STATE
/*
 * sysretq will re-enable interrupts:


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 14:47:12 +0100,
Takashi Iwai wrote:
> 
> At Thu, 19 Mar 2015 13:48:56 +0100,
> Denys Vlasenko wrote:
> > 
> > Having no more ideas at the moment, here is a tarball of 13 patches
> > of commits touching entry_64.S up to 4.0.0-rc1.
> > 
> > x0001.patch is the latest, x0015.patch is the oldest.
> > 
> > Patches 0003 and 0008 are not there since 0003 is an empty merge patch
> > and 0008 does some PCI fixup.
> > 
> > If this breakage is recent, it ought to be one of these.
> > Most of them do some non-trivial surgery.
> > 
> > Even though I did not spot anything suspicious in them,
> > entry.S is notorious for subtle breakage.
> > 
> > Try reverting them in sequence starting from x0001.patch
> > and see which revert makes the crash disappear.
> 
> OK, I'm going to check these git series.

Reverting the commit
96b6352c12711d5c0bb7157f49c92580248e8146
x86_64, entry: Remove the syscall exit audit and schedule optimizations

seems enough.  After reverting this one, the machine runs stable with
the kvm stress test.

(I'll keep the test running for a while; at the previous bisection, I hit
 the bug right after posting the mail ;)

BTW, I also tried to reproduce this on another machine (a Haswell
laptop), but failed, even with the very same kernel.  So the bug
really seems to depend on the CPU.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 13:48:56 +0100,
Denys Vlasenko wrote:
> 
> Having no more ideas at the moment, here is a tarball of 13 patches
> of commits touching entry_64.S up to 4.0.0-rc1.
> 
> x0001.patch is the latest, x0015.patch is the oldest.
> 
> Patches 0003 and 0008 are not there since 0003 is an empty merge patch
> and 0008 does some PCI fixup.
> 
> If this breakage is recent, it ought to be one of these.
> Most of them do some non-trivial surgery.
> 
> Even though I did not spot anything suspicious in them,
> entry.S is notorious for subtle breakage.
> 
> Try reverting them in sequence starting from x0001.patch
> and see which revert makes the crash disappear.

OK, I'm going to check these git series.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
> >>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>>> takes some fault.
>>
>> Yes, so far we happily thought that SYSRET never fails...
>>
>> This merits adding some code which would at least BUG_ON
>> if the faulting address is seen to match SYSRET64.
> 
> sysret64 can only fail with #GP, and we're totally screwed if that
> happens, although I agree about the BUG_ON in principle.  Where would
> we add it that would help in this case, though?  We never even made it
> to C code.

I propose widening such a check to catch any case where
we enter an exception handler from CPL0 and find that our RSP
is bad. This will cover the case of a faulting SYSRET and possible
future obscure bugs.

This patch stops the CPU dead if we find ourselves
with a userspace RSP (not the saved RSP, but the _actual_ %RSP register)
in an exception handler prologue:

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index a0a3a6e..53a34ba 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -930,6 +930,12 @@ ENTRY(\sym)
INTR_FRAME
.endif

+   testq %rsp,%rsp
+   /* If RSP is positive, we are in kernel but have userspace RSP. */
+   /* We corrupted user stack already by storing iret frame there. */
+   /* This is supposed to be impossible. */
+0: jns 0b
+
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME


Hopefully the NMI watchdog will then kill it, and we'll get better data.


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
Having no more ideas at the moment, here is a tarball of 13 patches
of commits touching entry_64.S up to 4.0.0-rc1.

x0001.patch is the latest, x0015.patch is the oldest.

Patches 0003 and 0008 are not there since 0003 is an empty merge patch
and 0008 does some PCI fixup.

If this breakage is recent, it ought to be one of these.
Most of them do some non-trivial surgery.

Even though I did not spot anything suspicious in them,
entry.S is notorious for subtle breakage.

Try reverting them in sequence starting from x0001.patch
and see which revert makes the crash disappear.


revert_me_13.tar.gz
Description: GNU Zip compressed data


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 11:58:19 +0100,
Denys Vlasenko wrote:
> 
> On 03/19/2015 11:16 AM, Takashi Iwai wrote:
> > The kconfig is attached
> 
> You also have PARAVIRT enabled, like Stefan.
> 
> Just to obtain an additional data point, can you guys
> try reproducing it with PARAVIRT off?
> 
> It won't help us that much if it doesn't trigger with PARAVIRT off
> (the bug may just become much harder to trigger), but if it does
> still happen, that'd reduce the number of things we can suspect.

I tried w/o PARAVIRT and the bug is still seen.  The dmesg is attached
below.


Takashi

[0.00] CPU0 microcode updated early to revision 0x1b, date = 2014-05-29
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.0-rc4-testz+ (tiwai@alsa1) (gcc version 4.8.3 
20141208 [gcc-4_8-branch revision 218481] (SUSE Linux) ) #126 SMP PREEMPT Thu 
Mar 19 12:07:56 CET 2015
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.0.0-rc4-testz+ 
root=UUID=1190c997-9457-4dde-8a57-0cce0aae93c6 
resume=/dev/disk/by-id/ata-INTEL_SSDSA2M080G2GN_CVPO9412011S080BGN-part1 
splash=silent quiet showopts crashkernel=512M-:256M
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1fff] usable
[0.00] BIOS-e820: [mem 0x2000-0x201f] reserved
[0.00] BIOS-e820: [mem 0x2020-0x40003fff] usable
[0.00] BIOS-e820: [mem 0x40004000-0x40004fff] reserved
[0.00] BIOS-e820: [mem 0x40005000-0xd6709fff] usable
[0.00] BIOS-e820: [mem 0xd670a000-0xd67f] reserved
[0.00] BIOS-e820: [mem 0xd680-0xd6f55fff] usable
[0.00] BIOS-e820: [mem 0xd6f56000-0xd6ff] reserved
[0.00] BIOS-e820: [mem 0xd700-0xd77b3fff] usable
[0.00] BIOS-e820: [mem 0xd77b4000-0xd77f] ACPI data
[0.00] BIOS-e820: [mem 0xd780-0xd8f1dfff] usable
[0.00] BIOS-e820: [mem 0xd8f1e000-0xd8ff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd900-0xda6e2fff] usable
[0.00] BIOS-e820: [mem 0xda6e3000-0xda8e1fff] reserved
[0.00] BIOS-e820: [mem 0xda8e2000-0xda924fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xda925000-0xdaff] usable
[0.00] BIOS-e820: [mem 0xdb80-0xdf9f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00021e5f] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x21e600 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-D3FFF write-protect
[0.00]   D4000-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask E write-back
[0.00]   1 base 2 mask FE000 write-back
[0.00]   2 base 0E000 mask FE000 uncachable
[0.00]   3 base 0DC00 mask FFC00 uncachable
[0.00]   4 base 0DB80 mask FFF80 uncachable
[0.00]   5 base 21F00 mask FFF00 uncachable
[0.00]   6 base 21E80 mask FFF80 uncachable
[0.00]   7 base 21E60 mask FFFE0 uncachable
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] e820: update [mem 0xdb80-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xdb000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fda40-0x000fda4f] mapped at 
[880fda40]
[   

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
On 03/19/2015 11:16 AM, Takashi Iwai wrote:
> The kconfig is attached

You also have PARAVIRT enabled, like Stefan.

Just to obtain an additional data point, can you guys
try reproducing it with PARAVIRT off?

It won't help us that much if it doesn't trigger with PARAVIRT off
(the bug may just become much harder to trigger), but if it does
still happen, that'd reduce the number of things we can suspect.


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
Hi,

sorry for taking so long to get back to this topic.

At Wed, 18 Mar 2015 15:29:14 -0700,
Andy Lutomirski wrote:
> 
> On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina <jkos...@suse.cz> wrote:
> > On Wed, 18 Mar 2015, Andy Lutomirski wrote:
> >
> >> sysret64 can only fail with #GP, and we're totally screwed if that
> >> happens,
> >
> > But what if the GPF handler pagefaults afterwards? It'd be operating on
> > user stack already.
> 
> Good point.
> 
> Stefan, can you try changing the first "jne
> opportunistic_sysret_failed" to "jmp opportunistic_sysret_failed" in
> entry_64.S and seeing if you can reproduce this?  (Is it easy enough
> to reproduce that this would tell us anything?)

I tried this, and the same crash still happens.

On my machine (a Dell desktop with a 4-core Ivy Bridge CPU and 8GB RAM),
I could reproduce it relatively easily.  Start a desktop session as
usual, start a KVM guest with 1GB of memory and 4 CPUs, and start
compiling a kernel in the guest with make -j4.  Meanwhile, start
compiling a kernel with make -j8 on the host, too.  So nothing too
special there.  The kconfig is attached below.

Currently I haven't set up kdump on this machine because of limited
disk space.  I'll try to sort that out now.


Takashi



.config
Description: Binary data


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Stefan Seyfried
Good Morning :-)

On 19.03.2015 at 01:57, Andy Lutomirski wrote:

> Stefan, do you happen to know whether your disassembly of page_fault
> came from the instructions in memory or if they came from the vmlinux
> file?  Not that I have any relevant ideas there.

I think they came from memory. At least, the disassembly in crash...
crash> disassemble page_fault
Dump of assembler code for function page_fault:
   0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
   0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
   0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
   0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
   0xffffffff816834ad <+13>:    callq  0xffffffff81683620 <error_entry>
   0xffffffff816834b2 <+18>:    mov    %rsp,%rdi
   0xffffffff816834b5 <+21>:    mov    0x78(%rsp),%rsi
   0xffffffff816834ba <+26>:    movq   $0xffffffffffffffff,0x78(%rsp)
   0xffffffff816834c3 <+35>:    callq  0xffffffff810504e0 <do_page_fault>
   0xffffffff816834c8 <+40>:    jmpq   0xffffffff816836d0 <error_exit>
End of assembler dump.

...is different than the one from loading vmlinux in gdb:

Reading symbols from vmlinux-4.0.0-rc3-2.gd5c547f-desktop...done.
Reading symbols from /usr/lib/debug/boot/vmlinux-4.0.0-rc3-2.gd5c547f-desktop.debug...done.
(gdb) disassemble page_fault
Dump of assembler code for function page_fault:
   0xffffffff816834a0 <+0>:     data16 xchg %ax,%ax
   0xffffffff816834a3 <+3>:     callq  *0x7a5b07(%rip)        # 0xffffffff81e28fb0 <pv_irq_ops+48>
   0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
   0xffffffff816834ad <+13>:    callq  0xffffffff81683620 <error_entry>
   0xffffffff816834b2 <+18>:    mov    %rsp,%rdi
   0xffffffff816834b5 <+21>:    mov    0x78(%rsp),%rsi
   0xffffffff816834ba <+26>:    movq   $0xffffffffffffffff,0x78(%rsp)
   0xffffffff816834c3 <+35>:    callq  0xffffffff810504e0 <do_page_fault>
   0xffffffff816834c8 <+40>:    jmpq   0xffffffff816836d0 <error_exit>
End of assembler dump.

Best regards,

Stefan
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
Managing Director: Ralph Dehner / Registered office: Vohburg / Registry court: Ingolstadt, HRB 3537
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Stefan Seyfried
Good Morning :-)

Am 19.03.2015 um 01:57 schrieb Andy Lutomirski:

 Stefan, do you happen to know whether your disassembly of page_fault
 came from the instructions in memory or if they came from the vmlinux
 file?  Not that I have any relevant ideas there.

I think they came from memory. At least, the disassemble in crash...
crash disassemble page_fault
Dump of assembler code for function page_fault:
   0x816834a0 +0: data32 xchg %ax,%ax
   0x816834a3 +3: data32 xchg %ax,%ax
   0x816834a6 +6: data32 xchg %ax,%ax
   0x816834a9 +9: sub$0x78,%rsp
   0x816834ad +13:callq  0x81683620 error_entry
   0x816834b2 +18:mov%rsp,%rdi
   0x816834b5 +21:mov0x78(%rsp),%rsi
   0x816834ba +26:movq   $0x,0x78(%rsp)
   0x816834c3 +35:callq  0x810504e0 do_page_fault
   0x816834c8 +40:jmpq   0x816836d0 error_exit
End of assembler dump.

...is different than the one from loading vmlinux in gdb:

Reading symbols from vmlinux-4.0.0-rc3-2.gd5c547f-desktop...done.
Reading symbols from 
/usr/lib/debug/boot/vmlinux-4.0.0-rc3-2.gd5c547f-desktop.debug...done.
(gdb) disassemble page_fault
Dump of assembler code for function page_fault:
   0x816834a0 +0: data16 xchg %ax,%ax
   0x816834a3 +3: callq  *0x7a5b07(%rip)# 
0x81e28fb0 pv_irq_ops+48
   0x816834a9 +9: sub$0x78,%rsp
   0x816834ad +13:callq  0x81683620 error_entry
   0x816834b2 +18:mov%rsp,%rdi
   0x816834b5 +21:mov0x78(%rsp),%rsi
   0x816834ba +26:movq   $0x,0x78(%rsp)
   0x816834c3 +35:callq  0x810504e0 do_page_fault
   0x816834c8 +40:jmpq   0x816836d0 error_exit
End of assembler dump.

Best regards,

Stefan
-- 
Stefan Seyfried
Linux Consultant  Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
Hi,

sorry to take time to back to this topic.

At Wed, 18 Mar 2015 15:29:14 -0700,
Andy Lutomirski wrote:
 
 On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina jkos...@suse.cz wrote:
  On Wed, 18 Mar 2015, Andy Lutomirski wrote:
 
  sysret64 can only fail with #GP, and we're totally screwed if that
  happens,
 
  But what if the GPF handler pagefaults afterwards? It'd be operating on
  user stack already.
 
 Good point.
 
 Stefan, can you try changing the first jne
 opportunistic_sysret_failed to jmp opportunistic_sysret_failed in
 entry_64.S and seeing if you can reproduce this?  (Is it easy enough
 to reproduce that this would tell us anything?)

I tried this, and the same crash still happens.

On my machine (a Dell desktop with IvyBridge 4-core, 8GB RAM), I could
reproduce it relatively easily.  Start a desktop session as usual, and
start a KVM with 1GB memory 4 CPU, and start compiling a kernel on VM
with make -j4.  Meanwhile, start compiling a kernel with make -j8 on
the host, too.  So nothing too special there.  The kconfig is attached
below.

Currently I haven't set up kdump for this machine due to the disk
space.  Will try to adjust somehow from now on.


Takashi



.config
Description: Binary data


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
Having no more ideas at the moment, here is a tarball of 13 patches
of commits touching entry_64.S up to 4.0.0-rc1.

x0001.patch is the latest, x0015.patch is the oldest.

Patches 0003 and 0008 are not there since 0003 is empty merge patch
and 0008 does some PCI fixup.

If this breakage is recent, it ought to be one of these.
Most of them do some non-trivial surgery.

Even though I did not spot anything suspicious in them,
entry.S is notorious for subtle breakage.

Try reverting them in sequence starting from x0001.patch
and see reverting which one makes crash disappear.


revert_me_13.tar.gz
Description: GNU Zip compressed data


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
On 03/19/2015 11:16 AM, Takashi Iwai wrote:
 The kconfig is attached

You also have PARAVIRT enabled, like Stefan.

Just to obtain an additional data point, can you guys
try reproducing it with PARAVIRT off?

It won't help us that much if it won't trigger with PARAVIRT off
(the bug may just become much harder to trigger), but if it would
still happen, that'd reduce the number of things we can suspect.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 11:58:19 +0100,
Denys Vlasenko wrote:
 
 On 03/19/2015 11:16 AM, Takashi Iwai wrote:
  The kconfig is attached
 
 You also have PARAVIRT enabled, like Stefan.
 
 Just to obtain an additional data point, can you guys
 try reproducing it with PARAVIRT off?
 
 It won't help us that much if it won't trigger with PARAVIRT off
 (the bug may just become much harder to trigger), but if it would
 still happen, that'd reduce the number of things we can suspect.

I tried w/o PARAVIRT and the bug is still seen.  The dmesg is attached
below.


Takashi

[0.00] CPU0 microcode updated early to revision 0x1b, date = 2014-05-29
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.0-rc4-testz+ (tiwai@alsa1) (gcc version 4.8.3 
20141208 [gcc-4_8-branch revision 218481] (SUSE Linux) ) #126 SMP PREEMPT Thu 
Mar 19 12:07:56 CET 2015
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.0.0-rc4-testz+ 
root=UUID=1190c997-9457-4dde-8a57-0cce0aae93c6 
resume=/dev/disk/by-id/ata-INTEL_SSDSA2M080G2GN_CVPO9412011S080BGN-part1 
splash=silent quiet showopts crashkernel=512M-:256M
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable
[0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1fff] usable
[0.00] BIOS-e820: [mem 0x2000-0x201f] reserved
[0.00] BIOS-e820: [mem 0x2020-0x40003fff] usable
[0.00] BIOS-e820: [mem 0x40004000-0x40004fff] reserved
[0.00] BIOS-e820: [mem 0x40005000-0xd6709fff] usable
[0.00] BIOS-e820: [mem 0xd670a000-0xd67f] reserved
[0.00] BIOS-e820: [mem 0xd680-0xd6f55fff] usable
[0.00] BIOS-e820: [mem 0xd6f56000-0xd6ff] reserved
[0.00] BIOS-e820: [mem 0xd700-0xd77b3fff] usable
[0.00] BIOS-e820: [mem 0xd77b4000-0xd77f] ACPI data
[0.00] BIOS-e820: [mem 0xd780-0xd8f1dfff] usable
[0.00] BIOS-e820: [mem 0xd8f1e000-0xd8ff] ACPI NVS
[0.00] BIOS-e820: [mem 0xd900-0xda6e2fff] usable
[0.00] BIOS-e820: [mem 0xda6e3000-0xda8e1fff] reserved
[0.00] BIOS-e820: [mem 0xda8e2000-0xda924fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xda925000-0xdaff] usable
[0.00] BIOS-e820: [mem 0xdb80-0xdf9f] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed03fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00021e5f] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.7 present.
[0.00] DMI: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[0.00] e820: update [mem 0x-0x0fff] usable == reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x21e600 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-D3FFF write-protect
[0.00]   D4000-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask E write-back
[0.00]   1 base 2 mask FE000 write-back
[0.00]   2 base 0E000 mask FE000 uncachable
[0.00]   3 base 0DC00 mask FFC00 uncachable
[0.00]   4 base 0DB80 mask FFF80 uncachable
[0.00]   5 base 21F00 mask FFF00 uncachable
[0.00]   6 base 21E80 mask FFF80 uncachable
[0.00]   7 base 21E60 mask FFFE0 uncachable
[0.00]   8 disabled
[0.00]   9 disabled
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] e820: update [mem 0xdb80-0x] usable == reserved
[0.00] e820: last_pfn = 0xdb000 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000fda40-0x000fda4f] mapped at 
[880fda40]
[0.00] 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 13:48:56 +0100,
Denys Vlasenko wrote:
 
 Having no more ideas at the moment, here is a tarball of 13 patches
 of commits touching entry_64.S up to 4.0.0-rc1.
 
 x0001.patch is the latest, x0015.patch is the oldest.
 
 Patches 0003 and 0008 are not there since 0003 is empty merge patch
 and 0008 does some PCI fixup.
 
 If this breakage is recent, it ought to be one of these.
 Most of them do some non-trivial surgery.
 
 Even though I did not spot anything suspicious in them,
 entry.S is notorious for subtle breakage.
 
 Try reverting them in sequence starting from x0001.patch
 and see reverting which one makes crash disappear.

OK, I'm going to check these git series.


Takashi
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Denys Vlasenko
On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
 On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko dvlas...@redhat.com wrote:
 in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
 kernel with a user stack poiinter, maybe we're *exiting* the kernel,
 and have just reloaded the user stack pointer when USERGS_SYSRET64
 takes some fault.

 Yes, so far we happily thought that SYSRET never fails...

 This merits adding some code which would at least BUG_ON
 if the faulting address is seen to match SYSRET64.
 
 sysret64 can only fail with #GP, and we're totally screwed if that
 happens, although I agree about the BUG_ON in principle.  Where would
 we add it that would help in this case, though?  We never even made it
 to C code.

I propose to widen such check to catch any cases where
we enter an exception from CPL0 and find that our RSP
is bad. This will cover the case of faulting SYSRET and possible
future obscure bugs.

What this patch does is it stops CPU dead if we find itself
with userspace RSP (not saved RSP, but _actual_ %RSP register)
in an exception handler prologue:

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index a0a3a6e..53a34ba 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -930,6 +930,12 @@ ENTRY(\sym)
INTR_FRAME
.endif

+   testq %rsp,%rsp
+   /* If RSP is positive, we are in kernel but have userspace RSP. */
+   /* We corrupted user stack already by storing iret frame there. */
+   /* This is supposed to be impossible. */
+0: jns 0b
+
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME


Hopefully then NMI watchdog will kill it, and we'll get better data.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 14:47:12 +0100,
Takashi Iwai wrote:
 
 At Thu, 19 Mar 2015 13:48:56 +0100,
 Denys Vlasenko wrote:
  
  Having no more ideas at the moment, here is a tarball of 13 patches
  of commits touching entry_64.S up to 4.0.0-rc1.
  
  x0001.patch is the latest, x0015.patch is the oldest.
  
  Patches 0003 and 0008 are not there since 0003 is empty merge patch
  and 0008 does some PCI fixup.
  
  If this breakage is recent, it ought to be one of these.
  Most of them do some non-trivial surgery.
  
  Even though I did not spot anything suspicious in them,
  entry.S is notorious for subtle breakage.
  
  Try reverting them in sequence starting from x0001.patch
  and see reverting which one makes crash disappear.
 
 OK, I'm going to check these git series.

Reverting the commit
96b6352c12711d5c0bb7157f49c92580248e8146
x86_64, entry: Remove the syscall exit audit and schedule optimizations

seems enough.  After reverting this one, the machine runs stable with
the kvm stress test.

(I'll keep test running for a while; at the previous bisection, I hit
 the bug right after posting the mail ;)

BTW, I also tried to reproduce this on another machine (a Haswell
laptop), but failed, even with the very same kernel.  So the bug
really seems to depend on the CPU.


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2015 at 8:51 AM, Takashi Iwai ti...@suse.de wrote:
 At Thu, 19 Mar 2015 08:41:57 -0700,
 Andy Lutomirski wrote:

 On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai ti...@suse.de wrote:
  At Thu, 19 Mar 2015 15:55:26 +0100,
  Takashi Iwai wrote:
 
  At Thu, 19 Mar 2015 14:47:12 +0100,
  Takashi Iwai wrote:
  
   At Thu, 19 Mar 2015 13:48:56 +0100,
   Denys Vlasenko wrote:
   
Having no more ideas at the moment, here is a tarball of 13 patches
of commits touching entry_64.S up to 4.0.0-rc1.
   
x0001.patch is the latest, x0015.patch is the oldest.
   
Patches 0003 and 0008 are not there since 0003 is empty merge patch
and 0008 does some PCI fixup.
   
If this breakage is recent, it ought to be one of these.
Most of them do some non-trivial surgery.
   
Even though I did not spot anything suspicious in them,
entry.S is notorious for subtle breakage.
   
Try reverting them in sequence starting from x0001.patch
and see reverting which one makes crash disappear.
  
   OK, I'm going to check these git series.
 
  Reverting the commit
  96b6352c12711d5c0bb7157f49c92580248e8146
  x86_64, entry: Remove the syscall exit audit and schedule 
  optimizations
 
  seems enough.  After reverting this one, the machine runs stable with
  the kvm stress test.
 
  (I'll keep test running for a while; at the previous bisection, I hit
   the bug right after posting the mail ;)
 
  It survived long enough, so this looks like the spot.
 
  Also, I checked the patch below instead of reverting the commit, and
  this seems to work, too.
 
 
  Takashi
 
  diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
  index 1d74d161687c..5340ac7f88a9 100644
  --- a/arch/x86/kernel/entry_64.S
  +++ b/arch/x86/kernel/entry_64.S
  @@ -364,12 +364,12 @@ system_call_fastpath:
* Has incomplete stack frame and undefined top of stack.
*/
   ret_from_sys_call:
  -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  -
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
  +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  +
  CFI_REMEMBER_STATE
  /*
   * sysretq will re-enable interrupts:

 The crash you're seeing could certainly be caused by an IRQ at the
 wrong time.  However:

 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call

 and

 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF

 so with or without your little patch, we're turning off IRQs very
 quickly.  retint_swapgs also turns off interrupts before doing
 anything.  So I don't see how your patch would have any effect.

 What about LOCKDEP_SYS_EXIT?


There's a LOCKDEP_SYS_EXIT_IRQ a few lines down in
int_ret_from_sys_call, and the syscall slow path falls through
directly to int_ret_from_sys_call.

I'm going to try to write a diagnostic patch now.  I have four
separate contractors coming starting half an hour ago*, so it might
take a while.

* Yeah, right.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 15:55:26 +0100,
Takashi Iwai wrote:
 
 At Thu, 19 Mar 2015 14:47:12 +0100,
 Takashi Iwai wrote:
  
  At Thu, 19 Mar 2015 13:48:56 +0100,
  Denys Vlasenko wrote:
   
   Having no more ideas at the moment, here is a tarball of 13 patches
   of commits touching entry_64.S up to 4.0.0-rc1.
   
   x0001.patch is the latest, x0015.patch is the oldest.
   
   Patches 0003 and 0008 are not there since 0003 is empty merge patch
   and 0008 does some PCI fixup.
   
   If this breakage is recent, it ought to be one of these.
   Most of them do some non-trivial surgery.
   
   Even though I did not spot anything suspicious in them,
   entry.S is notorious for subtle breakage.
   
   Try reverting them in sequence starting from x0001.patch
   and see reverting which one makes crash disappear.
  
  OK, I'm going to check these git series.
 
 Reverting the commit
 96b6352c12711d5c0bb7157f49c92580248e8146
 x86_64, entry: Remove the syscall exit audit and schedule optimizations
 
 seems enough.  After reverting this one, the machine runs stable with
 the kvm stress test.
 
 (I'll keep test running for a while; at the previous bisection, I hit
  the bug right after posting the mail ;)

It survived long enough, so this looks like the spot.

Also, I checked the patch below instead of reverting the commit, and
this seems to work, too.


Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
-
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
+
CFI_REMEMBER_STATE
/*
 * sysretq will re-enable interrupts:
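Takashi's patch moves the _TIF_ALLWORK_MASK test below DISABLE_INTERRUPTS. That ordering guards against a generic check-then-act window. If the flags are tested while interrupts are still enabled, an interrupt that sets a work flag after the test, but before the return, goes unnoticed by the fast path. The toy C model below only illustrates that ordering argument, under stated assumptions: a single hypothetical work flag, and an interrupt that can arrive only while interrupts are enabled. It is not kernel code. It does not claim this is the actual cause of the crash, which Andy questions in his reply.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one hypothetical work flag that an
 * interrupt handler may set. */
struct cpu_model {
    bool irqs_enabled;
    bool work_pending;
};

/* An interrupt can only be delivered while interrupts are enabled. */
static void maybe_interrupt(struct cpu_model *c, bool irq_arrives)
{
    if (irq_arrives && c->irqs_enabled)
        c->work_pending = true;
}

/* Returns true if the fast path returns to user space while work is
 * pending ("missed work"). check_first selects the pre-patch order:
 * test the flags, then disable interrupts. */
static bool fast_path_misses_work(bool check_first, bool irq_in_window)
{
    struct cpu_model c = { .irqs_enabled = true, .work_pending = false };
    bool saw_work;

    if (check_first) {
        saw_work = c.work_pending;          /* test TI_flags */
        maybe_interrupt(&c, irq_in_window); /* window: IRQs still on */
        c.irqs_enabled = false;             /* DISABLE_INTERRUPTS */
    } else {
        maybe_interrupt(&c, irq_in_window); /* IRQ lands before cli */
        c.irqs_enabled = false;             /* DISABLE_INTERRUPTS */
        saw_work = c.work_pending;          /* test TI_flags, IRQs off */
    }
    /* The fast path is taken only if no work was seen. */
    return !saw_work && c.work_pending;
}
```

With the pre-patch order, fast_path_misses_work(true, true) reports a missed flag. With the patched order, the same interrupt is always observed by the test.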


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Takashi Iwai
At Thu, 19 Mar 2015 08:41:57 -0700,
Andy Lutomirski wrote:
 
 On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai ti...@suse.de wrote:
  At Thu, 19 Mar 2015 15:55:26 +0100,
  Takashi Iwai wrote:
 
  At Thu, 19 Mar 2015 14:47:12 +0100,
  Takashi Iwai wrote:
  
   At Thu, 19 Mar 2015 13:48:56 +0100,
   Denys Vlasenko wrote:
   
Having no more ideas at the moment, here is a tarball of 13 patches
of commits touching entry_64.S up to 4.0.0-rc1.
   
x0001.patch is the latest, x0015.patch is the oldest.
   
Patches 0003 and 0008 are not there since 0003 is empty merge patch
and 0008 does some PCI fixup.
   
If this breakage is recent, it ought to be one of these.
Most of them do some non-trivial surgery.
   
Even though I did not spot anything suspicious in them,
entry.S is notorious for subtle breakage.
   
Try reverting them in sequence starting from x0001.patch
and see reverting which one makes crash disappear.
  
   OK, I'm going to check these git series.
 
  Reverting the commit
  96b6352c12711d5c0bb7157f49c92580248e8146
  x86_64, entry: Remove the syscall exit audit and schedule optimizations
 
  seems enough.  After reverting this one, the machine runs stable with
  the kvm stress test.
 
  (I'll keep test running for a while; at the previous bisection, I hit
   the bug right after posting the mail ;)
 
  It survived long enough, so this looks like the spot.
 
  Also, I checked the patch below instead of reverting the commit, and
  this seems to work, too.
 
 
  Takashi
 
  diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
  index 1d74d161687c..5340ac7f88a9 100644
  --- a/arch/x86/kernel/entry_64.S
  +++ b/arch/x86/kernel/entry_64.S
  @@ -364,12 +364,12 @@ system_call_fastpath:
* Has incomplete stack frame and undefined top of stack.
*/
   ret_from_sys_call:
  -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  -
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_NONE)
  TRACE_IRQS_OFF
  +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
  +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
  +
  CFI_REMEMBER_STATE
  /*
   * sysretq will re-enable interrupts:
 
 The crash you're seeing could certainly be caused by an IRQ at the
 wrong time.  However:
 
 int_ret_from_sys_call_fixup:
 FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
 jmp int_ret_from_sys_call
 
 and
 
 GLOBAL(int_ret_from_sys_call)
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 
 so with or without your little patch, we're turning off IRQs very
 quickly.  retint_swapgs also turns off interrupts before doing
 anything.  So I don't see how your patch would have any effect.

What about LOCKDEP_SYS_EXIT?


Takashi


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-19 Thread Andy Lutomirski
On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai ti...@suse.de wrote:
 At Thu, 19 Mar 2015 15:55:26 +0100,
 Takashi Iwai wrote:

 At Thu, 19 Mar 2015 14:47:12 +0100,
 Takashi Iwai wrote:
 
  At Thu, 19 Mar 2015 13:48:56 +0100,
  Denys Vlasenko wrote:
  
   Having no more ideas at the moment, here is a tarball of 13 patches
   of commits touching entry_64.S up to 4.0.0-rc1.
  
   x0001.patch is the latest, x0015.patch is the oldest.
  
   Patches 0003 and 0008 are not there since 0003 is empty merge patch
   and 0008 does some PCI fixup.
  
   If this breakage is recent, it ought to be one of these.
   Most of them do some non-trivial surgery.
  
   Even though I did not spot anything suspicious in them,
   entry.S is notorious for subtle breakage.
  
   Try reverting them in sequence starting from x0001.patch
   and see reverting which one makes crash disappear.
 
  OK, I'm going to check these git series.

 Reverting the commit
 96b6352c12711d5c0bb7157f49c92580248e8146
 x86_64, entry: Remove the syscall exit audit and schedule optimizations

 seems enough.  After reverting this one, the machine runs stable with
 the kvm stress test.

 (I'll keep test running for a while; at the previous bisection, I hit
  the bug right after posting the mail ;)

 It survived long enough, so this looks like the spot.

 Also, I checked the patch below instead of reverting the commit, and
 this seems to work, too.


 Takashi

 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index 1d74d161687c..5340ac7f88a9 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -364,12 +364,12 @@ system_call_fastpath:
   * Has incomplete stack frame and undefined top of stack.
   */
  ret_from_sys_call:
 -   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 -   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 -
 LOCKDEP_SYS_EXIT
 DISABLE_INTERRUPTS(CLBR_NONE)
 TRACE_IRQS_OFF
 +   testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 +   jnz int_ret_from_sys_call_fixup /* Go the the slow path */
 +
 CFI_REMEMBER_STATE
 /*
  * sysretq will re-enable interrupts:

The crash you're seeing could certainly be caused by an IRQ at the
wrong time.  However:

int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
jmp int_ret_from_sys_call

and

GLOBAL(int_ret_from_sys_call)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF

so with or without your little patch, we're turning off IRQs very
quickly.  retint_swapgs also turns off interrupts before doing
anything.  So I don't see how your patch would have any effect.

I'm starting to wonder if the problem has something to do with running
fire_user_return_notifiers with IRQs on.  We appear to do that, and it
seems rather questionable to me that it's safe, given the sneaky
things that KVM does in there.

If we end up in user mode with a bad MSR_SYSCALL_MASK, we could see
your crash, although I don't see how that would happen either.

I'll try to write a diagnostic patch later this morning.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 5:57 PM, Andy Lutomirski  wrote:
>
>>   sp = 140735967860552,
>
> 0x7fffa55f1748
>
> Note that the double fault happened with rsp == 0x7fffa55eafb8,
> which is the saved rsp here - 0x6790.  That difference is kind of large
> to make sense if this is a sysret problem.  Not that I have a better
> explanation...

Actually, that kind of large difference is what I'd expect if it's a
GP fault on sysret that then cascades to more faults because our kernel
stack pointer is crap.

So it starts with getting a GP fault due to the sysret, but now we're
in la-la-land with really odd core register state, so what's not to
say that we don't get a recursive fault. We don't use the kernel stack
pointer for getting thread-info any more like we used to, but we still
have code like this in entry_64.S:

testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)

which seems to know that the thread info is below the kernel stack. So
let's say that the GP fault starts taking a recursive GP faults (or
recursive page faults) due to confusion with thread_info accesses or
something. And the stack keeps growing down, because all the faults
just fault themselves. Until finally we hit an unmapped area, and that
stops it - because while we had recursive faulting before, it was our
kernel code that was confused. But now the fault handling ends up
taking a page fault while setting up the error information.

You would *not* expect the stack to be unmapped just under the
original %rsp value. User space has big frames and probably had deep
call chains before it ever hit the problematic case, so there's some
"slop" on the user stack. Only when we run out of slop do we get the
double-fault. Which explains why you should *not* expect the %rsp
values to be similar.

And around 30kB of stack before that happens sounds quite reasonable.

Now, to be honest, I don't see why we'd get the cascading faults, I
just get this feeling that if %rsp is crap, just about anything might
go wrong, and that if it's sysret taking a #GP fault, we're just
screwed.
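Linus's point is that a bad %rsp poisons anything derived from it. The older mask-based thread_info lookup he alludes to makes this easy to see. thread_info lived at the bottom of the THREAD_SIZE-aligned kernel stack, so masking any in-stack address located it. Fed a user RSP, the same arithmetic lands in user memory. The sketch assumes a 16 KiB THREAD_SIZE, which x86-64 used at the time. As Linus notes, 4.0 itself no longer does the lookup this way. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 kernel stack size: 16 KiB. */
#define THREAD_SIZE_ASSUMED (16u * 1024u)

/* Older-style lookup: thread_info sits at the bottom of the
 * THREAD_SIZE-aligned kernel stack, so masking any address inside the
 * stack finds it. A user RSP yields a user-space address instead. */
static uint64_t thread_info_from_sp(uint64_t sp)
{
    return sp & ~(uint64_t)(THREAD_SIZE_ASSUMED - 1);
}
```

With the kernel frame address from Stefan's dump (0xffff8801013d7f58), this yields the stack base 0xffff8801013d4000. With the double-fault RSP 0x7fffa55eafb8, it yields 0x7fffa55e8000, which is ordinary user memory.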

 Linus


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 5:23 PM, Stefan Seyfried
 wrote:
> Am 19.03.2015 um 00:22 schrieb Andy Lutomirski:
>> On Wed, Mar 18, 2015 at 3:40 PM, Andy Lutomirski  wrote:
>>> Yes, it's userspace.  Thanks for checking, though.
>>
>> One more stupid hunch:
>>
>> Can you do:
>> x/21xg ffff8801013d4f58
>>
>> If I counted right, that'll dump task_pt_regs(current).
>
> That's all zeroes:
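> crash> x /21xg 0xffff8801013d4f58
> 0xffff8801013d4f58: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4f68: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4f78: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4f88: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4f98: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4fa8: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4fb8: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4fc8: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4fd8: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4fe8: 0x0000000000000000  0x0000000000000000
> 0xffff8801013d4ff8: 0x0000000000000000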
>
> But maybe you counted wrong (or I'm reading arch/x86/include/asm/processor.h 
> wrong, which is at least as likely...).
>
> #define task_pt_regs(tsk)  ((struct pt_regs *)(tsk)->thread.sp0 - 1)
>
> => I have the task_struct readily available decoded in the crash utility.
>
> crash> task, search for thread, in thread:
>  sp0 = 18446612136629993472
> crash> eval 18446612136629993472
> hexadecimal: ffff8801013d8000  (18014269664677728KB)

I did indeed count wrong -- THREAD_SIZE != 0x1000.  Whoops.
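For reference, task_pt_regs() places the saved register frame immediately below thread.sp0. With the 21 eight-byte fields shown in the dump, struct pt_regs is 168 (0xa8) bytes. The frame Stefan prints therefore starts at sp0 - 0xa8. The sketch below redoes that arithmetic with a local mirror struct. It is an illustrative copy of the field order printed by crash, not the kernel header, and it assumes an LP64 target.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the x86-64 pt_regs field order printed by
 * crash (21 unsigned-long fields); assumes LP64. */
struct pt_regs_mirror {
    unsigned long r15, r14, r13, r12, bp, bx;
    unsigned long r11, r10, r9, r8, ax, cx, dx, si, di;
    unsigned long orig_ax, ip, cs, flags, sp, ss;
};

_Static_assert(sizeof(struct pt_regs_mirror) == 21 * 8,
               "expected 21 eight-byte slots");

/* Same arithmetic as task_pt_regs(): the frame ends exactly at sp0. */
static uint64_t pt_regs_addr(uint64_t sp0)
{
    return sp0 - sizeof(struct pt_regs_mirror);
}
```

With sp0 = 18446612136629993472 (0xffff8801013d8000), this gives 0xffff8801013d7f58. Now assume a 16 KiB THREAD_SIZE, so the stack base is sp0 - 0x4000. Then base + 0x1000 - 0xa8 is 0xffff8801013d4f58, the earlier guess. That matches the "THREAD_SIZE != 0x1000" remark.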

> 
> crash> print *(struct pt_regs *)(18446612136629993472 - sizeof(struct 
> pt_regs))

Looks like we last entered via an io_submit syscall.

> $20 = {
>   r15 = 18446744071585666077,
>   r14 = 16,
>   r13 = 582,
>   r12 = 18446612136629993352,
>   bp = 24,
>   bx = 18446744071585666061,
>   r11 = 582,

==flags, which is consistent with a syscall.  However, Denys' big
cleanup isn't in play here, so we probably did FIXUP_TOP_OF_STACK,
maybe even in the syscall in question.

>   r10 = 10760856,
>   r9 = 140712613762160,
>   r8 = 140735967861216,
>   ax = 1,

Entirely reasonable if we're trying to exit from io_submit.

>   cx = 140712476030103,

0x7ffa2d263497

>   dx = 140712613782304,
>   si = 1,
>   di = 140712589295616,
>   orig_ax = 209,

__NR_io_submit

>   ip = 140712571864823,

0x7ffa32dc86f7, which is not equal to cx (oddly, given that this seems
to have been a syscall) and is canonical.  To me, this suggests that
FIXUP_TOP_OF_STACK last executed on a different syscall, in which case
all this opportunistic sysret stuff is a red herring - we never
executed FIXUP_TOP_OF_STACK for this syscall.
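The inference above rests on the SYSCALL register convention: the instruction leaves the user return address in RCX and the user RFLAGS in R11. A frame that really describes the current syscall would therefore be expected to show cx == ip and r11 == flags. The sketch just applies those two comparisons to the values in Stefan's dump. The struct is a hypothetical subset of pt_regs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the saved registers (decimal values as dumped). */
struct saved_regs {
    uint64_t cx, r11, ip, flags;
};

/* SYSCALL puts the return RIP in RCX and RFLAGS in R11. */
static bool rcx_matches_rip(const struct saved_regs *r)   { return r->cx == r->ip; }
static bool r11_matches_flags(const struct saved_regs *r) { return r->r11 == r->flags; }

static const struct saved_regs dumped = {
    .cx    = 140712476030103ULL, /* 0x7ffa2d263497 */
    .r11   = 582,                /* 0x246 */
    .ip    = 140712571864823ULL, /* 0x7ffa32dc86f7 */
    .flags = 582,
};
```

Here r11_matches_flags(&dumped) holds but rcx_matches_rip(&dumped) does not. That is the mismatch pointed out above.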

>   cs = 51,

__USER_CS

>   flags = 582,

0x246 (i.e. totally normal for userspace, I think)
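Decoding 582 = 0x246 bit by bit shows why it looks ordinary. Only the always-one reserved bit 1, PF, ZF and IF are set, and TF is clear. Below is a small C decoder using the architectural bit positions from the Intel/AMD manuals. The FLAG_* names are local to this sketch, not the kernel's own constants.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural RFLAGS bit positions (local names for this sketch). */
#define FLAG_CF  (UINT64_C(1) << 0)
#define FLAG_RSV (UINT64_C(1) << 1)   /* reserved, always reads as 1 */
#define FLAG_PF  (UINT64_C(1) << 2)
#define FLAG_AF  (UINT64_C(1) << 4)
#define FLAG_ZF  (UINT64_C(1) << 6)
#define FLAG_SF  (UINT64_C(1) << 7)
#define FLAG_TF  (UINT64_C(1) << 8)
#define FLAG_IF  (UINT64_C(1) << 9)
#define FLAG_DF  (UINT64_C(1) << 10)
#define FLAG_OF  (UINT64_C(1) << 11)

static const struct {
    uint64_t bit;
    const char *name;
} rflags_names[] = {
    { FLAG_CF, "CF" }, { FLAG_RSV, "RSV1" }, { FLAG_PF, "PF" },
    { FLAG_AF, "AF" }, { FLAG_ZF, "ZF" },    { FLAG_SF, "SF" },
    { FLAG_TF, "TF" }, { FLAG_IF, "IF" },    { FLAG_DF, "DF" },
    { FLAG_OF, "OF" },
};

/* Writes the names of the set bits, lowest bit first, separated by
 * spaces. Output is truncated if buf is too small; len must be >= 1. */
static void decode_rflags(uint64_t flags, char *buf, size_t len)
{
    size_t used = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof rflags_names / sizeof rflags_names[0]; i++) {
        if (!(flags & rflags_names[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", rflags_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return;
        used += (size_t)n;
    }
}
```

decode_rflags(582, ...) produces "RSV1 PF ZF IF". Interrupts are enabled and single-stepping is off, as expected for ordinary user mode.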

>   sp = 140735967860552,

0x7fffa55f1748

Note that the double fault happened with rsp == 0x7fffa55eafb8,
which is the saved rsp here - 0x6790.  That difference is kind of large
to make sense if this is a sysret problem.  Not that I have a better
explanation...
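The 0x6790 figure is plain subtraction: the fault-time RSP taken from the saved user sp, since stacks grow down. It comes to 26512 bytes, roughly 26 KB. That is in line with Linus's "around 30kB" estimate in his reply. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stacks grow down, so the bytes consumed between two snapshots are
 * the older (higher) RSP minus the newer (lower) one. */
static uint64_t stack_bytes_used(uint64_t older_sp, uint64_t newer_sp)
{
    return older_sp - newer_sp;
}

/* Saved user sp from pt_regs (decimal in the crash dump) and the RSP
 * observed at the double fault. */
static const uint64_t saved_user_sp   = 140735967860552ULL; /* 0x7fffa55f1748 */
static const uint64_t double_fault_sp = 0x7fffa55eafb8ULL;
```

The arithmetic cannot tell where those bytes went. They may have been used by ordinary user code between syscalls, or by a chain of nested faults on the user stack. That is the open question here.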

OTOH, if it's a syscall problem, then these regs are from the previous
syscall, so 0x6790 bytes of additional user stack usage is entirely
sensible.  Alternatively, we could have taken a whole pile of nested
page faults until we crossed into the land of unwritable user stack
pages.

>   ss = 43

__USER_DS
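The cs and ss values decode like any x86 segment selector. Bits 15..3 are the descriptor-table index, bit 2 selects GDT (0) or LDT (1), and bits 1..0 are the RPL. So 51 (0x33) is GDT entry 6 with RPL 3, and 43 (0x2b) is GDT entry 5 with RPL 3. These match __USER_CS and __USER_DS on x86-64. Below is a minimal decoder. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment selector. */
struct selector {
    unsigned index; /* descriptor-table slot */
    unsigned ti;    /* 0 = GDT, 1 = LDT */
    unsigned rpl;   /* requested privilege level, 3 = user */
};

static struct selector decode_selector(uint16_t sel)
{
    struct selector s = {
        .index = sel >> 3,
        .ti    = (sel >> 2) & 1u,
        .rpl   = sel & 3u,
    };
    return s;
}
```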

> }
>
> =>
> r15 = ffffffff8168141d
> r12 = ffff8801013d7f88
> bx  = ffffffff8168140d
> r9  = 7ffa355bd470
> ip  = 7ffa32dc86f7
> sp  = 7fffa55f1748
>
> looks somehow legit, to my totally untrained eye (ip and sp actually).

One potentially interesting thing that changed is that we now return
from KVM to userspace (to the next scheduled task, not necessarily to
the run ioctl) via sysret *even if the user return notifier runs*.
This was part of the point of the opportunistic sysret code, and KVM
seems to be involved here.

>
> I'm off to bed now (01:20 around here ;), will be back in about 7 hours.

Thanks for the evening debugging help :)

FWIW, I just noticed that stub_execveat incorrectly calls
RESTORE_TOP_OF_STACK before jumping to int_ret_from_sys_call.
Actually, there seems to be an impressive number of bugs like that
(the syscall slow path totally screws this up, but it seems harmless
to me).  I'm really glad that Denys is removing that code...

Stefan, do you happen to know whether your disassembly of page_fault
came from the instructions in memory or if they came from the vmlinux
file?  Not that I have any relevant ideas there.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 19.03.2015 um 00:22 schrieb Andy Lutomirski:
> On Wed, Mar 18, 2015 at 3:40 PM, Andy Lutomirski  wrote:
>> Yes, it's userspace.  Thanks for checking, though.
> 
> One more stupid hunch:
> 
> Can you do:
> x/21xg ffff8801013d4f58
> 
> If I counted right, that'll dump task_pt_regs(current).

That's all zeroes:
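crash> x /21xg 0xffff8801013d4f58
0xffff8801013d4f58: 0x0000000000000000  0x0000000000000000
0xffff8801013d4f68: 0x0000000000000000  0x0000000000000000
0xffff8801013d4f78: 0x0000000000000000  0x0000000000000000
0xffff8801013d4f88: 0x0000000000000000  0x0000000000000000
0xffff8801013d4f98: 0x0000000000000000  0x0000000000000000
0xffff8801013d4fa8: 0x0000000000000000  0x0000000000000000
0xffff8801013d4fb8: 0x0000000000000000  0x0000000000000000
0xffff8801013d4fc8: 0x0000000000000000  0x0000000000000000
0xffff8801013d4fd8: 0x0000000000000000  0x0000000000000000
0xffff8801013d4fe8: 0x0000000000000000  0x0000000000000000
0xffff8801013d4ff8: 0x0000000000000000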

But maybe you counted wrong (or I'm reading arch/x86/include/asm/processor.h 
wrong, which is at least as likely...).

#define task_pt_regs(tsk)  ((struct pt_regs *)(tsk)->thread.sp0 - 1)

=> I have the task_struct readily available decoded in the crash utility.

crash> task, search for thread, in thread:
 sp0 = 18446612136629993472
crash> eval 18446612136629993472
hexadecimal: ffff8801013d8000  (18014269664677728KB)

crash> print *(struct pt_regs *)(18446612136629993472 - sizeof(struct pt_regs))
$20 = {
  r15 = 18446744071585666077, 
  r14 = 16, 
  r13 = 582, 
  r12 = 18446612136629993352, 
  bp = 24, 
  bx = 18446744071585666061, 
  r11 = 582, 
  r10 = 10760856, 
  r9 = 140712613762160, 
  r8 = 140735967861216, 
  ax = 1, 
  cx = 140712476030103, 
  dx = 140712613782304, 
  si = 1, 
  di = 140712589295616, 
  orig_ax = 209, 
  ip = 140712571864823, 
  cs = 51, 
  flags = 582, 
  sp = 140735967860552, 
  ss = 43
}

=>
r15 = ffffffff8168141d
r12 = ffff8801013d7f88
bx  = ffffffff8168140d
r9  = 7ffa355bd470
ip  = 7ffa32dc86f7
sp  = 7fffa55f1748

looks somehow legit, to my totally untrained eye (ip and sp actually).

I'm off to bed now (01:20 around here ;), will be back in about 7 hours.

Best regards,

Stefan
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:40 PM, Andy Lutomirski  wrote:
> On Wed, Mar 18, 2015 at 3:38 PM, Stefan Seyfried
>  wrote:
>> Am 18.03.2015 um 23:29 schrieb Andy Lutomirski:
>>> On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
 On Wed, 18 Mar 2015, Andy Lutomirski wrote:

> sysret64 can only fail with #GP, and we're totally screwed if that
> happens,

 But what if the GPF handler pagefaults afterwards? It'd be operating on
 user stack already.
>>>
>>> Good point.
>>>
>>> Stefan, can you try changing the first "jne
>>> opportunistic_sysret_failed" to "jmp opportunistic_sysret_failed" in
>>> entry_64.S and seeing if you can reproduce this?  (Is it easy enough
>>> to reproduce that this would tell us anything?)
>>
>> I have no good way of reproducing the issue (happens once per week...)
>> but apparently Takashi has, so I'd like to hand this task over to him.
>>
>>> It's a shame that double_fault doesn't record what gs was on entry.
>>> If we did sysret -> general_protection -> page_fault -> double_fault,
>>> then we'd enter double_fault with usergs, whereas syscall ->
>>> page_fault -> double_fault would enter double_fault with kernelgs.
>>>
>>> Hmm.  We may be able to answer this more directly.  Stefan, can you
>>> dump a couple hundred bytes starting at 0x7fffa55eafb8 (i.e. your
>>> page_fault stack at the time of the failure)?  That will tell us the
>>> faulting address.  If that fails, try starting at 7fffa55eb000
>>> instead.
>>
>> Unfortunately not, is this userspace memory? It's not in the dump I have.
>> This issue is the first I have seen where having a full dump would be
>> really helpful apart from cosmetic reasons...
>
> Yes, it's userspace.  Thanks for checking, though.

One more stupid hunch:

Can you do:
x/21xg ffff8801013d4f58

If I counted right, that'll dump task_pt_regs(current).

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:38 PM, Stefan Seyfried
 wrote:
> Am 18.03.2015 um 23:29 schrieb Andy Lutomirski:
>> On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
>>> On Wed, 18 Mar 2015, Andy Lutomirski wrote:
>>>
 sysret64 can only fail with #GP, and we're totally screwed if that
 happens,
>>>
>>> But what if the GPF handler pagefaults afterwards? It'd be operating on
>>> user stack already.
>>
>> Good point.
>>
>> Stefan, can you try changing the first "jne
>> opportunistic_sysret_failed" to "jmp opportunistic_sysret_failed" in
>> entry_64.S and seeing if you can reproduce this?  (Is it easy enough
>> to reproduce that this would tell us anything?)
>
> I have no good way of reproducing the issue (happens once per week...)
> but apparently Takashi has, so I'd like to hand this task over to him.
>
>> It's a shame that double_fault doesn't record what gs was on entry.
>> If we did sysret -> general_protection -> page_fault -> double_fault,
>> then we'd enter double_fault with usergs, whereas syscall ->
>> page_fault -> double_fault would enter double_fault with kernelgs.
>>
>> Hmm.  We may be able to answer this more directly.  Stefan, can you
>> dump a couple hundred bytes starting at 0x7fffa55eafb8 (i.e. your
>> page_fault stack at the time of the failure)?  That will tell us the
>> faulting address.  If that fails, try starting at 7fffa55eb000
>> instead.
>
> Unfortunately not, is this userspace memory? It's not in the dump I have.
> This issue is the first I have seen where having a full dump would be
> really helpful apart from cosmetic reasons...

Yes, it's userspace.  Thanks for checking, though.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 18.03.2015 um 23:29 schrieb Andy Lutomirski:
> On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
>> On Wed, 18 Mar 2015, Andy Lutomirski wrote:
>>
>>> sysret64 can only fail with #GP, and we're totally screwed if that
>>> happens,
>>
>> But what if the GPF handler pagefaults afterwards? It'd be operating on
>> user stack already.
> 
> Good point.
> 
> Stefan, can you try changing the first "jne
> opportunistic_sysret_failed" to "jmp opportunistic_sysret_failed" in
> entry_64.S and seeing if you can reproduce this?  (Is it easy enough
> to reproduce that this would tell us anything?)

I have no good way of reproducing the issue (happens once per week...)
but apparently Takashi has, so I'd like to hand this task over to him.

> It's a shame that double_fault doesn't record what gs was on entry.
> If we did sysret -> general_protection -> page_fault -> double_fault,
> then we'd enter double_fault with usergs, whereas syscall ->
> page_fault -> double_fault would enter double_fault with kernelgs.
> 
> Hmm.  We may be able to answer this more directly.  Stefan, can you
> dump a couple hundred bytes starting at 0x7fffa55eafb8 (i.e. your
> page_fault stack at the time of the failure)?  That will tell us the
> faulting address.  If that fails, try starting at 7fffa55eb000
> instead.

Unfortunately not, is this userspace memory? It's not in the dump I have.
This issue is the first I have seen where having a full dump would be
really helpful apart from cosmetic reasons...


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
> On Wed, 18 Mar 2015, Andy Lutomirski wrote:
>
>> sysret64 can only fail with #GP, and we're totally screwed if that
>> happens,
>
> But what if the GPF handler pagefaults afterwards? It'd be operating on
> user stack already.

Good point.

Stefan, can you try changing the first "jne
opportunistic_sysret_failed" to "jmp opportunistic_sysret_failed" in
entry_64.S and seeing if you can reproduce this?  (Is it easy enough
to reproduce that this would tell us anything?)

It's a shame that double_fault doesn't record what gs was on entry.
If we did sysret -> general_protection -> page_fault -> double_fault,
then we'd enter double_fault with usergs, whereas syscall ->
page_fault -> double_fault would enter double_fault with kernelgs.
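The GS argument can be stated as a toy model. An exception entry that interrupts user mode (saved CS with RPL 3) executes SWAPGS and switches to the kernel GS base. An entry that interrupts kernel mode leaves GS alone. The kernel executes SWAPGS just before SYSRET, and a faulting SYSRET on Intel raises #GP at CPL 0. So the GS value seen on entry to double_fault would differ between the two chains. The model is illustrative only. It ignores the paranoid entry logic, which inspects MSR_GS_BASE, and all names in it are hypothetical.

```c
#include <assert.h>

enum gs_state { GS_USER, GS_KERNEL };

/* Toy rule: SWAPGS happens only when the interrupted context was user
 * mode; otherwise GS is left as it was. */
static enum gs_state exception_entry(enum gs_state gs, int interrupted_user)
{
    return interrupted_user ? GS_KERNEL : gs;
}

/* syscall -> page_fault -> double_fault */
static enum gs_state gs_entering_df_via_syscall(void)
{
    enum gs_state gs = GS_USER;      /* user mode issues SYSCALL */
    gs = GS_KERNEL;                  /* SWAPGS in the syscall entry */
    gs = exception_entry(gs, 0);     /* #PF taken in kernel mode */
    return gs;                       /* GS seen on #DF entry */
}

/* sysret -> general_protection -> page_fault -> double_fault */
static enum gs_state gs_entering_df_via_sysret_gp(void)
{
    enum gs_state gs = GS_KERNEL;    /* on the syscall exit path */
    gs = GS_USER;                    /* SWAPGS just before SYSRET */
    gs = exception_entry(gs, 0);     /* SYSRET #GP is raised at CPL 0 */
    gs = exception_entry(gs, 0);     /* nested #PF, still kernel mode */
    return gs;                       /* GS seen on #DF entry */
}
```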

Hmm.  We may be able to answer this more directly.  Stefan, can you
dump a couple hundred bytes starting at 0x7fffa55eafb8 (i.e. your
page_fault stack at the time of the failure)?  That will tell us the
faulting address.  If that fails, try starting at 7fffa55eb000
instead.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:28 PM, Linus Torvalds
 wrote:
> On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
>>
>> But what if the GPF handler pagefaults afterwards? It'd be operating on
>> user stack already.
>
> So I think this might be the answer. We don't see the GP fault,
> because we don't have a backtrace, because that backtrace is on the
> user stack (which is why the stack trace dumping fails - we should
> probably fix that, btw - the second oops is just confusing and not
> helpful).
>
> Is the Intel check for canonical address (that __VIRTUAL_MASK_SHIFT
> thing) perhaps wrong or not as strict as Intel CPUs do? We'd never
> notice in normal situations.

I explicitly tested that I could blow up the kernel if I intentionally
broke that test, and I couldn't blow it up with the test as written.
That doesn't prove it's correct, though.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 3:22 PM, Jiri Kosina  wrote:
>
> But what if the GPF handler pagefaults afterwards? It'd be operating on
> user stack already.

So I think this might be the answer. We don't see the GP fault,
because we don't have a backtrace, because that backtrace is on the
user stack (which is why the stack trace dumping fails - we should
probably fix that, btw - the second oops is just confusing and not
helpful).

Is the Intel check for canonical address (that __VIRTUAL_MASK_SHIFT
thing) perhaps wrong or not as strict as Intel CPUs do? We'd never
notice in normal situations.
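For reference, "canonical" on x86-64 with 4-level paging means bits 63..47 are all copies of bit 47. In other words, the address is the sign extension of its low 48 bits. The opportunistic SYSRET code of this period tests that with a shl/sar pair built from __VIRTUAL_MASK_SHIFT. The sketch below states the same condition without signed shifts. VIRTUAL_MASK_SHIFT = 47 is an assumption matching 48-bit virtual addresses, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for 4-level paging (48-bit virtual addresses). */
#define VIRTUAL_MASK_SHIFT 47

/* Canonical: bits 63..VIRTUAL_MASK_SHIFT are all 0 or all 1. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> VIRTUAL_MASK_SHIFT;   /* 17 bits remain */
    return top == 0 || top == (UINT64_MAX >> VIRTUAL_MASK_SHIFT);
}
```

Both the user ip from Stefan's dump (0x7ffa32dc86f7) and the kernel sp0 (0xffff8801013d8000) are canonical. 0x0000800000000000, the first address past the user half, is not.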

Linus


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
On Wed, Mar 18, 2015 at 11:20 PM, Andy Lutomirski  wrote:
>> There is an easy way to test the theory that SYSRET is to blame.
>>
>> Just replace
>>
>> movq RCX(%rsp),%rcx
>> cmpq %rcx,RIP(%rsp) /* RCX == RIP */
>> jne opportunistic_sysret_failed
>>
>> this "jne" with "jmp", and try to reproduce.
>>
>
> This is a classic root exploit, and it's why we check for
> non-canonical RIP.  In theory, that's the only way this can happen.
> Intel screwed up -- AMD never fails SYSRET.

I'm not saying the code needs to be changed.

I'm saying that *people who see the crash* can make this change and
run the modified kernel. If the crash disappears,
then it is caused by "opportunistic SYSRET".


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:18 PM, Linus Torvalds
 wrote:
> On Wed, Mar 18, 2015 at 2:55 PM, Andy Lutomirski  wrote:
>>
>> On Xen, it goes to xen_sysret64, which touches the same percpu
>> variables that we touch on entry.  So I still like my percpu vmap
>> fault hypothesis, even though I don't understand what would trigger
>> it.
>
> I don't dislike the theory per se, but not only don't I see how it
> could happen on regular execution on a laptop, but I also don't see
> why this fault behavior would be new to 4.0.
>
> (And I do believe that we should make sure that CPU bringup ends up
> faulting in the percpu area, even if I don't really see why that would
> be the issue here)
>
> Afaik, the system call entry code hasn't changed at all.
>
> What *has* changed is the "paranoid" handling (double-fault has that
> magical "paranoid=2" thing, for example) and the return to user-space
> code.

Indeed.  If this were #DB, #BP, or #MC, I'd believe that, but the page
fault code didn't change.  And double-fault didn't materially change
-- the paranoid=2 thing means to opt *out* of the recent changes.  So
I'm not convinced by that theory.

>
> Which is really why I don't believe in that syscall thing. Not because
> it isn't the obvious culprit, but simply because it hasn't *changed*.
>
> Or is there something subtle I've missed?

We did change one thing here: for the first time* it's possible to
exit using sysret when we didn't enter using syscall.  But this really
shouldn't matter on native, since we don't touch any memory at all
between the stack switch and sysret.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Jiri Kosina
On Wed, 18 Mar 2015, Andy Lutomirski wrote:

> sysret64 can only fail with #GP, and we're totally screwed if that
> happens, 

But what if the GPF handler pagefaults afterwards? It'd be operating on 
user stack already.

-- 
Jiri Kosina
SUSE Labs



Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 3:17 PM, Denys Vlasenko  wrote:
> On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko  wrote:
>>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski
>>>> wrote:
>>>>>>
>>>>>> crash> disassemble page_fault
>>>>>> Dump of assembler code for function page_fault:
>>>>>>    0x816834a0 <+0>:    data32 xchg %ax,%ax
>>>>>>    0x816834a3 <+3>:    data32 xchg %ax,%ax
>>>>>>    0x816834a6 <+6>:    data32 xchg %ax,%ax
>>>>>>    0x816834a9 <+9>:    sub    $0x78,%rsp
>>>>>>    0x816834ad <+13>:   callq  0x81683620
>>>>>
>>>>> The callq was the double-faulting instruction, and it is indeed the
>>>>> first function in here that would have accessed the stack.  (The sub
>>>>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>>>>> page fault, and the page fault is promoted to a double fault.  The
>>>>> surprising thing is that the page fault itself seems to have been
>>>>> delivered okay, and RSP wasn't on a page boundary.
>>>>
>>>> Not at all surprising, and sure it was on a page boundary..
>>>>
>>>> Look closer.
>>>>
>>>> %rsp is 7fffa55eafb8.
>>>>
>>>> But that's *after* page_fault has done that
>>>>
>>>>     sub    $0x78,%rsp
>>>>
>>>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>>>> different page.
>>
>> Ah, I forgot to add 0x78.  You're right, of course.
>>
>>>>
>>>> And that page happened to be mapped.
>>>>
>>>> So what happened is:
>>>>
>>>>  - we somehow entered kernel mode without switching stacks
>>>>
>>>>    (ie presumably syscall)
>>>>
>>>>  - the user stack was still fine
>>>>
>>>>  - we took a page fault, which once again didn't switch stacks,
>>>> because we were already in kernel mode. And this page fault worked,
>>>> because it just pushed the error code onto the user stack which was
>>>> mapped.
>>>>
>>>>  - we now took a second page fault within the page fault handler,
>>>> because now the stack pointer has been decremented and points one user
>>>> page down that is *not* mapped, so now that page fault cannot push the
>>>> error code and return information.
>>>>
>>>> Now, how we took that original page fault is sadly not very clear at
>>>> all.  I agree that it's something about system-call (how could we not
>>>> change stacks otherwise), but why it should have started now, I don't
>>>> know. I don't think "system_call" has changed at all.
>>>>
>>>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>>>> and that "use sysret to return to user mode" thing. Because this code
>>>> sequence:
>>>>
>>>> +   movq (RSP-RIP)(%rsp),%rsp
>>>> +   USERGS_SYSRET64
>>>>
>>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>>>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>>>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>>>> takes some fault.
>>>
>>> Yes, so far we happily thought that SYSRET never fails...
>>>
>>> This merits adding some code which would at least BUG_ON
>>> if the faulting address is seen to match SYSRET64.
>>
>> sysret64 can only fail with #GP, and we're totally screwed if that
>> happens, although I agree about the BUG_ON in principle.  Where would
>> we add it that would help in this case, though?  We never even made it
>> to C code.
>>
>> In any event, this was a page fault.  sysret64 doesn't access memory.
>
> Let's see.
>
> Faulting SYSRET will still be in CPL0.
> It would drop CPU into the #GP handler
> but %rsp is already loaded with _user_ %rsp (!).
>
> #GP handler will start pushing stuff onto stack,
> happily thinking that it is a kernel stack.
>
> This can cause a page fault.
>
> Most likely, this page fault won't succeed,
> and we'd get a double fault with %rip somewhere in the #GP handler.
>
> Yes, this doesn't entirely match what we see...
>
> There is an easy way to test the theory that SYSRET is to blame.
>
> Just replace
>
> movq RCX(%rsp),%rcx
> cmpq %rcx,RIP(%rsp) /* RCX == RIP */
> jne opportunistic_sysret_failed
>
> this "jne" with "jmp", and try to reproduce.
>

This is a classic root exploit, and it's why we check for
non-canonical RIP.  In theory, that's the only way this can happen.
Intel screwed up -- AMD never fails SYSRET.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 2:55 PM, Andy Lutomirski  wrote:
>
> On Xen, it goes to xen_sysret64, which touches the same percpu
> variables that we touch on entry.  So I still like my percpu vmap
> fault hypothesis, even though I don't understand what would trigger
> it.

I don't dislike the theory per se, but not only don't I see how it
could happen on regular execution on a laptop, but I also don't see
why this fault behavior would be new to 4.0.

(And I do believe that we should make sure that CPU bringup ends up
faulting in the percpu area, even if I don't really see why that would
be the issue here)

Afaik, the system call entry code hasn't changed at all.

What *has* changed is the "paranoid" handling (double-fault has that
magical "paranoid=2" thing, for example) and the return to user-space
code.

Which is really why I don't believe in that syscall thing. Not because
it isn't the obvious culprit, but simply because it hasn't *changed*.

Or is there something subtle I've missed?

Linus


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko  wrote:
>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski  
>>> wrote:
>>>>>
>>>>> crash> disassemble page_fault
>>>>> Dump of assembler code for function page_fault:
>>>>>    0x816834a0 <+0>:    data32 xchg %ax,%ax
>>>>>    0x816834a3 <+3>:    data32 xchg %ax,%ax
>>>>>    0x816834a6 <+6>:    data32 xchg %ax,%ax
>>>>>    0x816834a9 <+9>:    sub    $0x78,%rsp
>>>>>    0x816834ad <+13>:   callq  0x81683620
>>>>
>>>> The callq was the double-faulting instruction, and it is indeed the
>>>> first function in here that would have accessed the stack.  (The sub
>>>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>>>> page fault, and the page fault is promoted to a double fault.  The
>>>> surprising thing is that the page fault itself seems to have been
>>>> delivered okay, and RSP wasn't on a page boundary.
>>>
>>> Not at all surprising, and sure it was on a page boundary..
>>>
>>> Look closer.
>>>
>>> %rsp is 7fffa55eafb8.
>>>
>>> But that's *after* page_fault has done that
>>>
>>>     sub    $0x78,%rsp
>>>
>>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>>> different page.
> 
> Ah, I forgot to add 0x78.  You're right, of course.
> 
>>>
>>> And that page happened to be mapped.
>>>
>>> So what happened is:
>>>
>>>  - we somehow entered kernel mode without switching stacks
>>>
>>>(ie presumably syscall)
>>>
>>>  - the user stack was still fine
>>>
>>>  - we took a page fault, which once again didn't switch stacks,
>>> because we were already in kernel mode. And this page fault worked,
>>> because it just pushed the error code onto the user stack which was
>>> mapped.
>>>
>>>  - we now took a second page fault within the page fault handler,
>>> because now the stack pointer has been decremented and points one user
>>> page down that is *not* mapped, so now that page fault cannot push the
>>> error code and return information.
>>>
>>> Now, how we took that original page fault is sadly not very clear at
>>> all.  I agree that it's something about system-call (how could we not
>>> change stacks otherwise), but why it should have started now, I don't
>>> know. I don't think "system_call" has changed at all.
>>>
>>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>>> and that "use sysret to return to user mode" thing. Because this code
>>> sequence:
>>>
>>> +   movq (RSP-RIP)(%rsp),%rsp
>>> +   USERGS_SYSRET64
>>>
>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>>> takes some fault.
>>
>> Yes, so far we happily thought that SYSRET never fails...
>>
>> This merits adding some code which would at least BUG_ON
>> if the faulting address is seen to match SYSRET64.
> 
> sysret64 can only fail with #GP, and we're totally screwed if that
> happens, although I agree about the BUG_ON in principle.  Where would
> we add it that would help in this case, though?  We never even made it
> to C code.
> 
> In any event, this was a page fault.  sysret64 doesn't access memory.

Let's see.

Faulting SYSRET will still be in CPL0.
It would drop CPU into the #GP handler
but %rsp is already loaded with _user_ %rsp (!).

#GP handler will start pushing stuff onto stack,
happily thinking that it is a kernel stack.

This can cause a page fault.

Most likely, this page fault won't succeed,
and we'd get a double fault with %rip somewhere in the #GP handler.

Yes, this doesn't entirely match what we see...
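Whether a fault taken in the middle of this chain becomes #DF is governed by the Intel SDM's table of double-fault conditions. A simplified C sketch of that table, assuming the classic classification (vectors 0 and 10-13 contributory, 14 page fault, everything else benign; #DF itself and triple faults are not modeled). The function names are invented for illustration.

```c
#include <stdbool.h>

enum exc_class { BENIGN, CONTRIBUTORY, PAGE_FAULT };

static enum exc_class classify_vector(int vector)
{
	switch (vector) {
	case 0:		/* #DE divide error */
	case 10:	/* #TS invalid TSS */
	case 11:	/* #NP segment not present */
	case 12:	/* #SS stack-segment fault */
	case 13:	/* #GP general protection */
		return CONTRIBUTORY;
	case 14:	/* #PF page fault */
		return PAGE_FAULT;
	default:
		return BENIGN;
	}
}

/*
 * first:  vector whose delivery was in progress
 * second: vector raised while delivering it
 * Returns true if the CPU raises #DF instead of handling them serially.
 */
static bool becomes_double_fault(int first, int second)
{
	enum exc_class a = classify_vector(first);
	enum exc_class b = classify_vector(second);

	if (a == BENIGN || b == BENIGN)
		return false;			/* handled serially */
	if (a == CONTRIBUTORY)
		return b == CONTRIBUTORY;	/* #PF after contributory: serial */
	return true;				/* after #PF: contributory or #PF */
}
```

Under this table a #PF raised while pushing the #GP frame is handled serially rather than promoted, so in the scenario above the #DF would come one step later, when that #PF in turn cannot push its own frame onto the same bad stack.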

There is an easy way to test the theory that SYSRET is to blame.

Just replace

movq RCX(%rsp),%rcx
cmpq %rcx,RIP(%rsp) /* RCX == RIP */
jne opportunistic_sysret_failed

this "jne" with "jmp", and try to reproduce.
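To make the experiment concrete, here is a hypothetical C model of the decision the return path makes. `struct saved_regs`, `use_sysret()` and the `force_iret` switch are invented for illustration; only the RCX == RIP comparison comes from the snippet above, and the other two conditions are a simplified assumption about what the real code checks. Setting `force_iret` corresponds to turning the "jne" into a "jmp": every return then takes the IRET path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the register frame saved on kernel entry. */
struct saved_regs {
	uint64_t ip;     /* user RIP to return to */
	uint64_t cx;     /* RCX: SYSRET loads RIP from here */
	uint64_t r11;    /* R11: SYSRET loads RFLAGS from here */
	uint64_t flags;  /* saved RFLAGS */
};

/* Equivalent of the "jne" -> "jmp" test patch: never take SYSRET. */
static bool force_iret;

static bool use_sysret(const struct saved_regs *regs)
{
	if (force_iret)
		return false;
	if (regs->cx != regs->ip)        /* cmpq %rcx,RIP(%rsp); jne ... */
		return false;
	if (regs->r11 != regs->flags)    /* assumed: RFLAGS must match too */
		return false;
	if ((regs->ip >> 47) != 0)       /* assumed: user-half, canonical RIP */
		return false;
	return true;
}
```

If the crash stops reproducing with the equivalent of `force_iret` set, the opportunistic SYSRET path becomes the prime suspect, which is the inference the test is meant to allow.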



Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko  wrote:
> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski  
>> wrote:
>>>>
>>>> crash> disassemble page_fault
>>>> Dump of assembler code for function page_fault:
>>>>    0x816834a0 <+0>:    data32 xchg %ax,%ax
>>>>    0x816834a3 <+3>:    data32 xchg %ax,%ax
>>>>    0x816834a6 <+6>:    data32 xchg %ax,%ax
>>>>    0x816834a9 <+9>:    sub    $0x78,%rsp
>>>>    0x816834ad <+13>:   callq  0x81683620
>>>
>>> The callq was the double-faulting instruction, and it is indeed the
>>> first function in here that would have accessed the stack.  (The sub
>>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>>> page fault, and the page fault is promoted to a double fault.  The
>>> surprising thing is that the page fault itself seems to have been
>>> delivered okay, and RSP wasn't on a page boundary.
>>
>> Not at all surprising, and sure it was on a page boundary..
>>
>> Look closer.
>>
>> %rsp is 7fffa55eafb8.
>>
>> But that's *after* page_fault has done that
>>
>>     sub    $0x78,%rsp
>>
>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>> different page.

Ah, I forgot to add 0x78.  You're right, of course.

>>
>> And that page happened to be mapped.
>>
>> So what happened is:
>>
>>  - we somehow entered kernel mode without switching stacks
>>
>>(ie presumably syscall)
>>
>>  - the user stack was still fine
>>
>>  - we took a page fault, which once again didn't switch stacks,
>> because we were already in kernel mode. And this page fault worked,
>> because it just pushed the error code onto the user stack which was
>> mapped.
>>
>>  - we now took a second page fault within the page fault handler,
>> because now the stack pointer has been decremented and points one user
>> page down that is *not* mapped, so now that page fault cannot push the
>> error code and return information.
>>
>> Now, how we took that original page fault is sadly not very clear at
>> all.  I agree that it's something about system-call (how could we not
>> change stacks otherwise), but why it should have started now, I don't
>> know. I don't think "system_call" has changed at all.
>>
>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>> and that "use sysret to return to user mode" thing. Because this code
>> sequence:
>>
>> +   movq (RSP-RIP)(%rsp),%rsp
>> +   USERGS_SYSRET64
>>
>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>> takes some fault.
>
> Yes, so far we happily thought that SYSRET never fails...
>
> This merits adding some code which would at least BUG_ON
> if the faulting address is seen to match SYSRET64.

sysret64 can only fail with #GP, and we're totally screwed if that
happens, although I agree about the BUG_ON in principle.  Where would
we add it that would help in this case, though?  We never even made it
to C code.

In any event, this was a page fault.  sysret64 doesn't access memory.

>
> Now we only check for faulting IRETQ:
>
> error_kernelspace:
> CFI_REL_OFFSET rcx, RCX+8
> incl %ebx
> leaq native_irq_return_iret(%rip),%rcx
> cmpq %rcx,RIP+8(%rsp)
> je error_bad_iret
>
>>
>> Is PARAVIRT enabled? The three nop's at the beginning of 'page_fault'
>> makes me suspect it is,  and that that is some paravirt rewriting
>> area. What does paravirt do for that USERGS_SYSRET64 (or for
>> SWAPGS_UNSAFE_STACK, for that matter).

On Xen, it goes to xen_sysret64, which touches the same percpu
variables that we touch on entry.  So I still like my percpu vmap
fault hypothesis, even though I don't understand what would trigger
it.

At the risk of asking awful questions, what happens if we deliver an
IST interrupt in vmx_handle_external_intr?  Can that happen?  It can't
be a good thing if it happens.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 18.03.2015 um 22:49 schrieb Denys Vlasenko:
> Stefan, Takashi, can you post your /proc/cpuinfo
> and dmesg after boot?

susi:~ # cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU L9400  @ 1.86GHz
stepping        : 10
microcode       : 0xa0c
cpu MHz         : 1867.000
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts nopl aperfmperf pni dtes64 monitor ds_cpl
vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida dtherm tpr_shadow
vnmi flexpriority
bugs            :
bogomips        : 3723.96
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

(repeats for second core :)

I'm running 3.19 now, but the dmesg extracted from the crash
dump of 4.0-rc3 is at http://paste.opensuse.org/48196621
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
Stefan, Takashi, can you post your /proc/cpuinfo
and dmesg after boot?



Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 18.03.2015 um 22:32 schrieb Linus Torvalds:
> Is PARAVIRT enabled? The three nop's at the beginning of 'page_fault'
> makes me suspect it is,  and that that is some paravirt rewriting
> area. What does paravirt do for that USERGS_SYSRET64 (or for
> SWAPGS_UNSAFE_STACK, for that matter).

This is from the newer kernel package, but I doubt this configuration has
been changed in the openSUSE kernel:

susi:~ # grep PARAVIRT /boot/config-4.0.0-rc4-1.g126fc64-desktop
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y

So yes, PARAVIRT is enabled.

Best regards,

Stefan
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
On 03/18/2015 10:32 PM, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski  wrote:
>>>
>>> crash> disassemble page_fault
>>> Dump of assembler code for function page_fault:
>>>    0x816834a0 <+0>:    data32 xchg %ax,%ax
>>>    0x816834a3 <+3>:    data32 xchg %ax,%ax
>>>    0x816834a6 <+6>:    data32 xchg %ax,%ax
>>>    0x816834a9 <+9>:    sub    $0x78,%rsp
>>>    0x816834ad <+13>:   callq  0x81683620
>>
>> The callq was the double-faulting instruction, and it is indeed the
>> first function in here that would have accessed the stack.  (The sub
>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>> page fault, and the page fault is promoted to a double fault.  The
>> surprising thing is that the page fault itself seems to have been
>> delivered okay, and RSP wasn't on a page boundary.
> 
> Not at all surprising, and sure it was on a page boundary..
> 
> Look closer.
> 
> %rsp is 7fffa55eafb8.
> 
> But that's *after* page_fault has done that
> 
>     sub    $0x78,%rsp
> 
> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
> different page.
> 
> And that page happened to be mapped.
> 
> So what happened is:
> 
>  - we somehow entered kernel mode without switching stacks
> 
>(ie presumably syscall)
> 
>  - the user stack was still fine
> 
>  - we took a page fault, which once again didn't switch stacks,
> because we were already in kernel mode. And this page fault worked,
> because it just pushed the error code onto the user stack which was
> mapped.
> 
>  - we now took a second page fault within the page fault handler,
> because now the stack pointer has been decremented and points one user
> page down that is *not* mapped, so now that page fault cannot push the
> error code and return information.
> 
> Now, how we took that original page fault is sadly not very clear at
> all.  I agree that it's something about system-call (how could we not
> change stacks otherwise), but why it should have started now, I don't
> know. I don't think "system_call" has changed at all.
> 
> Maybe there is something wrong with the new "ret_from_sys_call" logic,
> and that "use sysret to return to user mode" thing. Because this code
> sequence:
> 
> +   movq (RSP-RIP)(%rsp),%rsp
> +   USERGS_SYSRET64
> 
> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
> kernel with a user stack pointer, maybe we're *exiting* the kernel,
> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
> takes some fault.

Yes, so far we happily thought that SYSRET never fails...

This merits adding some code which would at least BUG_ON
if the faulting address is seen to match SYSRET64.

Now we only check for faulting IRETQ:

error_kernelspace:
CFI_REL_OFFSET rcx, RCX+8
incl %ebx
leaq native_irq_return_iret(%rip),%rcx
cmpq %rcx,RIP+8(%rsp)
je error_bad_iret
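The same comparison as a hypothetical C sketch, with the proposed extra case made explicit. `iret_ip` stands in for `native_irq_return_iret`; `sysret_ip` stands for the address of the SYSRET instruction, which the real code would need a label for (an assumption; no such label appears above). All other names are invented.

```c
#include <stdint.h>

enum kernel_fault_kind {
	FAULT_ORDINARY,    /* no special handling */
	FAULT_BAD_IRET,    /* existing case: je error_bad_iret */
	FAULT_BAD_SYSRET,  /* proposed: "should never happen", so BUG */
};

/* Hypothetical addresses of the two return-to-user instructions. */
struct return_insns {
	uint64_t iret_ip;
	uint64_t sysret_ip;
};

/* fault_ip is the RIP saved in the exception frame (RIP+8(%rsp) above). */
static enum kernel_fault_kind
classify_kernel_fault(uint64_t fault_ip, const struct return_insns *ri)
{
	if (fault_ip == ri->iret_ip)
		return FAULT_BAD_IRET;
	if (fault_ip == ri->sysret_ip)
		return FAULT_BAD_SYSRET;
	return FAULT_ORDINARY;
}
```

The catch, raised in the replies, is that a #GP on SYSRET would already be running on the user RSP, so in the failure discussed here we may never reach code like this.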

> 
> Is PARAVIRT enabled? The three nop's at the beginning of 'page_fault'
> makes me suspect it is,  and that that is some paravirt rewriting
> area. What does paravirt do for that USERGS_SYSRET64 (or for
> SWAPGS_UNSAFE_STACK, for that matter).
> 
> Linus
> 



Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 18.03.2015 um 22:21 schrieb Andy Lutomirski:
> On Wed, Mar 18, 2015 at 2:12 PM, Stefan Seyfried
>  wrote:
>> Am 18.03.2015 um 21:51 schrieb Andy Lutomirski:
>>> On Wed, Mar 18, 2015 at 1:05 PM, Stefan Seyfried
>>>  wrote:
>>
>>>>> The relevant thread's stack is here (see ti in the trace):
>>>>>
>>>>> 8801013d4000
>>>>>
>>>>> It could be interesting to see what's there.
>>>>>
>>>>> I don't suppose you want to try to walk the paging structures to see
>>>>> if 88023bc8 (i.e. gsbase) and, more specifically,
>>>>> 88023bc8 + old_rsp and 88023bc8 + kernel_stack are
>>>>> present?  You'd only have to walk one level -- presumably, if the PGD
>>>>> entry is there, the rest of the entries are okay, too.
>>>>
>>>> That's all greek to me :-)
>>>>
>>>> I see that there is something at 88023bc8:
>>>>
>>>> crash> x /64xg 0x88023bc8
>>>> 0x88023bc8: 0x  0x
>>>> 0x88023bc80010: 0x  0x
>>>> 0x88023bc80020: 0x  0x6686ada9
>>>> 0x88023bc80030: 0x  0x
>>>> 0x88023bc80040: 0x  0x
>>>> [all zeroes]
>>>> 0x88023bc801f0: 0x  0x
>>>>
>>>> old_rsp and kernel_stack seem bogus:
>>>> crash> print old_rsp
>>>> Cannot access memory at address 0xa200
>>>> gdb: gdb request failed: print old_rsp
>>>> crash> print kernel_stack
>>>> Cannot access memory at address 0xaa48
>>>> gdb: gdb request failed: print kernel_stack
>>>>
>>>> kernel_stack is not a pointer? So 0x88023bc8 + 0xaa48 it is:
>>>
>>> Yup.  old_rsp and kernel_stack are offsets relative to gsbase.
>>>
>>>>
>>>> crash> x /64xg 0x88023bc8aa00
>>>> 0x88023bc8aa00: 0x  0x
>>>
>>> [...]
>>>
>>> I don't know enough about crashkernel to know whether the fact that
>>> this worked means anything.
>>
>> AFAIK this just means that the memory at this location is included in
>> the dump :-)
>>
>>> Can you dump the page of physical memory at 0x4779a067?  That's the PGD.
>>
>> Unfortunately not, this is a partial dump (I think the default config in
>> openSUSE, but I might have changed it some time ago) and the dump_level
>> is 31 which means that the following are excluded:
>>
>>       |      | cache | cache |      |
>>  dump | zero |without| with  | user | free
>> level | page |private|private| data | page
>> ------+------+-------+-------+------+------
>>    31 |  X   |   X   |   X   |  X   |  X
>>
>> so this:
>> crash> x /64xg 0x4779a067
>> 0x4779a067: Cannot access memory at address 0x4779a067
>> gdb: gdb request failed: x /64xg
>>
>> probably just means, that the PGD falls in one of the above excluded
>> categories.
> 
> I suspect that it actually means that gdb sees virtual addresses, not
> physical addresses.  But I screwed up completely -- "PGD" in the dump
> is the PGD *entry*, not the PGD pointer.

In crash, physical addresses usually work (it's a sophisticated wrapper
around gdb, AFAICT).
> 
> We could plausibly fish it out from current->mm, but that's a mess.

I'll come to that later.

> I don't suppose that "info registers" or "p/x $cr3" will show the cr3
> value?

No, that does not work from crash.

But current->mm is easy:
crash> task|grep mm
  start_comm =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  mm = 0x8800b8a9c040,
  active_mm = 0x8800b8a9c040,
  comm = "qemu-system-x86",

and (guessing the type :-)
crash> print *(struct mm_struct *)0x8800b8a9c040|grep pgd
  pgd = 0x880002d7e000,

But if that's correct, pgd contains all zeroes:
crash> print *(pgd_t *)0x880002d7e000
$15 = {
  pgd = 0
}
crash> x /16xg 0x880002d7e000
0x880002d7e000: 0x  0x
0x880002d7e010: 0x  0x
0x880002d7e020: 0x  0x
0x880002d7e030: 0x  0x
0x880002d7e040: 0x  0x
0x880002d7e050: 0x  0x
0x880002d7e060: 0x  0x
0x880002d7e070: 0x  0x
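One thing worth checking before concluding the PGD is empty: `x /16xg` only shows the first 16 entries (slots 0-15). With 4-level paging the PGD slot for an address is bits 47..39, and for the user stack in this crash that is slot 255, at byte offset 0x7f8 into the PGD page, so the zeroes above say nothing by themselves about that slot. A small C sketch of the index arithmetic, assuming 4 KiB pages and 9 index bits per level:

```c
#include <stdint.h>

/* x86-64 4-level paging, 4 KiB pages: 512 eight-byte entries per level. */
#define PT_INDEX_BITS 9
#define PT_INDEX_MASK ((1u << PT_INDEX_BITS) - 1)

static unsigned pgd_index(uint64_t va) { return (va >> 39) & PT_INDEX_MASK; }
static unsigned pud_index(uint64_t va) { return (va >> 30) & PT_INDEX_MASK; }
static unsigned pmd_index(uint64_t va) { return (va >> 21) & PT_INDEX_MASK; }
static unsigned pte_index(uint64_t va) { return (va >> 12) & PT_INDEX_MASK; }

/* Byte offset of va's entry inside the top-level (PGD) page. */
static unsigned pgd_entry_offset(uint64_t va) { return pgd_index(va) * 8u; }
```

For 0x7fffa55eafb8 and 0x7fffa55eb030 this gives the same PGD, PUD and PMD slots and adjacent PTE slots (0x1ea and 0x1eb), i.e. the two stack pages live in the same page table.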

> In any case, Denys is right -- my theory doesn't really hold water on
> non-SMAP systems.

Mine is definitely not new enough for this feature :)
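For anyone else wondering whether the SMAP variant could apply to their machine: the kernel reports SMAP support as the `smap` token in the flags line of /proc/cpuinfo, and it is absent from the Core 2 flags above. A minimal C sketch of a whole-word flag lookup; `has_cpu_flag()` is a made-up helper, and the caller is assumed to pass the text after "flags :".

```c
#include <stdbool.h>
#include <string.h>

/* True if `flag` occurs as a whole whitespace-separated word in `flags`. */
static bool has_cpu_flag(const char *flags, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = flags;

	if (n == 0)
		return false;
	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		char after = p[n];
		bool ends = after == '\0' || after == ' ' || after == '\t' ||
			    after == '\n';

		if (starts && ends)
			return true;
		p += 1;	/* partial match such as "sse4" in "sse4_1": keep looking */
	}
	return false;
}
```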

Maybe it would be more helpful if Takashi, who can reproduce this
more reliably than I can, took a crash dump, preferably with a lower
dump level, for us to investigate.
I have seen the bug two or three times in a week or two, which makes
waiting for it to happen a boring experience.

Best regards,

Stefan

-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski  wrote:
>>
>> crash> disassemble page_fault
>> Dump of assembler code for function page_fault:
>>    0x816834a0 <+0>:    data32 xchg %ax,%ax
>>    0x816834a3 <+3>:    data32 xchg %ax,%ax
>>    0x816834a6 <+6>:    data32 xchg %ax,%ax
>>    0x816834a9 <+9>:    sub    $0x78,%rsp
>>    0x816834ad <+13>:   callq  0x81683620
>
> The callq was the double-faulting instruction, and it is indeed the
> first function in here that would have accessed the stack.  (The sub
> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
> page fault, and the page fault is promoted to a double fault.  The
> surprising thing is that the page fault itself seems to have been
> delivered okay, and RSP wasn't on a page boundary.

Not at all surprising, and sure it was on a page boundary..

Look closer.

%rsp is 7fffa55eafb8.

But that's *after* page_fault has done that

    sub    $0x78,%rsp

so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
different page.
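The arithmetic, spelled out as a small C sketch with 4 KiB pages: the RSP in the dump is the value after page_fault's `sub $0x78,%rsp`, so adding 0x78 back gives RSP on entry to page_fault (just after the CPU pushed its exception frame), and the two values fall in different pages. The macro and function names are invented for illustration.

```c
#include <stdint.h>

#define PG_SIZE UINT64_C(4096)
#define PG_MASK (~(PG_SIZE - 1))

/* Values taken from the crash dump and the page_fault disassembly. */
#define RSP_IN_DUMP   UINT64_C(0x7fffa55eafb8)  /* after sub $0x78,%rsp */
#define PF_FRAME_SIZE UINT64_C(0x78)            /* the sub's immediate */

static uint64_t page_of(uint64_t addr)
{
	return addr & PG_MASK;
}

/* RSP on entry to page_fault, before the handler adjusted it. */
static uint64_t rsp_on_entry(void)
{
	return RSP_IN_DUMP + PF_FRAME_SIZE;
}
```

`rsp_on_entry()` is 0x7fffa55eb030 in page 0x7fffa55eb000, while the dumped RSP is in page 0x7fffa55ea000.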

And that page happened to be mapped.

So what happened is:

 - we somehow entered kernel mode without switching stacks

   (ie presumably syscall)

 - the user stack was still fine

 - we took a page fault, which once again didn't switch stacks,
because we were already in kernel mode. And this page fault worked,
because it just pushed the error code onto the user stack which was
mapped.

 - we now took a second page fault within the page fault handler,
because now the stack pointer has been decremented and points one user
page down that is *not* mapped, so now that page fault cannot push the
error code and return information.
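The sequence above can be replayed in a toy C model. Everything here is an assumption made for illustration: only the user stack page at 0x7fffa55eb000 is mapped, the starting RSP 0x7fffa55eb060 is picked so that the exception frame ends at 0x7fffa55eb030 as derived above, and delivery follows the 64-bit rules for a non-IST gate at CPL0 (no stack switch, RSP aligned down to 16 bytes, six quadwords pushed).

```c
#include <stdbool.h>
#include <stdint.h>

#define PG_MASK (~UINT64_C(0xfff))

/* Assumption: the only mapped page of the user stack. */
static const uint64_t mapped_page = UINT64_C(0x7fffa55eb000);

static bool is_mapped(uint64_t addr)
{
	return (addr & PG_MASK) == mapped_page;
}

/* Push n quadwords; false as soon as a store would hit an unmapped page. */
static bool push_qwords(uint64_t *rsp, int n)
{
	for (int i = 0; i < n; i++) {
		*rsp -= 8;
		if (!is_mapped(*rsp))
			return false;
	}
	return true;
}

/* Deliver #PF at CPL0 through a non-IST gate: no stack switch. */
static bool deliver_pf(uint64_t *rsp)
{
	*rsp &= ~UINT64_C(0xf);		/* align RSP down to 16 bytes */
	return push_qwords(rsp, 6);	/* SS, RSP, RFLAGS, CS, RIP, error code */
}

enum outcome { HANDLED, DOUBLE_FAULT };

/* First #PF taken while RSP still points at the user stack. */
static enum outcome replay(uint64_t rsp)
{
	if (!deliver_pf(&rsp))		/* #PF while delivering #PF */
		return DOUBLE_FAULT;
	rsp -= 0x78;			/* sub $0x78,%rsp: no memory access */

	uint64_t probe = rsp;
	if (push_qwords(&probe, 1))	/* callq pushes its return address */
		return HANDLED;

	/* callq faulted without changing RSP; the model stops after the
	 * nested #PF delivery. */
	return deliver_pf(&rsp) ? HANDLED : DOUBLE_FAULT;
}
```

With a starting RSP near the top of the mapped page the same model ends in HANDLED, which is one way to see how such a bug could stay latent: in this model it only escalates when the user RSP sits close to an unmapped page.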

Now, how we took that original page fault is sadly not very clear at
all.  I agree that it's something about system-call (how could we not
change stacks otherwise), but why it should have started now, I don't
know. I don't think "system_call" has changed at all.

Maybe there is something wrong with the new "ret_from_sys_call" logic,
and that "use sysret to return to user mode" thing. Because this code
sequence:

+   movq (RSP-RIP)(%rsp),%rsp
+   USERGS_SYSRET64

in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
kernel with a user stack pointer, maybe we're *exiting* the kernel,
and have just reloaded the user stack pointer when "USERGS_SYSRET64"
takes some fault.

Is PARAVIRT enabled? The three nop's at the beginning of 'page_fault'
makes me suspect it is,  and that that is some paravirt rewriting
area. What does paravirt do for that USERGS_SYSRET64 (or for
SWAPGS_UNSAFE_STACK, for that matter).

Linus


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 2:12 PM, Stefan Seyfried
 wrote:
> Am 18.03.2015 um 21:51 schrieb Andy Lutomirski:
>> On Wed, Mar 18, 2015 at 1:05 PM, Stefan Seyfried
>>  wrote:
>
>>>> The relevant thread's stack is here (see ti in the trace):
>>>>
>>>> 8801013d4000
>>>>
>>>> It could be interesting to see what's there.
>>>>
>>>> I don't suppose you want to try to walk the paging structures to see
>>>> if 88023bc8 (i.e. gsbase) and, more specifically,
>>>> 88023bc8 + old_rsp and 88023bc8 + kernel_stack are
>>>> present?  You'd only have to walk one level -- presumably, if the PGD
>>>> entry is there, the rest of the entries are okay, too.
>>>
>>> That's all greek to me :-)
>>>
>>> I see that there is something at 88023bc8:
>>>
>>> crash> x /64xg 0x88023bc8
>>> 0x88023bc8: 0x  0x
>>> 0x88023bc80010: 0x  0x
>>> 0x88023bc80020: 0x  0x6686ada9
>>> 0x88023bc80030: 0x  0x
>>> 0x88023bc80040: 0x  0x
>>> [all zeroes]
>>> 0x88023bc801f0: 0x  0x
>>>
>>> old_rsp and kernel_stack seem bogus:
>>> crash> print old_rsp
>>> Cannot access memory at address 0xa200
>>> gdb: gdb request failed: print old_rsp
>>> crash> print kernel_stack
>>> Cannot access memory at address 0xaa48
>>> gdb: gdb request failed: print kernel_stack
>>>
>>> kernel_stack is not a pointer? So 0x88023bc8 + 0xaa48 it is:
>>
>> Yup.  old_rsp and kernel_stack are offsets relative to gsbase.
>>
>>>
>>> crash> x /64xg 0x88023bc8aa00
>>> 0x88023bc8aa00: 0x  0x
>>
>> [...]
>>
>> I don't know enough about crashkernel to know whether the fact that
>> this worked means anything.
>
> AFAIK this just means that the memory at this location is included in
> the dump :-)
>
>> Can you dump the page of physical memory at 0x4779a067?  That's the PGD.
>
> Unfortunately not, this is a partial dump (I think the default config in
> openSUSE, but I might have changed it some time ago) and the dump_level
> is 31 which means that the following are excluded:
>
>       |      |cache  |cache  |      |
>  dump | zero |without|with   | user | free
> level | page |private|private| data | page
> ------+------+-------+-------+------+------
>    31 |  X   |   X   |   X   |  X   |  X
>
> so this:
> crash> x /64xg 0x4779a067
> 0x4779a067: Cannot access memory at address 0x4779a067
> gdb: gdb request failed: x /64xg
>
> probably just means that the PGD falls into one of the above excluded
> categories.

I suspect that it actually means that gdb sees virtual addresses, not
physical addresses.  But I screwed up completely -- "PGD" in the dump
is the PGD *entry*, not the PGD pointer.
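Andy's correction is easy to see once the entry bits are decoded. The sketch below assumes standard 4-level x86-64 paging and uses only the architectural bit layout: present, RW and user are in the low bits, and the next table's physical address is in bits 51:12. The function names are hypothetical. Applied to the `PGD 4779a067` value printed with the oops, it points at a next-level (PUD) table at physical 0x4779a000, so a physical read would start there rather than at 0x4779a067.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86-64 paging-structure entry. */
#define ENT_PRESENT   (1ULL << 0)
#define ENT_RW        (1ULL << 1)
#define ENT_USER      (1ULL << 2)
/* Bits 51:12 hold the physical address of the next-level table. */
#define ENT_ADDR_MASK 0x000ffffffffff000ULL

static bool ent_present(uint64_t e)
{
    return (e & ENT_PRESENT) != 0;
}

/* Physical address of the table this entry points to. */
static uint64_t ent_next_table(uint64_t e)
{
    assert(ent_present(e));           /* meaningless for a non-present entry */
    return e & ENT_ADDR_MASK;
}

/* Table index used at each level: 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE. */
static unsigned level_index(uint64_t va, unsigned level)
{
    assert(level <= 3);
    return (unsigned)((va >> (12 + 9 * level)) & 0x1ff);
}
```

The `PTE 0` at the end of the same oops line is simply an entry with the present bit clear.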

We could plausibly fish it out from current->mm, but that's a mess.  I
don't suppose that "info registers" or "p/x $cr3" will show the cr3
value?

In any case, Denys is right -- my theory doesn't really hold water on
non-SMAP systems.

--Andy

>
> Best regards,
>
> Stefan
> --
> Stefan Seyfried
> Linux Consultant & Developer -- GPG Key: 0x731B665B
>
> B1 Systems GmbH
> Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
> GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 2:06 PM, Denys Vlasenko  wrote:
> On 03/18/2015 09:49 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 1:06 PM, Denys Vlasenko  wrote:
>>> On 03/18/2015 08:26 PM, Andy Lutomirski wrote:
 Hi Linus-

 You seem to enjoy debugging these things.  Want to give this a shot?
 My guess is a vmalloc fault accessing either old_rsp or kernel_stack
 right after swapgs in syscall entry.
>>>
>>> The code is:
>>>
>>> ENTRY(system_call)
>>> SWAPGS_UNSAFE_STACK
>>> GLOBAL(system_call_after_swapgs)
>>> movq %rsp,PER_CPU_VAR(rsp_scratch)
>>> movq PER_CPU_VAR(kernel_stack),%rsp
>>>
>>> If PER_CPU_VAR(var) memory access can page fault
>>> (I was thinking this is ensured to never fault),
>>> then on these two instructions such a page fault
>>> will be fatal: we will still have userspace %rsp.
>>>
>>> I thought we can only get an NMI or debug interrupt here,
>>> and they are both set up to use IST stacks
>>> to prevent this scenario (among other reasons).
>>
>> I don't think that #DB is possible -- we should never have a
>> watchpoint on percpu memory like that (unless we're using kgdb, in
>> which case I think that kgdb should be fixed).
>
> And #DB shouldn't cause a problem even if it happens (it's on
> an IST stack).
>
> I was thinking about it more, and the thing is, the CPU did manage
> to enter the page fault handler.
>
> That means it managed to store the iret frame.
>
> This means that stores to (%rsp) worked, whatever %rsp is
> (even if it points to a user page).
>
> The double fault happened only when the CALL insn inside the handler
> attempted to push yet another word. _This_ is what did not work.
>
> Why?
>
> I am almost ready to declare that it's SMAP triggering:
> that attempts to access (write to) userspace were caught.
> However, disassembly shows
>
> crash> disassemble page_fault
> Dump of assembler code for function page_fault:
>0x816834a0 <+0>: data32 xchg %ax,%ax
>0x816834a3 <+3>: data32 xchg %ax,%ax
>0x816834a6 <+6>: data32 xchg %ax,%ax
>0x816834a9 <+9>: sub    $0x78,%rsp
>0x816834ad <+13>:callq  0x81683620 
> KABOOM HERE^^^
>0x816834b2 <+18>:mov    %rsp,%rdi
>0x816834b5 <+21>:mov    0x78(%rsp),%rsi
>0x816834ba <+26>:movq   $0x,0x78(%rsp)
>0x816834c3 <+35>:callq  0x810504e0 
>0x816834c8 <+40>:jmpq   0x816836d0 
> End of assembler dump.
>
> Those NOPs at the beginning are ASM_CLAC and PARAVIRT_ADJUST_EXCEPTION_FRAME
> from this source:
>
>
> .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> ENTRY(\sym)
> /* Sanity check */
> .if \shift_ist != -1 && \paranoid == 0
> .error "using shift_ist requires paranoid=1"
> .endif
>
> .if \has_error_code
> XCPT_FRAME
> .else
> INTR_FRAME
> .endif
>
> ASM_CLAC
> PARAVIRT_ADJUST_EXCEPTION_FRAME
>
> subq $ORIG_RAX-R15, %rsp
> call error_entry
> ...
>
> If ASM_CLAC is replaced by NOPs, this CPU must not be SMAP capable.
> If so, then another store to (%rsp) should have worked too...
>
>
> Stefan, Takashi - are you seeing this on SMAP-capable CPUs?

That's why I asked if this was Broadwell.  It's not :(

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Am 18.03.2015 um 21:51 schrieb Andy Lutomirski:
> On Wed, Mar 18, 2015 at 1:05 PM, Stefan Seyfried
>  wrote:

>>> The relevant thread's stack is here (see ti in the trace):
>>>
>>> 8801013d4000
>>>
>>> It could be interesting to see what's there.
>>>
>>> I don't suppose you want to try to walk the paging structures to see
>>> if 88023bc8 (i.e. gsbase) and, more specifically,
>>> 88023bc8 + old_rsp and 88023bc8 + kernel_stack are
>>> present?  You'd only have to walk one level -- presumably, if the PGD
>>> entry is there, the rest of the entries are okay, too.
>>
>> That's all greek to me :-)
>>
>> I see that there is something at 88023bc8:
>>
>> crash> x /64xg 0x88023bc8
>> 0x88023bc8: 0x  0x
>> 0x88023bc80010: 0x  0x
>> 0x88023bc80020: 0x  0x6686ada9
>> 0x88023bc80030: 0x  0x
>> 0x88023bc80040: 0x  0x
>> [all zeroes]
>> 0x88023bc801f0: 0x  0x
>>
>> old_rsp and kernel_stack seem bogus:
>> crash> print old_rsp
>> Cannot access memory at address 0xa200
>> gdb: gdb request failed: print old_rsp
>> crash> print kernel_stack
>> Cannot access memory at address 0xaa48
>> gdb: gdb request failed: print kernel_stack
>>
>> kernel_stack is not a pointer? So 0x88023bc8 + 0xaa48 it is:
> 
> Yup.  old_rsp and kernel_stack are offsets relative to gsbase.
> 
>>
>> crash> x /64xg 0x88023bc8aa00
>> 0x88023bc8aa00: 0x  0x
> 
> [...]
> 
> I don't know enough about crashkernel to know whether the fact that
> this worked means anything.

AFAIK this just means that the memory at this location is included in
the dump :-)
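Andy's point that `old_rsp` and `kernel_stack` are offsets relative to gsbase amounts to the arithmetic below. This is also why `print old_rsp` tried to read the tiny address 0xa200. It is a minimal sketch, and the gsbase value in the usage example is made up rather than taken from this dump.

```c
#include <assert.h>
#include <stdint.h>

/* A %gs-relative per-CPU access touches gsbase + the symbol's value,
 * because on this kernel the per-CPU symbols hold offsets, not addresses. */
static uint64_t percpu_addr(uint64_t gsbase, uint64_t sym_offset)
{
    return gsbase + sym_offset;
}

/* Round down to a power-of-two boundary, e.g. to pick where a dump starts. */
static uint64_t align_down(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

For example, with a hypothetical gsbase of 0x200000, `kernel_stack` (offset 0xaa48) would live at 0x20aa48. Rounding that down to 0x100 gives 0x20aa00 as a convenient place to start an `x /64xg` dump.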

> Can you dump the page of physical memory at 0x4779a067?  That's the PGD.

Unfortunately not, this is a partial dump (I think the default config in
openSUSE, but I might have changed it some time ago) and the dump_level
is 31 which means that the following are excluded:

      |      |cache  |cache  |      |
 dump | zero |without|with   | user | free
level | page |private|private| data | page
------+------+-------+-------+------+------
   31 |  X   |   X   |   X   |  X   |  X

so this:
crash> x /64xg 0x4779a067
0x4779a067: Cannot access memory at address 0x4779a067
gdb: gdb request failed: x /64xg

probably just means that the PGD falls into one of the above excluded
categories.
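dump_level is a bitmask in which each column of that table is one bit, so 31 (1 + 2 + 4 + 8 + 16) excludes all five page classes. Below is a small sketch of the decoding. The enum names are chosen here for illustration and are not taken from makedumpfile's source.

```c
#include <assert.h>
#include <stdbool.h>

/* makedumpfile -d bits, one per column of the table above. */
enum dump_excl {
    EXCL_ZERO_PAGE  = 1,
    EXCL_CACHE      = 2,   /* page cache without private */
    EXCL_CACHE_PRIV = 4,   /* page cache with private */
    EXCL_USER_DATA  = 8,
    EXCL_FREE_PAGE  = 16,
};

static bool dump_excludes(unsigned dump_level, enum dump_excl kind)
{
    assert(dump_level <= 31);        /* levels 0..31 cover these five bits */
    return (dump_level & (unsigned)kind) != 0;
}
```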

Best regards,

Stefan
-- 
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
On 03/18/2015 09:49 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 1:06 PM, Denys Vlasenko  wrote:
>> On 03/18/2015 08:26 PM, Andy Lutomirski wrote:
>>> Hi Linus-
>>>
>>> You seem to enjoy debugging these things.  Want to give this a shot?
>>> My guess is a vmalloc fault accessing either old_rsp or kernel_stack
>>> right after swapgs in syscall entry.
>>
>> The code is:
>>
>> ENTRY(system_call)
>> SWAPGS_UNSAFE_STACK
>> GLOBAL(system_call_after_swapgs)
>> movq %rsp,PER_CPU_VAR(rsp_scratch)
>> movq PER_CPU_VAR(kernel_stack),%rsp
>>
>> If PER_CPU_VAR(var) memory access can page fault
>> (I was thinking this is ensured to never fault),
>> then on these two instructions such a page fault
>> will be fatal: we will still have userspace %rsp.
>>
>> I thought we can only get an NMI or debug interrupt here,
>> and they are both set up to use IST stacks
>> to prevent this scenario (among other reasons).
> 
> I don't think that #DB is possible -- we should never have a
> watchpoint on percpu memory like that (unless we're using kgdb, in
> which case I think that kgdb should be fixed).

And #DB shouldn't cause a problem even if it happens (it's on
an IST stack).

I was thinking about it more, and the thing is, the CPU did manage
to enter the page fault handler.

That means it managed to store the iret frame.

This means that stores to (%rsp) worked, whatever %rsp is
(even if it points to a user page).

The double fault happened only when the CALL insn inside the handler
attempted to push yet another word. _This_ is what did not work.

Why?

I am almost ready to declare that it's SMAP triggering:
that attempts to access (write to) userspace were caught.
However, disassembly shows

crash> disassemble page_fault
Dump of assembler code for function page_fault:
   0x816834a0 <+0>: data32 xchg %ax,%ax
   0x816834a3 <+3>: data32 xchg %ax,%ax
   0x816834a6 <+6>: data32 xchg %ax,%ax
   0x816834a9 <+9>: sub    $0x78,%rsp
   0x816834ad <+13>:callq  0x81683620 
KABOOM HERE^^^
   0x816834b2 <+18>:mov    %rsp,%rdi
   0x816834b5 <+21>:mov    0x78(%rsp),%rsi
   0x816834ba <+26>:movq   $0x,0x78(%rsp)
   0x816834c3 <+35>:callq  0x810504e0 
   0x816834c8 <+40>:jmpq   0x816836d0 
End of assembler dump.

Those NOPs at the beginning are ASM_CLAC and PARAVIRT_ADJUST_EXCEPTION_FRAME
from this source:


.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
/* Sanity check */
.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif

.if \has_error_code
XCPT_FRAME
.else
INTR_FRAME
.endif

ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME

subq $ORIG_RAX-R15, %rsp
call error_entry
...

If ASM_CLAC is replaced by NOPs, this CPU must not be SMAP capable.
If so, then another store to (%rsp) should have worked too...


Stefan, Takashi - are you seeing this on SMAP-capable CPUs?
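Whether SMAP was actually in force can also be read from the `CR4: 000427e0` line of the oops. This assumes only the leading zeros of that value are missing, and those zeros do not affect any of the bits below. Per the Intel SDM, CR4.SMAP is bit 21, CR4.SMEP is bit 20 and CR4.VMXE is bit 13. The helper below is a hypothetical sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR4 bit positions (Intel SDM, control registers). */
enum { CR4_VMXE = 13, CR4_SMEP = 20, CR4_SMAP = 21 };

static bool cr4_bit(uint64_t cr4, unsigned bit)
{
    assert(bit < 64);
    return ((cr4 >> bit) & 1) != 0;
}
```

For 0x427e0, both SMEP and SMAP are clear, so the kernel was not running with SMAP enabled. VMXE is set, which fits a machine that was running KVM guests at the time.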


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 1:05 PM, Stefan Seyfried
 wrote:
> Hi Andy,
>
> Am 18.03.2015 um 20:26 schrieb Andy Lutomirski:
>> Hi Linus-
>>
>> You seem to enjoy debugging these things.  Want to give this a shot?
>> My guess is a vmalloc fault accessing either old_rsp or kernel_stack
>> right after swapgs in syscall entry.
>>
>> On Wed, Mar 18, 2015 at 12:03 PM, Stefan Seyfried
>>  wrote:
>>> Hi all,
>>>
>>> first, I'm kind of happy that I'm not the only one seeing this, and
>>> thus my beloved Thinkpad can stay for a bit longer... :-)
>>>
>>> Then, I'm mostly an amateur when it comes to kernel debugging, so bear
>>> with me when I'm stumbling through the code...
>>>
>>> Am 18.03.2015 um 19:03 schrieb Andy Lutomirski:
 On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai  wrote:
> At Wed, 18 Mar 2015 18:43:52 +0100,
> Takashi Iwai wrote:
>>
>> At Wed, 18 Mar 2015 15:16:42 +0100,
>> Takashi Iwai wrote:
>>>
>>> At Sun, 15 Mar 2015 09:17:15 +0100,
>>> Stefan Seyfried wrote:

 Hi all,

 in 4.0-rc I have recently seen a few crashes, always when running
 KVM guests (IIRC). Today I was able to capture a crash dump, this
 is the backtrace from dmesg.txt:

 [242060.604870] PANIC: double fault, error_code: 0x0

 OK, we double faulted.  Too bad that x86 CPUs don't tell us why.

 [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G  
   W   4.0.0-rc3-2.gd5c547f-desktop #1
 [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
 (3.21 ) 12/13/2011
 [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
 8801013d4000
 [242060.604885] RIP: 0010:[]  [] 
 page_fault+0xd/0x30

 The double fault happened during page fault processing.  Could you
 disassemble your page_fault function to find the offending
 instruction?
>>>
>>> This one is easy:
>>>
>>> crash> disassemble page_fault
>>> Dump of assembler code for function page_fault:
>>>0x816834a0 <+0>: data32 xchg %ax,%ax
>>>0x816834a3 <+3>: data32 xchg %ax,%ax
>>>0x816834a6 <+6>: data32 xchg %ax,%ax
>>>0x816834a9 <+9>: sub    $0x78,%rsp
>>>0x816834ad <+13>:callq  0x81683620 
>>
>> The callq was the double-faulting instruction, and it is indeed the
>> first function in here that would have accessed the stack.  (The sub
>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>> page fault, and the page fault is promoted to a double fault.  The
>> surprising thing is that the page fault itself seems to have been
>> delivered okay, and RSP wasn't on a page boundary.
>>
>> You wouldn't happen to be using a Broadwell machine?
>
> No, this is quite an old ThinkPad X200s, Core 2 Duo
> processor   : 1
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 23
> model name  : Intel(R) Core(TM)2 Duo CPU L9400  @ 1.86GHz
> stepping: 10
> microcode   : 0xa0c
>
>> The only way to get here with bogus RSP is if we interrupted something
>> that was previously running at CPL0 with similarly bogus RSP.
>>
>> I don't know if I trust CR2.  It's 16 bytes lower than I'd expect.
>>
>>>0x816834b2 <+18>:mov    %rsp,%rdi
>>>0x816834b5 <+21>:mov    0x78(%rsp),%rsi
>>>0x816834ba <+26>:movq   $0x,0x78(%rsp)
>>>0x816834c3 <+35>:callq  0x810504e0 
>>>0x816834c8 <+40>:jmpq   0x816836d0 
>>> End of assembler dump.
>>>
>>>
 [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016

 Uh, what?  That RSP is a user address.

 [242060.604895] RAX: aa40 RBX: 0001 RCX: 
 81682237
 [242060.604896] RDX: aa40 RSI:  RDI: 
 7fffa55eb078
 [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
 
 [242060.604900] R10:  R11: 0293 R12: 
 004a
 [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
 7ffa3556cf20
 [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
 knlGS:
 [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
 [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
 000427e0
 [242060.604909] Stack:
 [242060.604942] BUG: unable to handle kernel paging request at 
 7fffa55eafb8
 [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190

 This is suspicious.  We need to have died, again, of a fatal page
 fault while dumping the stack.
>>>
>>> I posted the same problem to the opensuse kernel list shortly before turning
>>> to LKML. There, 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 1:06 PM, Denys Vlasenko  wrote:
> On 03/18/2015 08:26 PM, Andy Lutomirski wrote:
>> Hi Linus-
>>
>> You seem to enjoy debugging these things.  Want to give this a shot?
>> My guess is a vmalloc fault accessing either old_rsp or kernel_stack
>> right after swapgs in syscall entry.
>
> The code is:
>
> ENTRY(system_call)
> SWAPGS_UNSAFE_STACK
> GLOBAL(system_call_after_swapgs)
> movq %rsp,PER_CPU_VAR(rsp_scratch)
> movq PER_CPU_VAR(kernel_stack),%rsp
>
> If PER_CPU_VAR(var) memory access can page fault
> (I was thinking this is ensured to never fault),
> then on these two instructions such a page fault
> will be fatal: we will still have userspace %rsp.
>
> I thought we can only get an NMI or debug interrupt here,
> and they are both set up to use IST stacks
> to prevent this scenario (among other reasons).

I don't think that #DB is possible -- we should never have a
watchpoint on percpu memory like that (unless we're using kgdb, in
which case I think that kgdb should be fixed).

On the other hand, we can and do take page faults on percpu memory,
because percpu lives in vmap space and we lazily populate PGD entries
in per-mm PGDs.  (That is, when we allocate a kernel PGD entry, we
populate it in init_mm's pgd, but we don't proactively copy it during
context switches.)
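The lazy population Andy describes can be modeled in a few lines. This is a hypothetical, heavily simplified sketch of the idea behind the vmalloc fault he mentions, not the kernel's code. Kernel PGD entries are authoritative in a reference table, which stands in for init_mm's pgd. They are copied into a process's table only when a fault shows that they are missing there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTRS_PER_PGD 512

/* Hypothetical model of a top-level page table. */
struct pgd { uint64_t entry[PTRS_PER_PGD]; };

static bool present(uint64_t e) { return (e & 1) != 0; }

/* Returns true if the fault was fixed by copying the reference entry. */
static bool sync_kernel_pgd_entry(struct pgd *mm, const struct pgd *ref,
                                  unsigned idx)
{
    assert(idx < PTRS_PER_PGD);
    if (present(mm->entry[idx]))
        return false;      /* already populated: not a lazy-sync fault */
    if (!present(ref->entry[idx]))
        return false;      /* not mapped in the reference either: a real bad access */
    mm->entry[idx] = ref->entry[idx];
    return true;
}
```

In the scenario under discussion, such a fault would have to be taken in the two-instruction window after swapgs, when %rsp still holds the user value.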

But the affected system is a laptop, so there shouldn't be CPU hotplug
or enough memory for this to happen.  Confused.

--Andy


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Denys Vlasenko
On 03/18/2015 08:26 PM, Andy Lutomirski wrote:
> Hi Linus-
> 
> You seem to enjoy debugging these things.  Want to give this a shot?
> My guess is a vmalloc fault accessing either old_rsp or kernel_stack
> right after swapgs in syscall entry.

The code is:

ENTRY(system_call)
SWAPGS_UNSAFE_STACK
GLOBAL(system_call_after_swapgs)
movq %rsp,PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(kernel_stack),%rsp

If PER_CPU_VAR(var) memory access can page fault
(I was thinking this is ensured to never fault),
then on these two instructions such a page fault
will be fatal: we will still have userspace %rsp.

I thought we can only get an NMI or debug interrupt here,
and they are both set up to use IST stacks
to prevent this scenario (among other reasons).
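Denys's reasoning rests on how the CPU picks the stack for an exception frame in 64-bit mode. The sketch below simplifies those rules and assumes the handler runs at CPL0, as Linux's handlers do. A gate with a nonzero IST index always switches to that IST stack. Otherwise, a privilege change loads RSP0 from the TSS. Otherwise, the frame goes wherever RSP already points. The enum and function names are illustrative.

```c
#include <assert.h>

/* Where the CPU pushes an exception frame (simplified 64-bit rules,
 * assuming the handler runs at CPL0). */
enum frame_stack { STACK_IST, STACK_TSS_RSP0, STACK_CURRENT_RSP };

static enum frame_stack pick_stack(unsigned ist_index, unsigned cpl)
{
    assert(ist_index <= 7 && cpl <= 3);
    if (ist_index != 0)
        return STACK_IST;          /* gate names an IST slot: always switch */
    if (cpl != 0)
        return STACK_TSS_RSP0;     /* privilege change: load RSP0 from the TSS */
    return STACK_CURRENT_RSP;      /* already at CPL0: push at the current RSP */
}
```

SYSCALL does not switch stacks, so the CPU is at CPL0 with a user %rsp until `movq PER_CPU_VAR(kernel_stack),%rsp` runs. A page fault in that window falls into the last case and is pushed onto the user stack. NMI and #DB use IST, per the message above, so they are not affected.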

-- 
vda


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Hi Andy,

Am 18.03.2015 um 20:26 schrieb Andy Lutomirski:
> Hi Linus-
> 
> You seem to enjoy debugging these things.  Want to give this a shot?
> My guess is a vmalloc fault accessing either old_rsp or kernel_stack
> right after swapgs in syscall entry.
> 
> On Wed, Mar 18, 2015 at 12:03 PM, Stefan Seyfried
>  wrote:
>> Hi all,
>>
>> first, I'm kind of happy that I'm not the only one seeing this, and
>> thus my beloved Thinkpad can stay for a bit longer... :-)
>>
>> Then, I'm mostly an amateur when it comes to kernel debugging, so bear
>> with me when I'm stumbling through the code...
>>
>> Am 18.03.2015 um 19:03 schrieb Andy Lutomirski:
>>> On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai  wrote:
 At Wed, 18 Mar 2015 18:43:52 +0100,
 Takashi Iwai wrote:
>
> At Wed, 18 Mar 2015 15:16:42 +0100,
> Takashi Iwai wrote:
>>
>> At Sun, 15 Mar 2015 09:17:15 +0100,
>> Stefan Seyfried wrote:
>>>
>>> Hi all,
>>>
>>> in 4.0-rc I have recently seen a few crashes, always when running
>>> KVM guests (IIRC). Today I was able to capture a crash dump, this
>>> is the backtrace from dmesg.txt:
>>>
>>> [242060.604870] PANIC: double fault, error_code: 0x0
>>>
>>> OK, we double faulted.  Too bad that x86 CPUs don't tell us why.
>>>
>>> [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G   
>>>  W   4.0.0-rc3-2.gd5c547f-desktop #1
>>> [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
>>> (3.21 ) 12/13/2011
>>> [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
>>> 8801013d4000
>>> [242060.604885] RIP: 0010:[]  [] 
>>> page_fault+0xd/0x30
>>>
>>> The double fault happened during page fault processing.  Could you
>>> disassemble your page_fault function to find the offending
>>> instruction?
>>
>> This one is easy:
>>
>> crash> disassemble page_fault
>> Dump of assembler code for function page_fault:
>>0x816834a0 <+0>: data32 xchg %ax,%ax
>>0x816834a3 <+3>: data32 xchg %ax,%ax
>>0x816834a6 <+6>: data32 xchg %ax,%ax
>>0x816834a9 <+9>: sub    $0x78,%rsp
>>0x816834ad <+13>:callq  0x81683620 
> 
> The callq was the double-faulting instruction, and it is indeed the
> first function in here that would have accessed the stack.  (The sub
> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
> page fault, and the page fault is promoted to a double fault.  The
> surprising thing is that the page fault itself seems to have been
> delivered okay, and RSP wasn't on a page boundary.
> 
> You wouldn't happen to be using a Broadwell machine?

No, this is quite an old ThinkPad X200s, Core 2 Duo
processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Duo CPU L9400  @ 1.86GHz
stepping: 10
microcode   : 0xa0c

> The only way to get here with bogus RSP is if we interrupted something
> that was previously running at CPL0 with similarly bogus RSP.
> 
> I don't know if I trust CR2.  It's 16 bytes lower than I'd expect.
> 
>>0x816834b2 <+18>:mov    %rsp,%rdi
>>0x816834b5 <+21>:mov    0x78(%rsp),%rsi
>>0x816834ba <+26>:movq   $0x,0x78(%rsp)
>>0x816834c3 <+35>:callq  0x810504e0 
>>0x816834c8 <+40>:jmpq   0x816836d0 
>> End of assembler dump.
>>
>>
>>> [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
>>>
>>> Uh, what?  That RSP is a user address.
>>>
>>> [242060.604895] RAX: aa40 RBX: 0001 RCX: 
>>> 81682237
>>> [242060.604896] RDX: aa40 RSI:  RDI: 
>>> 7fffa55eb078
>>> [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
>>> 
>>> [242060.604900] R10:  R11: 0293 R12: 
>>> 004a
>>> [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
>>> 7ffa3556cf20
>>> [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
>>> knlGS:
>>> [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
>>> [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
>>> 000427e0
>>> [242060.604909] Stack:
>>> [242060.604942] BUG: unable to handle kernel paging request at 
>>> 7fffa55eafb8
>>> [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
>>>
>>> This is suspicious.  We need to have died, again, of a fatal page
>>> fault while dumping the stack.
>>
>> I posted the same problem to the opensuse kernel list shortly before turning
>> to LKML. There, Michal Kubecek noted:
>>
>> "I encountered a similar problem recently. The thing is, x86
>> specification says that on a double fault, RIP and RSP registers are
>> undefined, i.e. you 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
Hi Linus-

You seem to enjoy debugging these things.  Want to give this a shot?
My guess is a vmalloc fault accessing either old_rsp or kernel_stack
right after swapgs in syscall entry.

On Wed, Mar 18, 2015 at 12:03 PM, Stefan Seyfried
 wrote:
> Hi all,
>
> first, I'm kind of happy that I'm not the only one seeing this, and
> thus my beloved Thinkpad can stay for a bit longer... :-)
>
> Then, I'm mostly an amateur when it comes to kernel debugging, so bear
> with me when I'm stumbling through the code...
>
> Am 18.03.2015 um 19:03 schrieb Andy Lutomirski:
>> On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai  wrote:
>>> At Wed, 18 Mar 2015 18:43:52 +0100,
>>> Takashi Iwai wrote:

 At Wed, 18 Mar 2015 15:16:42 +0100,
 Takashi Iwai wrote:
>
> At Sun, 15 Mar 2015 09:17:15 +0100,
> Stefan Seyfried wrote:
>>
>> Hi all,
>>
>> in 4.0-rc I have recently seen a few crashes, always when running
>> KVM guests (IIRC). Today I was able to capture a crash dump, this
>> is the backtrace from dmesg.txt:
>>
>> [242060.604870] PANIC: double fault, error_code: 0x0
>>
>> OK, we double faulted.  Too bad that x86 CPUs don't tell us why.
>>
>> [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
>> W   4.0.0-rc3-2.gd5c547f-desktop #1
>> [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
>> (3.21 ) 12/13/2011
>> [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
>> 8801013d4000
>> [242060.604885] RIP: 0010:[]  [] 
>> page_fault+0xd/0x30
>>
>> The double fault happened during page fault processing.  Could you
>> disassemble your page_fault function to find the offending
>> instruction?
>
> This one is easy:
>
> crash> disassemble page_fault
> Dump of assembler code for function page_fault:
>0x816834a0 <+0>: data32 xchg %ax,%ax
>0x816834a3 <+3>: data32 xchg %ax,%ax
>0x816834a6 <+6>: data32 xchg %ax,%ax
>0x816834a9 <+9>: sub    $0x78,%rsp
>0x816834ad <+13>:callq  0x81683620 

The callq was the double-faulting instruction, and it is indeed the
first function in here that would have accessed the stack.  (The sub
*changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
page fault, and the page fault is promoted to a double fault.  The
surprising thing is that the page fault itself seems to have been
delivered okay, and RSP wasn't on a page boundary.

You wouldn't happen to be using a Broadwell machine?

The only way to get here with bogus RSP is if we interrupted something
that was previously running at CPL0 with similarly bogus RSP.

I don't know if I trust CR2.  It's 16 bytes lower than I'd expect.

>0x816834b2 <+18>:mov    %rsp,%rdi
>0x816834b5 <+21>:mov    0x78(%rsp),%rsi
>0x816834ba <+26>:movq   $0x,0x78(%rsp)
>0x816834c3 <+35>:callq  0x810504e0 
>0x816834c8 <+40>:jmpq   0x816836d0 
> End of assembler dump.
>
>
>> [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
>>
>> Uh, what?  That RSP is a user address.
>>
>> [242060.604895] RAX: aa40 RBX: 0001 RCX: 
>> 81682237
>> [242060.604896] RDX: aa40 RSI:  RDI: 
>> 7fffa55eb078
>> [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
>> 
>> [242060.604900] R10:  R11: 0293 R12: 
>> 004a
>> [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
>> 7ffa3556cf20
>> [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
>> knlGS:
>> [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
>> [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
>> 000427e0
>> [242060.604909] Stack:
>> [242060.604942] BUG: unable to handle kernel paging request at 
>> 7fffa55eafb8
>> [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
>>
>> This is suspicious.  We need to have died, again, of a fatal page
>> fault while dumping the stack.
>
> I posted the same problem to the opensuse kernel list shortly before turning
> to LKML. There, Michal Kubecek noted:
>
> "I encountered a similar problem recently. The thing is, x86
> specification says that on a double fault, RIP and RSP registers are
> undefined, i.e. you not only can't expect them to contain values
> corresponding to the first or second fault but you can't even expect
> them to have any usable values at all. Unfortunately the kernel double
> fault handler doesn't take this into account and does try to display
> usual crash related information so that it itself does usually crash
> when trying to show stack content (that's the show_stack_log_lvl()
> crash).

I think that's not entirely true.  

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Stefan Seyfried
Hi all,

first, I'm kind of happy that I'm not the only one seeing this, and
thus my beloved Thinkpad can stay for a bit longer... :-)

Then, I'm mostly an amateur when it comes to kernel debugging, so bear
with me when I'm stumbling through the code...

Am 18.03.2015 um 19:03 schrieb Andy Lutomirski:
> On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai  wrote:
>> At Wed, 18 Mar 2015 18:43:52 +0100,
>> Takashi Iwai wrote:
>>>
>>> At Wed, 18 Mar 2015 15:16:42 +0100,
>>> Takashi Iwai wrote:

 At Sun, 15 Mar 2015 09:17:15 +0100,
 Stefan Seyfried wrote:
>
> Hi all,
>
> in 4.0-rc I have recently seen a few crashes, always when running
> KVM guests (IIRC). Today I was able to capture a crash dump, this
> is the backtrace from dmesg.txt:
>
> [242060.604870] PANIC: double fault, error_code: 0x0
> 
> OK, we double faulted.  Too bad that x86 CPUs don't tell us why.
> 
> [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
> W   4.0.0-rc3-2.gd5c547f-desktop #1
> [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
> (3.21 ) 12/13/2011
> [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
> 8801013d4000
> [242060.604885] RIP: 0010:[]  [] 
> page_fault+0xd/0x30
> 
> The double fault happened during page fault processing.  Could you
> disassemble your page_fault function to find the offending
> instruction?

This one is easy:

crash> disassemble page_fault
Dump of assembler code for function page_fault:
   0x816834a0 <+0>: data32 xchg %ax,%ax
   0x816834a3 <+3>: data32 xchg %ax,%ax
   0x816834a6 <+6>: data32 xchg %ax,%ax
   0x816834a9 <+9>: sub    $0x78,%rsp
   0x816834ad <+13>:callq  0x81683620 
   0x816834b2 <+18>:mov    %rsp,%rdi
   0x816834b5 <+21>:mov    0x78(%rsp),%rsi
   0x816834ba <+26>:movq   $0x,0x78(%rsp)
   0x816834c3 <+35>:callq  0x810504e0 
   0x816834c8 <+40>:jmpq   0x816836d0 
End of assembler dump.


> [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
> 
> Uh, what?  That RSP is a user address.
> 
> [242060.604895] RAX: aa40 RBX: 0001 RCX: 
> 81682237
> [242060.604896] RDX: aa40 RSI:  RDI: 
> 7fffa55eb078
> [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
> 
> [242060.604900] R10:  R11: 0293 R12: 
> 004a
> [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
> 7ffa3556cf20
> [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
> knlGS:
> [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
> [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
> 000427e0
> [242060.604909] Stack:
> [242060.604942] BUG: unable to handle kernel paging request at 
> 7fffa55eafb8
> [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
> 
> This is suspicious.  We need to have died, again, of a fatal page
> fault while dumping the stack.

I posted the same problem to the opensuse kernel list shortly before turning
to LKML. There, Michal Kubecek noted:

"I encountered a similar problem recently. The thing is, x86
specification says that on a double fault, RIP and RSP registers are
undefined, i.e. you not only can't expect them to contain values
corresponding to the first or second fault but you can't even expect
them to have any usable values at all. Unfortunately the kernel double
fault handler doesn't take this into account and does try to display
usual crash related information so that it itself does usually crash
when trying to show stack content (that's the show_stack_log_lvl()
crash).

The result is a double fault (which itself would be very hard to debug)
followed by a crash in its handler so that analysing the outcome is
extremely difficult."

I cannot judge if this is true, but it sounded related to solving the
problem to me.

> [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
> [242060.605078] Oops:  [#1] PREEMPT SMP
> [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
> nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
> sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
> ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
> algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
> stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
> ip_tables x_tables af_packet bnep dm_crypt ecb 
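The page-table walk printed with the second oops (`PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0`) can be read bit by bit. Below is a minimal sketch, assuming x86-64 4-level paging and only the low flag bits described in the Intel SDM; `show_entry()` and `first_missing_level()` are hypothetical helper names. The three upper entries all carry flag bits 0x067 (present, writable, user, accessed, dirty), while the PTE is zero, so the walk stops at the last level: the user address 7fffa55eafb8 was simply not mapped when the dump code touched it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Low flag bits of an x86-64 paging-structure entry. */
#define PTE_PRESENT  (1ULL << 0)  /* P:   entry is valid */
#define PTE_RW       (1ULL << 1)  /* R/W: writes allowed */
#define PTE_USER     (1ULL << 2)  /* U/S: user-mode access allowed */
#define PTE_ACCESSED (1ULL << 5)  /* A:   set by the CPU on access */
#define PTE_DIRTY    (1ULL << 6)  /* D:   set by the CPU on write (meaningful in leaf entries) */

static bool entry_present(uint64_t e)
{
    return (e & PTE_PRESENT) != 0;
}

/* Print one level of the walk, e.g. "PGD 0x4779a067: P RW U A D". */
static void show_entry(const char *level, uint64_t e)
{
    assert(level != NULL);
    if (!entry_present(e)) {
        printf("%s %#llx: not present\n", level, (unsigned long long)e);
        return;
    }
    printf("%s %#llx: P%s%s%s%s\n", level, (unsigned long long)e,
           (e & PTE_RW)       ? " RW" : " RO",
           (e & PTE_USER)     ? " U"  : " S",
           (e & PTE_ACCESSED) ? " A"  : "",
           (e & PTE_DIRTY)    ? " D"  : "");
}

/* Index of the first non-present level (0 = PGD .. 3 = PTE), or 4 if all are present. */
static int first_missing_level(const uint64_t walk[4])
{
    for (int i = 0; i < 4; i++)
        if (!entry_present(walk[i]))
            return i;
    return 4;
}
```

Fed the four values from the log in order, `first_missing_level()` returns 3, i.e. the PTE.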

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai  wrote:
> At Wed, 18 Mar 2015 18:43:52 +0100,
> Takashi Iwai wrote:
>>
>> At Wed, 18 Mar 2015 15:16:42 +0100,
>> Takashi Iwai wrote:
>> >
>> > At Sun, 15 Mar 2015 09:17:15 +0100,
>> > Stefan Seyfried wrote:
>> > >
>> > > Hi all,
>> > >
>> > > in 4.0-rc I have recently seen a few crashes, always when running
>> > > KVM guests (IIRC). Today I was able to capture a crash dump, this
>> > > is the backtrace from dmesg.txt:
>> > >
>> > > [242060.604870] PANIC: double fault, error_code: 0x0

OK, we double faulted.  Too bad that x86 CPUs don't tell us why.

>> > > [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
>> > > W   4.0.0-rc3-2.gd5c547f-desktop #1
>> > > [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
>> > > (3.21 ) 12/13/2011
>> > > [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
>> > > 8801013d4000
>> > > [242060.604885] RIP: 0010:[]  [] 
>> > > page_fault+0xd/0x30

The double fault happened during page fault processing.  Could you
disassemble your page_fault function to find the offending
instruction?
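Locations such as `page_fault+0xd/0x30` read as symbol name, offset of RIP from the start of the symbol, and symbol size. A minimal sketch of splitting that string, with hypothetical names (`struct sym_loc`, `parse_sym_loc()`), shows that the double fault was reported 13 bytes into a 48-byte function, which is the spot to look at when disassembling `page_fault`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* A "symbol+0xoff/0xlen" location as printed in kernel oopses. */
struct sym_loc {
    char name[64];
    unsigned long off;   /* byte offset of RIP from the start of the symbol */
    unsigned long len;   /* size of the symbol in bytes */
};

/* Parse e.g. "page_fault+0xd/0x30"; returns false on malformed input. */
static bool parse_sym_loc(const char *s, struct sym_loc *out)
{
    assert(s != NULL && out != NULL);
    if (strchr(s, '+') == NULL)          /* no offset part at all */
        return false;

    int n = 0;
    /* %lx accepts an optional 0x prefix, as strtoul() with base 16 does. */
    if (sscanf(s, "%63[^+]+%lx/%lx%n", out->name, &out->off, &out->len, &n) != 3)
        return false;

    /* Reject trailing junk and offsets that fall outside the symbol. */
    return s[n] == '\0' && out->off < out->len;
}
```

With a vmlinux matching the running kernel, gdb's `disassemble page_fault` should then show which instruction begins at offset 13.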

>> > > [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016

Uh, what?  That RSP is a user address.
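Whether a value is a user or a kernel address follows from the x86-64 canonical-address rule. A minimal sketch, assuming 48-bit virtual addresses (4-level paging) and with illustrative helper names: the archive strips leading zeros from the register dump, but that does not change the value 0x7fffa55eafb8, which lies in the lower (user) half. The same arithmetic shows that CR2 in the double-fault dump, 7fffa55eafa8, sits 0x10 bytes below that RSP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, 48-bit virtual addresses sign-extended to 64 bits. */
#define VA_BITS 48

/* Canonical iff bits 63..47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1);            /* the top 17 bits */
    return top == 0 || top == (UINT64_MAX >> (VA_BITS - 1));
}

/* Lower canonical half, where Linux places user space. */
static bool in_user_half(uint64_t addr)
{
    return addr < (1ULL << (VA_BITS - 1));
}

/* Distance from a faulting address (CR2) up to a stack pointer. */
static uint64_t bytes_below_sp(uint64_t sp, uint64_t cr2)
{
    assert(cr2 <= sp);   /* only meaningful for an access below SP */
    return sp - cr2;
}
```

An illustrative kernel-half address such as 0xffff880000000000 is canonical but fails `in_user_half()`, whereas 0x0000800000000000 is not canonical at all.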

>> > > [242060.604895] RAX: aa40 RBX: 0001 RCX: 
>> > > 81682237
>> > > [242060.604896] RDX: aa40 RSI:  RDI: 
>> > > 7fffa55eb078
>> > > [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
>> > > 
>> > > [242060.604900] R10:  R11: 0293 R12: 
>> > > 004a
>> > > [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
>> > > 7ffa3556cf20
>> > > [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
>> > > knlGS:
>> > > [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
>> > > [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
>> > > 000427e0
>> > > [242060.604909] Stack:
>> > > [242060.604942] BUG: unable to handle kernel paging request at 
>> > > 7fffa55eafb8
>> > > [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190

This is suspicious.  We must have died, again, of a fatal page
fault while dumping the stack.
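One way a dumper could avoid dying a second time is to refuse to walk a stack pointer unless it falls inside a stack it knows about. The following is a hypothetical sketch, not the kernel's code: `struct stack_range`, `sp_in_known_stack()` and `dump_stack_words()` are illustrative names, and in a kernel the known ranges would presumably be the task stack plus the per-CPU interrupt and exception stacks. Note that the saved CS here was 0010, a kernel segment, so a check on CS alone would not have caught the user-space RSP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one stack the dumper trusts. */
struct stack_range {
    uintptr_t lo;   /* lowest valid address */
    uintptr_t hi;   /* one past the highest valid address */
};

/* True if [sp, sp + nbytes) lies entirely inside one known stack. */
static bool sp_in_known_stack(uintptr_t sp, size_t nbytes,
                              const struct stack_range *r, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        assert(r[i].lo < r[i].hi);
        /* Check sp first so that hi - sp cannot wrap around. */
        if (sp >= r[i].lo && sp <= r[i].hi && nbytes <= r[i].hi - sp)
            return true;
    }
    return false;
}

/* Print up to nwords stack words, dereferencing only after the range check. */
static size_t dump_stack_words(const uint64_t *sp, size_t nwords,
                               const struct stack_range *r, size_t nr)
{
    if (!sp_in_known_stack((uintptr_t)sp, nwords * sizeof(*sp), r, nr)) {
        printf(" <stack pointer %#llx outside known stacks, not dumped>\n",
               (unsigned long long)(uintptr_t)sp);
        return 0;
    }
    for (size_t i = 0; i < nwords; i++)
        printf(" %016llx", (unsigned long long)sp[i]);
    printf("\n");
    return nwords;
}
```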

>> > > [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
>> > > [242060.605078] Oops:  [#1] PREEMPT SMP
>> > > [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
>> > > nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
>> > > sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
>> > > ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
>> > > algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
>> > > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
>> > > nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
>> > > stp llc ebtable_filter ebtables ip6table_filter ip6_tables 
>> > > iptable_filter ip_tables x_tables af_packet bnep dm_crypt ecb cbc 
>> > > algif_skcipher af_alg xfs libcrc32c snd_hda_codec_conexant 
>> > > snd_hda_codec_generic iTCO_wdt iTCO_vendor_support snd_hda_intel 
>> > > snd_hda_controller snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm
>> > > [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp 
>> > > kvm_intel kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 
>> > > uvcvideo videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core 
>> > > btusb v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 
>> > > pcspkr e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei 
>> > > cfg80211 shpchp wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq 
>> > > i915 xhci_pci xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video 
>> > > button processor sg loop
>> > > [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
>> > > W   4.0.0-rc3-2.gd5c547f-desktop #1
>> > > [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
>> > > (3.21 ) 12/13/2011
>> > > [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
>> > > 8801013d4000
>> > > [242060.605396] RIP: 0010:[]  [] 
>> > > show_stack_log_lvl+0x124/0x190
>> > > [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
>> > > [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
>> > > 88023bc7ffc0
>> > > [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
>> > > 
>> > > [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
>> > > 0020
>> > > [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
>> > > 88023bc84f58
>> > > [242060.605396] R13:  R14: 81a2fe15 R15: 
>> > > 
>> > > [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
>> > > knlGS:
>> > > 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Wed, 18 Mar 2015 18:43:52 +0100,
Takashi Iwai wrote:
> 
> At Wed, 18 Mar 2015 15:16:42 +0100,
> Takashi Iwai wrote:
> > 
> > At Sun, 15 Mar 2015 09:17:15 +0100,
> > Stefan Seyfried wrote:
> > > 
> > > Hi all,
> > > 
> > > in 4.0-rc I have recently seen a few crashes, always when running
> > > KVM guests (IIRC). Today I was able to capture a crash dump, this
> > > is the backtrace from dmesg.txt:
> > > 
> > > [242060.604870] PANIC: double fault, error_code: 0x0
> > > [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
> > > W   4.0.0-rc3-2.gd5c547f-desktop #1
> > > [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
> > > (3.21 ) 12/13/2011
> > > [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
> > > 8801013d4000
> > > [242060.604885] RIP: 0010:[]  [] 
> > > page_fault+0xd/0x30
> > > [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
> > > [242060.604895] RAX: aa40 RBX: 0001 RCX: 
> > > 81682237
> > > [242060.604896] RDX: aa40 RSI:  RDI: 
> > > 7fffa55eb078
> > > [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
> > > 
> > > [242060.604900] R10:  R11: 0293 R12: 
> > > 004a
> > > [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
> > > 7ffa3556cf20
> > > [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
> > > knlGS:
> > > [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
> > > [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
> > > 000427e0
> > > [242060.604909] Stack:
> > > [242060.604942] BUG: unable to handle kernel paging request at 
> > > 7fffa55eafb8
> > > [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
> > > [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
> > > [242060.605078] Oops:  [#1] PREEMPT SMP 
> > > [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
> > > nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
> > > sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
> > > ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
> > > algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> > > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> > > nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
> > > stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
> > > ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg 
> > > xfs libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
> > > iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec 
> > > snd_hwdep snd_pcm_oss snd_pcm
> > > [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp 
> > > kvm_intel kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 
> > > uvcvideo videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core 
> > > btusb v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 
> > > pcspkr e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei 
> > > cfg80211 shpchp wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq 
> > > i915 xhci_pci xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video 
> > > button processor sg loop
> > > [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
> > > W   4.0.0-rc3-2.gd5c547f-desktop #1
> > > [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
> > > (3.21 ) 12/13/2011
> > > [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
> > > 8801013d4000
> > > [242060.605396] RIP: 0010:[]  [] 
> > > show_stack_log_lvl+0x124/0x190
> > > [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
> > > [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
> > > 88023bc7ffc0
> > > [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
> > > 
> > > [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
> > > 0020
> > > [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
> > > 88023bc84f58
> > > [242060.605396] R13:  R14: 81a2fe15 R15: 
> > > 
> > > [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
> > > knlGS:
> > > [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
> > > [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 
> > > 000427e0
> > > [242060.605396] Stack:
> > > [242060.605396]  02d7e000 0008 88023bc84ee8 
> > > 7fffa55eafb8
> > > [242060.605396]   88023bc84f58 7fffa55eafb8 
> > > 0040
> > > [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 
> > > 81005c36
> > > 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Wed, 18 Mar 2015 15:16:42 +0100,
Takashi Iwai wrote:
> 
> At Sun, 15 Mar 2015 09:17:15 +0100,
> Stefan Seyfried wrote:
> > 
> > Hi all,
> > 
> > in 4.0-rc I have recently seen a few crashes, always when running
> > KVM guests (IIRC). Today I was able to capture a crash dump, this
> > is the backtrace from dmesg.txt:
> > 
> > [242060.604870] PANIC: double fault, error_code: 0x0
> > [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW  
> >  4.0.0-rc3-2.gd5c547f-desktop #1
> > [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 
> > ) 12/13/2011
> > [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
> > 8801013d4000
> > [242060.604885] RIP: 0010:[]  [] 
> > page_fault+0xd/0x30
> > [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
> > [242060.604895] RAX: aa40 RBX: 0001 RCX: 
> > 81682237
> > [242060.604896] RDX: aa40 RSI:  RDI: 
> > 7fffa55eb078
> > [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
> > 
> > [242060.604900] R10:  R11: 0293 R12: 
> > 004a
> > [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
> > 7ffa3556cf20
> > [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
> > knlGS:
> > [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
> > [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
> > 000427e0
> > [242060.604909] Stack:
> > [242060.604942] BUG: unable to handle kernel paging request at 
> > 7fffa55eafb8
> > [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
> > [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
> > [242060.605078] Oops:  [#1] PREEMPT SMP 
> > [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
> > nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
> > sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
> > ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
> > algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> > nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
> > stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
> > ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg 
> > xfs libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
> > iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec 
> > snd_hwdep snd_pcm_oss snd_pcm
> > [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp kvm_intel 
> > kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 uvcvideo 
> > videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core btusb 
> > v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 pcspkr 
> > e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei cfg80211 shpchp 
> > wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq i915 xhci_pci 
> > xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video button processor sg 
> > loop
> > [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW  
> >  4.0.0-rc3-2.gd5c547f-desktop #1
> > [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 
> > ) 12/13/2011
> > [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
> > 8801013d4000
> > [242060.605396] RIP: 0010:[]  [] 
> > show_stack_log_lvl+0x124/0x190
> > [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
> > [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
> > 88023bc7ffc0
> > [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
> > 
> > [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
> > 0020
> > [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
> > 88023bc84f58
> > [242060.605396] R13:  R14: 81a2fe15 R15: 
> > 
> > [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
> > knlGS:
> > [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
> > [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 
> > 000427e0
> > [242060.605396] Stack:
> > [242060.605396]  02d7e000 0008 88023bc84ee8 
> > 7fffa55eafb8
> > [242060.605396]   88023bc84f58 7fffa55eafb8 
> > 0040
> > [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 
> > 81005c36
> > [242060.605396] Call Trace:
> > [242060.605396]  [] show_regs+0x86/0x210
> > [242060.605396]  [] df_debug+0x1f/0x30
> > [242060.605396]  [] do_double_fault+0x84/0x100
> > [242060.605396]  [] double_fault+0x28/0x30
> > [242060.605396]  [] page_fault+0xd/0x30

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Wed, 18 Mar 2015 15:16:42 +0100,
Takashi Iwai wrote:
> 
> IIRC, this didn't happen with the early 4.0-rc, but can't say 100%
> sure.

I could reproduce the panic on 4.0-rc1, so scratch this comment.


Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Sun, 15 Mar 2015 09:17:15 +0100,
Stefan Seyfried wrote:
> 
> Hi all,
> 
> in 4.0-rc I have recently seen a few crashes, always when running
> KVM guests (IIRC). Today I was able to capture a crash dump, this
> is the backtrace from dmesg.txt:
> 
> [242060.604870] PANIC: double fault, error_code: 0x0
> [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW
>4.0.0-rc3-2.gd5c547f-desktop #1
> [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 
> 12/13/2011
> [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
> 8801013d4000
> [242060.604885] RIP: 0010:[]  [] 
> page_fault+0xd/0x30
> [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
> [242060.604895] RAX: aa40 RBX: 0001 RCX: 
> 81682237
> [242060.604896] RDX: aa40 RSI:  RDI: 
> 7fffa55eb078
> [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
> 
> [242060.604900] R10:  R11: 0293 R12: 
> 004a
> [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
> 7ffa3556cf20
> [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
> knlGS:
> [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
> [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
> 000427e0
> [242060.604909] Stack:
> [242060.604942] BUG: unable to handle kernel paging request at 
> 7fffa55eafb8
> [242060.604995] IP: [] show_stack_log_lvl+0x124/0x190
> [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
> [242060.605078] Oops:  [#1] PREEMPT SMP 
> [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
> nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc 
> fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp ppp_async 
> crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac algif_hash ctr 
> ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge stp 
> llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
> ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg xfs 
> libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
> iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep 
> snd_pcm_oss snd_pcm
> [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp kvm_intel 
> kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 uvcvideo 
> videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core btusb 
> v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 pcspkr 
> e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei cfg80211 shpchp 
> wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq i915 xhci_pci 
> xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video button processor sg 
> loop
> [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW
>4.0.0-rc3-2.gd5c547f-desktop #1
> [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 
> 12/13/2011
> [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
> 8801013d4000
> [242060.605396] RIP: 0010:[]  [] 
> show_stack_log_lvl+0x124/0x190
> [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
> [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
> 88023bc7ffc0
> [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
> 
> [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
> 0020
> [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
> 88023bc84f58
> [242060.605396] R13:  R14: 81a2fe15 R15: 
> 
> [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
> knlGS:
> [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
> [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 
> 000427e0
> [242060.605396] Stack:
> [242060.605396]  02d7e000 0008 88023bc84ee8 
> 7fffa55eafb8
> [242060.605396]   88023bc84f58 7fffa55eafb8 
> 0040
> [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 
> 81005c36
> [242060.605396] Call Trace:
> [242060.605396]  [] show_regs+0x86/0x210
> [242060.605396]  [] df_debug+0x1f/0x30
> [242060.605396]  [] do_double_fault+0x84/0x100
> [242060.605396]  [] double_fault+0x28/0x30
> [242060.605396]  [] page_fault+0xd/0x30
> [242060.605396] Code: fe a2 81 31 c0 89 54 24 08 48 89 0c 24 48 8b 5b f8 e8 
> cc 06 67 00 48 8b 0c 24 8b 54 24 08 85 d2 74 05 f6 c2 03 74 48 48 8d 43 08 
> <48> 8b 33 48 c7 c7 0d fe a2 81 89 54 24 14 48 89 4c 24 08 48 89 
> [242060.605396] RIP  [] 
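In a `Code:` line the byte in angle brackets is where RIP pointed. Decoding by hand around `<48>`, the bytes `48 8d 43 08` and `48 8b 33` appear to be `lea rax, [rbx+8]` and `mov rsi, [rbx]`. That agrees with the second oops, where RAX = 7fffa55eafc0 = RBX + 8 and CR2 = RBX = 7fffa55eafb8, the same value as the RSP reported for the double fault. The decoder below is a deliberately tiny, hypothetical helper covering only these encodings, not a general disassembler:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64-bit general-purpose registers in ModRM encoding order. */
static const char *const reg64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};

/*
 * Decode REX.W (0x48) followed by MOV r64, r/m64 (0x8b) or LEA r64, m (0x8d),
 * with ModRM mod 00 or 01, no SIB byte and no RIP-relative form.
 * Writes Intel-syntax text into out and returns the instruction length,
 * or 0 if the bytes fall outside this subset.
 */
static size_t decode_tiny(const uint8_t *p, size_t avail, char *out, size_t outlen)
{
    assert(p != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    if (avail < 3 || p[0] != 0x48 || (p[1] != 0x8b && p[1] != 0x8d))
        return 0;

    unsigned mod = p[2] >> 6, reg = (p[2] >> 3) & 7, rm = p[2] & 7;
    const char *op = (p[1] == 0x8b) ? "mov" : "lea";

    if (rm == 4 || (mod == 0 && rm == 5))   /* SIB or RIP-relative: not handled */
        return 0;
    if (mod == 0) {                         /* [reg], no displacement */
        snprintf(out, outlen, "%s %s, [%s]", op, reg64[reg], reg64[rm]);
        return 3;
    }
    if (mod == 1 && avail >= 4) {           /* [reg + signed 8-bit displacement] */
        int8_t disp = (int8_t)p[3];
        snprintf(out, outlen, "%s %s, [%s%+d]", op, reg64[reg], reg64[rm], disp);
        return 4;
    }
    return 0;
}
```

The earlier bytes `48 8b 5b f8` decode the same way to `mov rbx, [rbx-8]`.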

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Sun, 15 Mar 2015 09:17:15 +0100,
Stefan Seyfried wrote:
 
 Hi all,
 
 in 4.0-rc I have recently seen a few crashes, always when running
 KVM guests (IIRC). Today I was able to capture a crash dump, this
 is the backtrace from dmesg.txt:
 
 [242060.604870] PANIC: double fault, error_code: 0x0
 [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW
4.0.0-rc3-2.gd5c547f-desktop #1
 [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 
 12/13/2011
 [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
 8801013d4000
 [242060.604885] RIP: 0010:[816834ad]  [816834ad] 
 page_fault+0xd/0x30
 [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
 [242060.604895] RAX: aa40 RBX: 0001 RCX: 
 81682237
 [242060.604896] RDX: aa40 RSI:  RDI: 
 7fffa55eb078
 [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
 
 [242060.604900] R10:  R11: 0293 R12: 
 004a
 [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
 7ffa3556cf20
 [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
 knlGS:
 [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
 [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
 000427e0
 [242060.604909] Stack:
 [242060.604942] BUG: unable to handle kernel paging request at 
 7fffa55eafb8
 [242060.604995] IP: [81005b44] show_stack_log_lvl+0x124/0x190
 [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
 [242060.605078] Oops:  [#1] PREEMPT SMP 
 [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc 
 fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp ppp_async 
 crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac algif_hash ctr 
 ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
 nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge stp 
 llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
 ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg xfs 
 libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
 iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep 
 snd_pcm_oss snd_pcm
 [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp kvm_intel 
 kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 uvcvideo 
 videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core btusb 
 v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 pcspkr 
 e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei cfg80211 shpchp 
 wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq i915 xhci_pci 
 xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video button processor sg 
 loop
 [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW
4.0.0-rc3-2.gd5c547f-desktop #1
 [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 
 12/13/2011
 [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
 8801013d4000
 [242060.605396] RIP: 0010:[81005b44]  [81005b44] 
 show_stack_log_lvl+0x124/0x190
 [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
 [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
 88023bc7ffc0
 [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
 
 [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
 0020
 [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
 88023bc84f58
 [242060.605396] R13:  R14: 81a2fe15 R15: 
 
 [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
 knlGS:
 [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
 [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 
 000427e0
 [242060.605396] Stack:
 [242060.605396]  02d7e000 0008 88023bc84ee8 
 7fffa55eafb8
 [242060.605396]   88023bc84f58 7fffa55eafb8 
 0040
 [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 
 81005c36
 [242060.605396] Call Trace:
 [242060.605396]  [81005c36] show_regs+0x86/0x210
 [242060.605396]  [8104636f] df_debug+0x1f/0x30
 [242060.605396]  [810041a4] do_double_fault+0x84/0x100
 [242060.605396]  [81683088] double_fault+0x28/0x30
 [242060.605396]  [816834ad] page_fault+0xd/0x30
 [242060.605396] Code: fe a2 81 31 c0 89 54 24 08 48 89 0c 24 48 8b 5b f8 e8 
 cc 06 67 00 48 8b 0c 24 8b 54 24 08 85 d2 74 05 f6 c2 03 74 48 48 8d 43 08 
 48 8b 33 48 c7 c7 0d fe a2 81 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Wed, 18 Mar 2015 18:43:52 +0100,
Takashi Iwai wrote:
 
 At Wed, 18 Mar 2015 15:16:42 +0100,
 Takashi Iwai wrote:
  
  At Sun, 15 Mar 2015 09:17:15 +0100,
  Stefan Seyfried wrote:
   
   Hi all,
   
   in 4.0-rc I have recently seen a few crashes, always when running
   KVM guests (IIRC). Today I was able to capture a crash dump, this
   is the backtrace from dmesg.txt:
   
   [242060.604870] PANIC: double fault, error_code: 0x0
   [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
   W   4.0.0-rc3-2.gd5c547f-desktop #1
   [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
   (3.21 ) 12/13/2011
   [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
   8801013d4000
   [242060.604885] RIP: 0010:[816834ad]  [816834ad] 
   page_fault+0xd/0x30
   [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
   [242060.604895] RAX: aa40 RBX: 0001 RCX: 
   81682237
   [242060.604896] RDX: aa40 RSI:  RDI: 
   7fffa55eb078
   [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
   
   [242060.604900] R10:  R11: 0293 R12: 
   004a
   [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
   7ffa3556cf20
   [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
   knlGS:
   [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
   [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
   000427e0
   [242060.604909] Stack:
   [242060.604942] BUG: unable to handle kernel paging request at 
   7fffa55eafb8
   [242060.604995] IP: [81005b44] show_stack_log_lvl+0x124/0x190
   [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
   [242060.605078] Oops:  [#1] PREEMPT SMP 
   [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
   nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
   sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
   ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
   algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
   nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
   nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
   stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
   ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg 
   xfs libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
   iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec 
   snd_hwdep snd_pcm_oss snd_pcm
   [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp 
   kvm_intel kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 
   uvcvideo videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core 
   btusb v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 
   pcspkr e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei 
   cfg80211 shpchp wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq 
   i915 xhci_pci xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video 
   button processor sg loop
   [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G
   W   4.0.0-rc3-2.gd5c547f-desktop #1
   [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW 
   (3.21 ) 12/13/2011
   [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 
   8801013d4000
   [242060.605396] RIP: 0010:[81005b44]  [81005b44] 
   show_stack_log_lvl+0x124/0x190
   [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
   [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 
   88023bc7ffc0
   [242060.605396] RDX:  RSI: 88023bc84f58 RDI: 
   
   [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 
   0020
   [242060.605396] R10: 0afb R11: 88023bc84bee R12: 
   88023bc84f58
   [242060.605396] R13:  R14: 81a2fe15 R15: 
   
   [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() 
   knlGS:
   [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
   [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 
   000427e0
   [242060.605396] Stack:
   [242060.605396]  02d7e000 0008 88023bc84ee8 
   7fffa55eafb8
   [242060.605396]   88023bc84f58 7fffa55eafb8 
   0040
   [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 
   81005c36
   [242060.605396] Call Trace:
   [242060.605396]  [81005c36] show_regs+0x86/0x210
   [242060.605396]  [8104636f] df_debug+0x1f/0x30
   [242060.605396]  [810041a4] 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Takashi Iwai
At Wed, 18 Mar 2015 15:16:42 +0100,
Takashi Iwai wrote:
 
 At Sun, 15 Mar 2015 09:17:15 +0100,
 Stefan Seyfried wrote:
  
  Hi all,
  
  in 4.0-rc I have recently seen a few crashes, always when running
  KVM guests (IIRC). Today I was able to capture a crash dump, this
  is the backtrace from dmesg.txt:
  
  [242060.604870] PANIC: double fault, error_code: 0x0
  [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: GW  
   4.0.0-rc3-2.gd5c547f-desktop #1
  [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 
  ) 12/13/2011
  [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 
  8801013d4000
  [242060.604885] RIP: 0010:[816834ad]  [816834ad] 
  page_fault+0xd/0x30
  [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016
  [242060.604895] RAX: aa40 RBX: 0001 RCX: 
  81682237
  [242060.604896] RDX: aa40 RSI:  RDI: 
  7fffa55eb078
  [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09: 
  
  [242060.604900] R10:  R11: 0293 R12: 
  004a
  [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 
  7ffa3556cf20
  [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() 
  knlGS:
  [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
  [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 
  000427e0
  [242060.604909] Stack:
  [242060.604942] BUG: unable to handle kernel paging request at 
  7fffa55eafb8
  [242060.604995] IP: [81005b44] show_stack_log_lvl+0x124/0x190
  [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
  [242060.605078] Oops:  [#1] PREEMPT SMP 
  [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
  nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
  sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
  ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
  algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
  nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
  nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
  stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
  ip_tables x_tables af_packet bnep dm_crypt ecb cbc algif_skcipher af_alg 
  xfs libcrc32c snd_hda_codec_conexant snd_hda_codec_generic iTCO_wdt 
  iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec 
  snd_hwdep snd_pcm_oss snd_pcm
  [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp kvm_intel 
  kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 uvcvideo 
  videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core btusb 
  v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 pcspkr 
  e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei cfg80211 shpchp 
  wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq i915 xhci_pci 
  xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video button processor sg 
  loop
  [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G W 4.0.0-rc3-2.gd5c547f-desktop #1
  [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 12/13/2011
  [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 8801013d4000
  [242060.605396] RIP: 0010:[81005b44]  [81005b44] show_stack_log_lvl+0x124/0x190
  [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
  [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 88023bc7ffc0
  [242060.605396] RDX:  RSI: 88023bc84f58 RDI:
  [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 0020
  [242060.605396] R10: 0afb R11: 88023bc84bee R12: 88023bc84f58
  [242060.605396] R13:  R14: 81a2fe15 R15:
  [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() knlGS:
  [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
  [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 000427e0
  [242060.605396] Stack:
  [242060.605396]  02d7e000 0008 88023bc84ee8 7fffa55eafb8
  [242060.605396]   88023bc84f58 7fffa55eafb8 0040
  [242060.605396]  7ffa356b5d60 000f 7ffa3556cf20 81005c36
  [242060.605396] Call Trace:
  [242060.605396]  [81005c36] show_regs+0x86/0x210
  [242060.605396]  [8104636f] df_debug+0x1f/0x30
  [242060.605396]  [810041a4] do_double_fault+0x84/0x100
  [242060.605396]  [81683088] double_fault+0x28/0x30
  [242060.605396]  [816834ad] page_fault+0xd/0x30
  [242060.605396] Code: fe a2 81 

Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

2015-03-18 Thread Andy Lutomirski
On Wed, Mar 18, 2015 at 10:46 AM, Takashi Iwai ti...@suse.de wrote:
 At Wed, 18 Mar 2015 18:43:52 +0100,
 Takashi Iwai wrote:

 At Wed, 18 Mar 2015 15:16:42 +0100,
 Takashi Iwai wrote:
 
  At Sun, 15 Mar 2015 09:17:15 +0100,
  Stefan Seyfried wrote:
  
   Hi all,
  
   in 4.0-rc I have recently seen a few crashes, always when running
   KVM guests (IIRC). Today I was able to capture a crash dump, this
   is the backtrace from dmesg.txt:
  
   [242060.604870] PANIC: double fault, error_code: 0x0

OK, we double faulted.  Too bad that x86 CPUs don't tell us why.

   [242060.604878] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G W 4.0.0-rc3-2.gd5c547f-desktop #1
   [242060.604880] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 12/13/2011
   [242060.604883] task: 880103f46150 ti: 8801013d4000 task.ti: 8801013d4000
   [242060.604885] RIP: 0010:[816834ad]  [816834ad] page_fault+0xd/0x30

The double fault happened during page fault processing.  Could you
disassemble your page_fault function to find the offending
instruction?
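Finding the offending instruction starts from how the oops formats the IP: `page_fault+0xd/0x30` means the faulting instruction lies 0xd bytes into `page_fault`, a function 0x30 bytes long. The C sketch below only illustrates that arithmetic. The helpers `parse_sym_off` and `sym_start` are hypothetical, and the full 64-bit IP `0xffffffff816834ad` is an assumption, because the archived log appears to show only its low part (`816834ad`).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for a "symbol+0xOFF/0xSIZE" string. */
struct sym_off {
    char name[64];
    uint64_t off;   /* offset of the IP into the function */
    uint64_t size;  /* total size of the function in bytes */
};

/* Parse "name+0xOFF/0xSIZE"; returns 1 on success, 0 otherwise. */
static int parse_sym_off(const char *s, struct sym_off *out)
{
    /* %63[^+] reads the symbol name up to '+'; %x also accepts a 0x prefix. */
    return sscanf(s, "%63[^+]+%" SCNx64 "/%" SCNx64,
                  out->name, &out->off, &out->size) == 3;
}

/* The function starts 'off' bytes before the reported IP. */
static uint64_t sym_start(uint64_t ip, const struct sym_off *so)
{
    return ip - so->off;
}
```

With the start address in hand (0xffffffff816834a0 under the assumption above), one can disassemble `page_fault` from the matching vmlinux, for instance with gdb's `disassemble` command, and read the instruction at offset 0xd.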

   [242060.604893] RSP: 0018:7fffa55eafb8  EFLAGS: 00010016

Uh, what?  That RSP is a user address.
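Why this RSP stands out: on x86-64 with 4-level paging, addresses up to 0x00007fffffffffff belong to user space and kernel addresses start at 0xffff800000000000, so 7fffa55eafb8 is a user-stack value even though CS is 0010, the kernel code segment. The C sketch below encodes only that canonical-half split, using a hypothetical helper `classify_addr`. It assumes 48-bit virtual addresses; the kernel's real user limit (`TASK_SIZE_MAX`) sits one page below the boundary used here. The kernel stack pointer from the later oops is taken as 0xffff88023bc84e88, assuming the archive dropped its leading `ffff` group.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64 with 4-level paging (48-bit virtual addresses). */
#define USER_ADDR_MAX   0x00007fffffffffffULL /* top of the lower canonical half */
#define KERNEL_ADDR_MIN 0xffff800000000000ULL /* bottom of the upper canonical half */

enum addr_kind { ADDR_USER, ADDR_KERNEL, ADDR_NONCANONICAL };

/* Classify a virtual address by which canonical half it falls in. */
static enum addr_kind classify_addr(uint64_t addr)
{
    if (addr <= USER_ADDR_MAX)
        return ADDR_USER;
    if (addr >= KERNEL_ADDR_MIN)
        return ADDR_KERNEL;
    return ADDR_NONCANONICAL; /* the hole between the two halves */
}
```

Applied to the two dumps, the double-fault RSP classifies as a user address while the RSP of the nested oops classifies as a kernel address.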

   [242060.604895] RAX: aa40 RBX: 0001 RCX: 81682237
   [242060.604896] RDX: aa40 RSI:  RDI: 7fffa55eb078
   [242060.604898] RBP: 7fffa55f1c1c R08: 0008 R09:
   [242060.604900] R10:  R11: 0293 R12: 004a
   [242060.604902] R13: 7ffa356b5d60 R14: 000f R15: 7ffa3556cf20
   [242060.604904] FS:  7ffa33dbfa80() GS:88023bc8() knlGS:
   [242060.604906] CS:  0010 DS:  ES:  CR0: 80050033
   [242060.604908] CR2: 7fffa55eafa8 CR3: 02d7e000 CR4: 000427e0
   [242060.604909] Stack:
   [242060.604942] BUG: unable to handle kernel paging request at 7fffa55eafb8
   [242060.604995] IP: [81005b44] show_stack_log_lvl+0x124/0x190

This is suspicious.  We must have died again, of a fatal page fault,
while dumping the stack.

   [242060.605036] PGD 4779a067 PUD 40e3e067 PMD 4769e067 PTE 0
   [242060.605078] Oops:  [#1] PREEMPT SMP
   [242060.605106] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 
   nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
   sunrpc fscache nls_iso8859_1 nls_cp437 vfat fat ppp_deflate bsd_comp 
   ppp_async crc_ccitt ppp_generic slhc ses enclosure uas usb_storage cmac 
   algif_hash ctr ccm rfcomm fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
   nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
   nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp tun bridge 
   stp llc ebtable_filter ebtables ip6table_filter ip6_tables 
   iptable_filter ip_tables x_tables af_packet bnep dm_crypt ecb cbc 
   algif_skcipher af_alg xfs libcrc32c snd_hda_codec_conexant 
   snd_hda_codec_generic iTCO_wdt iTCO_vendor_support snd_hda_intel 
   snd_hda_controller snd_hda_codec snd_hwdep snd_pcm_oss snd_pcm
   [242060.605396]  dm_mod snd_seq snd_seq_device snd_timer coretemp 
   kvm_intel kvm snd_mixer_oss cdc_ether cdc_wdm cdc_acm usbnet mii arc4 
   uvcvideo videobuf2_vmalloc videobuf2_memops thinkpad_acpi videobuf2_core 
   btusb v4l2_common videodev i2c_i801 iwldvm bluetooth serio_raw mac80211 
   pcspkr e1000e iwlwifi snd lpc_ich mei_me ptp mfd_core pps_core mei 
   cfg80211 shpchp wmi soundcore rfkill battery ac tpm_tis tpm acpi_cpufreq 
   i915 xhci_pci xhci_hcd i2c_algo_bit drm_kms_helper drm thermal video 
   button processor sg loop
   [242060.605396] CPU: 1 PID: 2132 Comm: qemu-system-x86 Tainted: G W 4.0.0-rc3-2.gd5c547f-desktop #1
   [242060.605396] Hardware name: LENOVO 74665EG/74665EG, BIOS 6DET71WW (3.21 ) 12/13/2011
   [242060.605396] task: 880103f46150 ti: 8801013d4000 task.ti: 8801013d4000
   [242060.605396] RIP: 0010:[81005b44]  [81005b44] show_stack_log_lvl+0x124/0x190
   [242060.605396] RSP: 0018:88023bc84e88  EFLAGS: 00010046
   [242060.605396] RAX: 7fffa55eafc0 RBX: 7fffa55eafb8 RCX: 88023bc7ffc0
   [242060.605396] RDX:  RSI: 88023bc84f58 RDI:
   [242060.605396] RBP: 88023bc83fc0 R08: 81a2fe15 R09: 0020
   [242060.605396] R10: 0afb R11: 88023bc84bee R12: 88023bc84f58
   [242060.605396] R13:  R14: 81a2fe15 R15:
   [242060.605396] FS:  7ffa33dbfa80() GS:88023bc8() knlGS:
   [242060.605396] CS:  0010 DS:  ES:  CR0: 80050033
   [242060.605396] CR2: 7fffa55eafb8 CR3: 02d7e000 CR4: 000427e0
   [242060.605396] Stack:
   [242060.605396]  02d7e000 0008 88023bc84ee8
