Re: [Xen-ia64-devel] PATCH: slightly improve stability
Is there any reason why Anthony's patch was dropped?
I think this patch is also needed.
I got the following message. I guess the cause is as follows,
but this happens very rarely...

linux-2.6-xen-sparse/arch/ia64/xen/xenentry.S
Here psr.i and psr.ic are off.

    rse_clear_invalid:
            ...
    (pRecurse) br.call.dptk.few b0=rse_clear_invalid
            ;;
            mov loc8=0      // 0xa001000687c0: please notice ifs = 8000
            mov loc9=0

1. Right before mov loc8=0, the vcpu is switched to another cpu.
2. While the vcpu is waiting for a cpu, the TLB entry which backs the
   RSE stack is purged.
3. The vcpu gets a cpu again; a TLB miss fault occurs with isr.ir = 1.
4. Xen's ia64_do_page_fault() calls handle_lazy_cover(), which sets
   cr.ifs = 0.
5. Xen returns cpu execution to the guest.
6. mov loc8=0 is executed with cfm = 0.
   An Illegal Operation fault is raised.
7. priv_handle_op() is called, but it fails to emulate because
   mov loc8=0 isn't a privileged op.
8. ia64_handle_privop() calls panic_domain().

Thanks.

(XEN) priv_emulate: priv_handle_op fails, isr=0x0
(XEN) $ PANIC in domain 0 (k6=0xf41c8000): psr.ic off, delivering fault=5400,ipsr=101208026030,iip=a001000687c0,ifa=20144f60,isr=0,PSCB.iip=20144f60
(XEN)
(XEN) Call Trace:
(XEN) [f409e030] show_stack+0x80/0xa0
(XEN) sp=f41cfb80 bsp=f41c8e48
(XEN) [f407d780] panic_domain+0xf0/0x1d0
(XEN) sp=f41cfd50 bsp=f41c8de0
(XEN) [f40707b0] check_bad_nested_interruption+0x110/0x120
(XEN) sp=f41cfe00 bsp=f41c8db0
(XEN) [f4070a20] reflect_interruption+0x260/0x460
(XEN) sp=f41cfe00 bsp=f41c8d60
(XEN) [f409cba0] ia64_leave_kernel+0x0/0x310
(XEN) sp=f41cfe00 bsp=f41c8d60
(XEN) [a001000687c0] ???
(XEN) sp=f41d bsp=f41c8d60
(XEN) d 0xf7ffb208 domid 0
(XEN) vcpu 0xf41c8000 vcpu 3
(XEN)
(XEN) CPU 3
(XEN) psr : 101208026030 ifs : 8000 ip : [a001000687c0]
(XEN) ip is at ???
(XEN) unat: pfs : 8710 rsc : 00580008
(XEN) rnat: bsps: eb328fe8 pr : 0559a7a9
(XEN) ldrs: 0060 ccv : fpsr: 0009804c0270033f
(XEN) csd : ssd :
(XEN) b0 : a001000687c0 b6 : 20144f60 b7 : a0010640
(XEN) f6 : 1003e f7 : 0
(XEN) f8 : 100198ff97fe0 f9 : 1003eff05
(XEN) f10 : 1003e00b0 f11 : 1001192d7b6702eedd629
(XEN) r1 : 2021c278 r2 : c309 r3 : 6fc5e7e0
(XEN) r8 : 2003eff0 r9 : 0001 r10 :
(XEN) r11 : c593 r12 : 6fc5e7e0 r13 : 2048cac0
(XEN) r14 : 20144f60 r15 : 20217320 r16 : eb328fc8
(XEN) r17 : 02b0 r18 : 0058 r19 : 0058
(XEN) r20 : 0009804c8a70033f r21 : 20109c70 r22 :
(XEN) r23 : 6fff7fffc128 r24 : r25 :
(XEN) r26 : c48b r27 : 000f r28 : 20144f60
(XEN) r29 : 001308126030 r30 : 8002 r31 : 0559a361
(XEN) domain_crash_sync called from xenmisc.c:194
(XEN) Domain 0 (vcpu#3) crashed on cpu#3:
(XEN) d 0xf7ffb208 domid 0
(XEN) vcpu 0xf41c8000 vcpu 3

On Fri, Apr 28, 2006 at 11:18:45AM +0800, Xu, Anthony wrote:
> Hi Tristan,
>
> Could you please check whether this patch addresses the RSE issue?
> Yes, the Intel QA team is doing the test in the meantime.
>
> Thanks,
> -Anthony
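Steps 4 to 6 of Isaku's sequence above come down to one check: once the frame marker is zero, no stacked register belongs to the current frame, so an ordinary `mov loc8=0` raises an Illegal Operation fault. The following is a minimal C sketch of that check, not code from Xen. It relies on the architectural fact that the size-of-frame field is the low 7 bits of the frame marker; the function names and the mapping of loc8 to r42 (two input registers ahead of the locals) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size-of-frame (sof) occupies the low 7 bits of the IA-64 frame marker. */
static unsigned frame_size(uint64_t cfm)
{
    return (unsigned)(cfm & 0x7f);
}

/* Stacked registers start at r32; rN is usable only if it lies inside
 * the current frame, i.e. N - 32 < sof. */
static bool stacked_reg_in_frame(uint64_t cfm, unsigned reg)
{
    return reg >= 32 && (reg - 32) < frame_size(cfm);
}

/* Hypothetical register number for loc8, assuming two input registers
 * (in0, in1) precede the locals. This layout is an assumption. */
enum { SKETCH_LOC8 = 32 + 2 + 8 };
```

With a 16-register frame, `stacked_reg_in_frame(0x10, SKETCH_LOC8)` holds; with `cfm = 0` it does not, which corresponds to step 6.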
Re: [Xen-ia64-devel] PATCH: slightly improve stability
On Fri, 2006-06-23 at 18:19 +0900, Isaku Yamahata wrote:
> Is there any reason why Anthony's patch was dropped?
> I think this patch is also needed.

I don't recall specifically, but I would guess it was because there
were several test patches tagged onto this thread and, while trying to
parse out the important parts, I thought the minstate.h changes
superseded these. I can add in the rest as well. I've seen the same
panic on rare occasion. Thanks,

    Alex

--
Alex Williamson                             HP Open Source & Linux Org.

_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@lists.xensource.com
http://lists.xensource.com/xen-ia64-devel
Re: [Xen-ia64-devel] PATCH: slightly improve stability
On Fri, 2006-06-23 at 18:19 +0900, Isaku Yamahata wrote:
> Is there any reason why Anthony's patch was dropped?
> I think this patch is also needed.

I went ahead and applied this. Thanks,

    Alex

--
Alex Williamson                             HP Open Source & Linux Org.
RE: [Xen-ia64-devel] PATCH: slightly improve stability
On Sun, 2006-04-30 at 13:43 +0800, Xu, Anthony wrote:
> --- a/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h  Thu Apr 27 02:55:42 2006
> +++ b/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h  Sat Apr 29 13:14:58 2006
> @@ -155,6 +155,8 @@
>      ;;                                              \
>      ld4 r30=[r8];                                   \
>      ;;                                              \
> +    /* set XSI_INCOMPL_REGFR 0 */                   \
> +    st4 [r8]=r0;                                    \
>      cmp.eq p6,p7=r30,r0;                            \
>      ;; /* not sure if this stop bit is necessary */ \
>  (p6) adds r8=XSI_PRECOVER_IFS-XSI_INCOMPL_REGFR,r8;

Applied.

--
Alex Williamson                             HP Linux & Open Source Lab
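The hunk above loads the flag into r30, clears the flag in memory, and only afterwards tests the old value. The same read-then-clear pattern can be sketched in C as follows. The struct and field names are made up for illustration; they stand in for the XSI_INCOMPL_REGFR word addressed by r8 and are not the real shared-info layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the word at XSI_INCOMPL_REGFR. */
struct sketch_shared_info {
    uint32_t incomplete_regframe;
};

/* Read the flag and clear it, returning the value seen before the clear. */
static uint32_t read_and_clear_incompl(struct sketch_shared_info *s)
{
    uint32_t old = s->incomplete_regframe; /* ld4 r30=[r8]            */
    s->incomplete_regframe = 0;            /* st4 [r8]=r0 (the patch) */
    return old;                            /* later: cmp.eq p6,p7=r30,r0 */
}
```

The caller still branches on the old value, so behaviour on this entry is unchanged; only later readers of the flag see it cleared.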
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Excellent! I agree you have found a very difficult bug!
I am now up to 167 linux compiles with no segfaults!
Congratulations Anthony!

One minor suggestion: I think the newly added store can be
in the same cycle as the previous load (no stop bit needed).
I didn't look at the bundling... perhaps it doesn't matter.

Dan
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Hi Dan,

Yes, we also got a segmentation fault in 1 run out of 30.
Could you please try this new patch?

Thanks,
-Anthony

-----Original Message-----
From: Magenheimer, Dan (HP Labs Fort Collins) [mailto:[EMAIL PROTECTED]
Sent: April 28, 2006 22:49
To: Xu, Anthony; Tristan Gingold; xen-ia64-devel@lists.xensource.com; Williamson, Alex (Linux Kernel Dev)
Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability

Hi Anthony --

I tried your patch overnight and still got a segmentation
fault in 1 run out of 50. I didn't try Tristan's patch yet,
so will try both at the same time next... perhaps there
are two different problems that show up as the segmentation
fault.

Dan

-----Original Message-----
From: Xu, Anthony [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 27, 2006 9:19 PM
To: Xu, Anthony; Tristan Gingold; xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Williamson, Alex (Linux Kernel Dev)
Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability

Hi Tristan,

Could you please check whether this patch addresses the RSE issue?
Yes, the Intel QA team is doing the test in the meantime.

Thanks,
-Anthony

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xu, Anthony
Sent: April 28, 2006 9:48
To: Tristan Gingold; xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Alex Williamson
Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability

>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tristan Gingold
>Sent: April 27, 2006 23:14
>To: xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Alex Williamson
>Subject: [Xen-ia64-devel] PATCH: slightly improve stability
>
>Hi,
>
>as reported earlier, this patch seems to improve stability: crashes are at
>least more coherent and maybe less frequent.
>
>RSE handling seems to have a bug: crashes are now due to either a bad value
>in a stacked register or a use of an invalid stacked register (although cfm
>seems correct in gdb!)

I'm looking at this too. Yes, there is a bug in handle_lazy_cover.

    void ia64_do_page_fault (unsigned long address, unsigned long isr,
                             struct pt_regs *regs, unsigned long itir)
    {
        unsigned long iip = regs->cr_iip, iha;
        // FIXME should validate address here
        unsigned long pteval;
        unsigned long is_data = !((isr >> IA64_ISR_X_BIT) & 1UL);
        IA64FAULT fault;

        if ((isr & IA64_ISR_IR) && handle_lazy_cover(current, isr, regs))
            return;

This code sequence is intended to handle the following scenario.

1. The guest executes br.ret. This may cause a mandatory RSE load, and this
   load may cause a TLB miss.
2. The VMM gets control, but the VMM can't handle this TLB miss itself, so it
   injects the TLB miss into the guest's TLB miss handler. When the VMM
   executes rfi to jump to the guest TLB miss handler, the TLB miss happens
   again.
3. At this time interrupt_collection_enabled is 0, so handle_lazy_cover
   executes cover on behalf of the guest and returns to the guest TLB miss
   handler again. This time there is no TLB miss.

The following code sequence is in the ia64_leave_kernel path, with psr.ic and
psr.i off. When br.ret.dptk.many b0 is executed, there may be a mandatory
load, and thus there may be a TLB miss. According to the description above,
handle_lazy_cover executes cover on behalf of the guest and returns to the
guest. This is not correct in this scenario.

I didn't find an easy way to fix this bug.

            mov loc6=0
            mov loc7=0
    (pRecurse) br.call.dptk.few b0=rse_clear_invalid
            ;;
            mov loc8=0
            mov loc9=0
            cmp.ne pReturn,p0=r0,in1    // if recursion count != 0, we need to do a br.ret
            mov loc10=0
            mov loc11=0
    (pReturn) br.ret.dptk.many b0
    #endif /* !CONFIG_ITANIUM */
    #       undef pRecurse
    #       undef pReturn
            ;;
            alloc r17=ar.pfs,0,0,0,0    // drop current register frame
            ;;
            loadrs

Thanks,
Anthony

>Tested by doing many linux kernel compilations in SMP domU (> 100).
>
>Tristan.

RSE_incomplete_cfm.patch
Description: RSE_incomplete_cfm.patch
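The guard in ia64_do_page_fault quoted above, together with Anthony's description of handle_lazy_cover, can be modelled in a few lines of C. This is a sketch under stated assumptions, not the Xen implementation: the struct names are invented, and the body of `sketch_lazy_cover` is inferred from this thread (cr.ifs is zeroed, INCOMPL is set, and a pre-cover ifs is kept, as the XSI_PRECOVER_IFS name in the patch suggests).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; these are not the real Xen structures. */
struct sketch_regs {
    uint64_t cr_ifs;
};

struct sketch_vcpu {
    bool     interrupt_collection_enabled; /* guest's virtual psr.ic   */
    bool     incomplete_regframe;          /* the "INCOMPL" flag       */
    uint64_t precover_ifs;                 /* ifs saved before "cover" */
};

/* Assumed behaviour: with virtual psr.ic off, do the cover for the guest. */
static bool sketch_lazy_cover(struct sketch_vcpu *v, struct sketch_regs *regs)
{
    if (v->interrupt_collection_enabled)
        return false;              /* normal path: reflect the fault       */
    v->precover_ifs = regs->cr_ifs;
    v->incomplete_regframe = true;
    regs->cr_ifs = 0;              /* guest resumes with an empty frame    */
    return true;                   /* caller returns and the guest retries */
}

/* Mirrors the quoted guard: only RSE-originated faults (isr.ir) qualify. */
static bool sketch_fault_takes_lazy_cover(struct sketch_vcpu *v,
                                          struct sketch_regs *regs,
                                          bool isr_ir)
{
    return isr_ir && sketch_lazy_cover(v, regs);
}
```

The model makes the problem visible: the condition depends only on isr.ir and the virtual psr.ic, so a mandatory RSE load in the ia64_leave_kernel path (psr.ic off) takes the same branch as the TLB-miss-handler case it was written for.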
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Hi Anthony --

With both Tristan's stability patch and your earlier patch,
I have completed 103 linux compiles now with no segfaults yet.
Did you see your segfault with Tristan's patch included?

I'll continue running over the weekend with the bits I have,
but if I see a segfault I will add in the additional store
in Xen entry (minstate.h) from your newer patch.

Dan
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Argh! After 103 successful linux compiles, two of the next 10
had a segfault. Restarting again with Anthony's updated patch
(plus Tristan's stability patch)...
RE: [Xen-ia64-devel] PATCH: slightly improve stability
With this new patch (not including Tristan's stability patch so far), we can
successfully finish 50 linux compiles.
We'll continue the test.

Thanks,
-Anthony
RE: [Xen-ia64-devel] PATCH: slightly improve stability
>From: Magenheimer, Dan
>Sent: April 29, 2006 21:58
>To: Xu, Anthony; Tristan Gingold; xen-ia64-devel@lists.xensource.com; Williamson, Alex (Linux Kernel Dev)
>Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
>
>Hi Anthony --
>
>With both Tristan's stability patch and your earlier patch,
>I have completed 103 linux compiles now with no segfaults yet.
>Did you see your segfault with Tristan's patch included?
>
>I'll continue running over the weekend with the bits I have,
>but if I see a segfault I will add in the additional store
>in Xen entry (minstate.h) from your newer patch.

--- a/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h  Thu Apr 27 02:55:42 2006
+++ b/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h  Sat Apr 29 13:14:58 2006
@@ -155,6 +155,8 @@
     ;;                                              \
     ld4 r30=[r8];                                   \
     ;;                                              \
+    /* set XSI_INCOMPL_REGFR 0 */                   \
+    st4 [r8]=r0;                                    \
     cmp.eq p6,p7=r30,r0;                            \
     ;; /* not sure if this stop bit is necessary */ \
 (p6) adds r8=XSI_PRECOVER_IFS-XSI_INCOMPL_REGFR,r8;

The additional store is necessary.

In theory, after the guest executes cover, the incomplete frame becomes a
complete frame. So the guest should set INCOMPL to 0 just after cover, or at
least before guest psr.ic and psr.i are turned on.
Previously, INCOMPL was set to 0 only when the guest executed rfi.
The window between cover and rfi causes trouble in the scenario below.

1. Application A makes a system call.
2. At entry to the OS break handler, INCOMPL is 0. Because this is a system
   call, the Linux kernel doesn't execute cover.
3. Before returning to application A, a schedule happens and application B
   begins to run.
4. A TLB miss happens in the context of B, which may set INCOMPL to 1. Before
   returning to B (that is, rfi has not been executed and INCOMPL is still 1),
   a schedule happens again. A resumes running with INCOMPL 1 (this is now
   incorrect).
5. As mentioned before, this is a system call, so cover is executed in the
   ia64_leave_kernel path. Because INCOMPL is 1, this cover is not actually
   executed, although it should be.
6. Thus application A's frame is destroyed.

The issue appears. I did catch this scenario.

Thanks,
Anthony
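Anthony's scenario above can be replayed as a tiny state model, which may help when reasoning about why clearing the flag at entry closes the window. This is only a sketch: every name is hypothetical, the flag is modelled as per-vcpu state shared by applications A and B, and the rule that the exit path skips cover while INCOMPL is 1 is taken from step 5 of the description, not from the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-vcpu state shared by both guest applications (hypothetical). */
struct sketch_vcpu_state {
    int incompl;
};

/* Step 4: a TLB miss in B's context may leave INCOMPL set to 1. */
static void sketch_tlb_miss_in_b(struct sketch_vcpu_state *v)
{
    v->incompl = 1;
}

/* Guest fault entry: with the patch, the flag is cleared after it is read. */
static void sketch_guest_fault_entry(struct sketch_vcpu_state *v, bool patched)
{
    int was_incomplete = v->incompl;  /* ld4 r30=[r8] */
    (void)was_incomplete;             /* the real code branches on this */
    if (patched)
        v->incompl = 0;               /* st4 [r8]=r0 */
}

/* Step 5: A's syscall-return path really executes cover only if INCOMPL is 0. */
static bool sketch_cover_executed_for_a(bool patched)
{
    struct sketch_vcpu_state v = { 0 };    /* steps 1-2: A enters, INCOMPL 0 */
    sketch_tlb_miss_in_b(&v);              /* steps 3-4: B runs, takes a miss */
    sketch_guest_fault_entry(&v, patched);
    /* schedule back to A before B's rfi could clear the flag */
    return v.incompl == 0;
}
```

In this model `sketch_cover_executed_for_a(false)` is false (A's frame would be lost) and `sketch_cover_executed_for_a(true)` is true.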
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Hi Anthony -- I tried your patch overnight and still got a segmentation fault in 1 run out of 50. I didn't try Tristan's patch yet, so will try both at the same time next... perhaps there are two different problems that show up as the segmentation fault. Dan -Original Message- From: Xu, Anthony [mailto:[EMAIL PROTECTED] Sent: Thursday, April 27, 2006 9:19 PM To: Xu, Anthony; Tristan Gingold; xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Williamson, Alex (Linux Kernel Dev) Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability Hi Tristan, Could you please check whether this patch address RSE issue? Yes, Intel QA team is doing the test in the meantime. Thanks, -Anthony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xu, Anthony Sent: 2006?4?28? 9:48 To: Tristan Gingold; xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Alex Williamson Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tristan Gingold Sent: 2006?4?27? 23:14 To: xen-ia64-devel@lists.xensource.com; Magenheimer, Dan (HP Labs Fort Collins); Alex Williamson Subject: [Xen-ia64-devel] PATCH: slightly improve stability Hi, as reported earlier, this patch seems to improve stability: crashes are at least more coherent and maybe less frequent. RSE handling seems to have a bug: crahes are now due to either a bad value in a stacked register or a use of an invalid stacked register (although cfm seems correct in gdb!) I'm looking at this too, Yes there is a bug about handle_lazy_cover. 

void ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *regs, unsigned long itir)
{
    unsigned long iip = regs->cr_iip, iha;
    // FIXME should validate address here
    unsigned long pteval;
    unsigned long is_data = !((isr >> IA64_ISR_X_BIT) & 1UL);
    IA64FAULT fault;

    if ((isr & IA64_ISR_IR) && handle_lazy_cover(current, isr, regs)) return;

This code sequence is intended to handle the following scenario.
1. The guest executes br.ret, which may cause a mandatory RSE load, and this load may cause a TLB miss.
2. The VMM gets control, but it can't handle this TLB miss itself, so it injects the TLB miss into the guest TLB miss handler. When the VMM executes rfi to jump to the guest TLB miss handler, the same TLB miss happens again.
3. At this time, interrupt_collection_enabled is 0, so handle_lazy_cover executes cover on behalf of the guest and returns to the guest TLB miss handler again; this time there is no TLB miss.

The following code sequence is in the ia64_leave_kernel path, with psr.ic and psr.i off. When br.ret.dptk.many b0 is executed, there may be a mandatory load, and thus there may be a TLB miss. According to the description above, handle_lazy_cover executes cover on behalf of the guest and returns to the guest, which is not correct in this scenario.
I didn't find an easy way to fix this bug.

        mov loc6=0
        mov loc7=0
(pRecurse) br.call.dptk.few b0=rse_clear_invalid
        ;;
        mov loc8=0
        mov loc9=0
        cmp.ne pReturn,p0=r0,in1    // if recursion count != 0, we need to do a br.ret
        mov loc10=0
        mov loc11=0
(pReturn) br.ret.dptk.many b0
#endif /* !CONFIG_ITANIUM */
#undef pRecurse
#undef pReturn
        ;;
        alloc r17=ar.pfs,0,0,0,0    // drop current register frame
        ;;
        loadrs

Thanks,
Anthony

> Tested by doing many Linux kernel compilations in SMP domU (> 100).
> Tristan.

_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@lists.xensource.com
http://lists.xensource.com/xen-ia64-devel
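
For readers following the thread, the check at the top of ia64_do_page_fault and the two scenarios described above can be modelled in a few lines of standalone C. This is only a sketch under assumptions: the toy_vcpu and toy_regs structs, the TOY_ISR_IR value, the saved_ifs field, and both toy_ functions are hypothetical stand-ins for illustration, not the Xen definitions. It is meant to show that the handler only sees "isr.ir set and interrupt collection off", so it clears cr.ifs and retries in both cases, whether the fault came from the rfi into the guest TLB miss handler or from the br.ret in the ia64_leave_kernel path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the isr.ir flag; not copied from the Xen headers. */
#define TOY_ISR_IR (UINT64_C(1) << 38)

/* Toy model of the per-vcpu state mentioned in the thread. */
struct toy_vcpu {
    bool interrupt_collection_enabled; /* the guest's virtual psr.ic */
    uint64_t saved_ifs;                /* assumption: the old cr.ifs is parked here */
};

/* Toy model of the interrupted register state. */
struct toy_regs {
    uint64_t cr_ifs;
};

/*
 * Sketch of the lazy-cover decision as described in the thread: when the
 * guest's interruption collection is off, clear cr.ifs on the guest's
 * behalf and ask the caller to retry the faulting instruction.
 */
static bool toy_handle_lazy_cover(struct toy_vcpu *v, struct toy_regs *regs)
{
    if (!v->interrupt_collection_enabled) {
        v->saved_ifs = regs->cr_ifs;
        regs->cr_ifs = 0;
        return true;   /* retry with an empty current frame */
    }
    return false;      /* fall through to normal page-fault handling */
}

/* Mirrors the shape of the check quoted from ia64_do_page_fault. */
static bool toy_page_fault_retries(struct toy_vcpu *v, uint64_t isr,
                                   struct toy_regs *regs)
{
    assert(v != NULL && regs != NULL);
    return (isr & TOY_ISR_IR) && toy_handle_lazy_cover(v, regs);
}
```

Calling toy_page_fault_retries with TOY_ISR_IR set and interrupt_collection_enabled false clears cr_ifs regardless of what the guest was doing. In this model the two scenarios are indistinguishable from the handler's inputs alone, which is consistent with the remark that there is no easy fix.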
RE: [Xen-ia64-devel] PATCH: slightly improve stability
Hi Tristan,

Could you please check whether this patch addresses the RSE issue? Yes, Intel QA team is doing the test in the meantime.

Thanks,
-Anthony

rse.patch
Description: rse.patch