Re: [XenPPC] [RFC] 'xm restore' following boot
On Dec 7, 2006, at 6:16 PM, Hollis Blanchard wrote:

> On Thu, 2006-12-07 at 17:11 -0500, poff wrote:
>> Also today there have been several runs similar to example 2. I modified
>> the python code to skip the 'unpause' at the end of domain restore. The
>> drill: boot, xend start, xm restore, then another activity (e.g. rebuild
>> the tools or search the kernel tree), and finally xm unpause. The guest
>> domain often runs OK! If the other activity is skipped and the restored
>> domain is unpaused immediately, it almost always wedges.
>
> Could this be an issue with flushing the icache?

We are still being _really_ dumb about the icache, flushing it on every switch in context_switch().

-JX

___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel
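[Editor's note: the blunt flush JX refers to amounts to the standard PowerPC instruction-cache maintenance sequence (dcbst each line, sync, icbi each line, isync). A minimal sketch follows; the 128-byte line size and the function name are assumptions for illustration, not the actual Xen context_switch() code, and the asm is compiled only on PowerPC so the line arithmetic stays portable.]

```c
#include <stdint.h>
#include <stddef.h>

#define ICACHE_LINE_SIZE 128  /* assumed 970-class line size */

/* Sketch: flush the icache over [start, start+len). Returns the
 * number of cache lines touched (for illustration only). */
static size_t flush_icache_range(uintptr_t start, size_t len)
{
    uintptr_t first = start & ~(uintptr_t)(ICACHE_LINE_SIZE - 1);
    uintptr_t end = start + len;
    uintptr_t p;
    size_t lines = 0;

    /* Push modified data lines out to memory. */
    for (p = first; p < end; p += ICACHE_LINE_SIZE, lines++) {
#ifdef __powerpc__
        __asm__ __volatile__("dcbst 0,%0" : : "r"(p) : "memory");
#endif
    }
#ifdef __powerpc__
    __asm__ __volatile__("sync");
#endif
    /* Invalidate the corresponding instruction-cache lines. */
    for (p = first; p < end; p += ICACHE_LINE_SIZE) {
#ifdef __powerpc__
        __asm__ __volatile__("icbi 0,%0" : : "r"(p) : "memory");
#endif
    }
#ifdef __powerpc__
    __asm__ __volatile__("sync; isync");
#endif
    return lines;
}
```

Doing this unconditionally on every context switch is correct but costly, which is the "dumb" part being discussed.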
Re: [XenPPC] [RFC] 'xm restore' following boot
I now think the console prints in the previous mail are useless: example 2 runs while example 3 wedges, yet the prints are roughly equivalent...

Also today there have been several runs similar to example 2. I modified the python code to skip the 'unpause' at the end of domain restore. The drill: boot, xend start, xm restore, then another activity (e.g. rebuild the tools or search the kernel tree), and finally xm unpause. The guest domain often runs OK! If the other activity is skipped and the restored domain is unpaused immediately, it almost always wedges. However, sometimes the restored domain will wedge regardless of other activity or multiple tries at restoring.

Earlier this week I was sure the wedging occurred due to interrupts or exceptions in a loop, but after placing some counters I see nothing when it wedges (via BUG() or printk()). I have not installed the gdb or tracing patches, thinking they would not help with interrupt problems.

Yi thought there may be some kernel initialization during boot that is missing with restore... Anyway, I see no way to proceed without knowing where the wedge occurs.
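[Editor's note: the python change described above (skipping the final unpause so the domain can be unpaused by hand later) can be sketched as below. The names are illustrative stand-ins, not the real xend/XendCheckpoint internals, which this thread does not show.]

```python
# Hypothetical sketch of making the post-restore unpause optional.
# 'load_state' and the returned domain object stand in for xend's
# real restore machinery.

def restore(load_state, unpause=True):
    dominfo = load_state()      # rebuild the domain from the saved image
    if unpause:
        dominfo.unpause()       # original behaviour: start running at once
    return dominfo              # with unpause=False, run 'xm unpause' later
```

With unpause=False the restored domain stays paused until an explicit 'xm unpause', which is the window in which the "other activity" above takes place.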
Re: [XenPPC] [RFC] 'xm restore' following boot
On Thu, 2006-12-07 at 17:11 -0500, poff wrote:
> Also today there have been several runs similar to example 2. I modified
> the python code to skip the 'unpause' at the end of domain restore. The
> drill: boot, xend start, xm restore, then another activity (e.g. rebuild
> the tools or search the kernel tree), and finally xm unpause. The guest
> domain often runs OK! If the other activity is skipped and the restored
> domain is unpaused immediately, it almost always wedges.

Could this be an issue with flushing the icache?

--
Hollis Blanchard
IBM Linux Technology Center
[XenPPC] [RFC] 'xm restore' following boot
'xm restore' immediately following boot usually wedges the cpu. However, 'xm save' followed by 'xm restore' works fine (even when the guest domain and htab are relocated to new memory areas). For reference, .plpar_hcall_norets is @ c003af78 and .HYPERVISOR_sched_op is @ c004415c.

(XEN) *** Dumping CPU3 state: ***
(XEN) [ Xen-3.0-unstable ]
(XEN) CPU: 0003 DOMID: 0001
(XEN) pc c003af88 msr 80009032
(XEN) lr c0044210 ctr c0044238
(XEN) srr0 srr1
(XEN) r00: 2448 c065bcb0 c0656630
(XEN) r04: 0001 2442 c000fc24
(XEN) r08: ecf515a8 c0044238 00989680 c00441a4
(XEN) r12: 01a9f9f8 c052e300
(XEN) r16:
(XEN) r20:
(XEN) r24: 4000 c000
(XEN) r28: 0010 c053d3c8 0001
(XEN) reprogram_timer[00] Timeout in the past 0x004332DBA479 0x0042C2424DF3

Here are typical consoles with debug prints and exceptions. If 'xm restore' is run several times, often it will start working, though the exceptions still occur... (the user domain has ramdisk networking). At the bottom is some code specified by a couple of the exceptions...

1. 'xm restore' following 'xm save':

cso84:~ # xm console 6
mfdec: -12
TIMEBASE_FREQ: 71592390
Here we're resuming
hid4: 0x62001242
arch_gnttab_map: grant table at d8008000
irq_resume()
switch_idle_mm()
mfdec: 14315899
__sti()
xencons_resume()
xenbus_resume()
smp_resume()
mfdec: 63024
returning
netfront: device eth0 has copying receive path.
[EMAIL PROTECTED] /]#

2. reboot with 'xm restore' that worked the 1st time:

cso84:~ # xm console 1
mfdec: -14
TIMEBASE_FREQ: 71592390
Here we're resuming
hid4: 0x60001241
arch_gnttab_map: grant table at d8008000
irq_resume()
switch_idle_mm()
mfdec: 14315924
__sti()
xencons_resume()
xenbus_resume()
BUG: soft lockup detected on CPU#0!
Call Trace:
[C065B090] [C001062C] .show_stack+0x50/0x1cc (unreliable)
[C065B140] [C008956C] .softlockup_tick+0x100/0x128
[C065B200] [C0065BC0] .run_local_timers+0x1c/0x30
[C065B280] [C0023C60] .timer_interrupt+0x108/0x4f0
[C065B3B0] [C00034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .handle_IRQ_event+0x4c/0x13c
    LR = .__do_IRQ+0x1ac/0x2b4
[C065B6A0] [C05AB7B0] 0xc05ab7b0 (unreliable)
[C065B740] [C0089FC8] .__do_IRQ+0x1ac/0x2b4
[C065B800] [C02B7134] .evtchn_do_upcall+0x128/0x1a4
[C065B8C0] [C0043664] .xen_get_irq+0x10/0x28
[C065B940] [C000BD7C] .do_IRQ+0x7c/0x100
[C065B9C0] [C00041EC] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .plpar_hcall_norets+0x10/0x1c
    LR = .HYPERVISOR_sched_op+0xb4/0x10c
[C065BCB0] [C00BDA74] .kmem_cache_free+0xe4/0x2f4 (unreliable)
[C065BD60] [C00455CC] .xen_power_save+0x80/0x98
[C065BDE0] [C00120E4] .cpu_idle+0x14c/0x154
[C065BE70] [C0009174] .rest_init+0x44/0x5c
[C065BEF0] [C04E58D8] .start_kernel+0x2a0/0x308
[C065BF90] [C00084FC] .start_here_common+0x50/0x54

smp_resume()
mfdec: 90178
returning
netfront: device eth0 has copying receive path.
[EMAIL PROTECTED] /]#

3. reboot with typical wedge:

cso84:~ # xm console 1
mfdec: -12
TIMEBASE_FREQ: 71592390
Here we're resuming
hid4: 0x60001241
arch_gnttab_map: grant table at d8008000
irq_resume()
switch_idle_mm()
mfdec: 14315903
__sti()
xencons_resume()
xenbus_resume()
smp_resume()
mfdec: 14218880
returning
BUG: soft lockup detected on CPU#0!
Call Trace:
[C065B090] [C001062C] .show_stack+0x50/0x1cc (unreliable)
[C065B140] [C008956C] .softlockup_tick+0x100/0x128
[C065B200] [C0065BC0] .run_local_timers+0x1c/0x30
[C065B280] [C0023C60] .timer_interrupt+0x108/0x4f0
[C065B3B0] [C00034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .handle_IRQ_event+0x4c/0x13c
    LR = .__do_IRQ+0x1ac/0x2b4
[C065B6A0] [C05AB7B0] 0xc05ab7b0 (unreliable)
[C065B740] [C0089FC8] .__do_IRQ+0x1ac/0x2b4
[C065B800] [C02B7134] .evtchn_do_upcall+0x128/0x1a4
[C065B8C0] [C0043664] .xen_get_irq+0x10/0x28
[C065B940] [C000BD7C] .do_IRQ+0x7c/0x100
[C065B9C0] [C00041EC] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .plpar_hcall_norets+0x10/0x1c
    LR = .HYPERVISOR_sched_op+0xb4/0x10c
[C065BCB0] [C00BDA74]