Re: [XenPPC] [RFC] 'xm restore' following boot

2006-12-08 Thread Jimi Xenidis


On Dec 7, 2006, at 6:16 PM, Hollis Blanchard wrote:

> On Thu, 2006-12-07 at 17:11 -0500, poff wrote:
>
>> Also today there have been several runs similar to example 2.
>> I modified the Python code to skip the 'unpause' at the end of
>> domain restore. The drill: boot, xend start, xm restore,
>> then another activity, e.g. rebuilding the tools or searching the
>> kernel tree, and finally xm unpause. The guest domain often runs OK!
>>
>> If the 'other activity' is skipped and the restored domain
>> is unpaused immediately, it almost always wedges.
>
> Could this be an issue with flushing the icache?



We are still being _really_ dumb about the icache and flushing it on  
every switch in context_switch().


-JX




___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] [RFC] 'xm restore' following boot

2006-12-07 Thread poff
I now think the console prints in previous mail are useless.
Example 2 runs while example 3 wedges, yet the prints are
roughly equivalent...

Also today there have been several runs similar to example 2.
I modified the Python code to skip the 'unpause' at the end of
domain restore. The drill: boot, xend start, xm restore,
then another activity, e.g. rebuilding the tools or searching the
kernel tree, and finally xm unpause. The guest domain often runs OK!

If the 'other activity' is skipped and the restored domain
is unpaused immediately, it almost always wedges.

However, sometimes the restored domain will wedge regardless
of other activity or multiple tries at restoring.

Earlier this week I was sure the wedging occurred due to interrupts
or exceptions in a loop. I have since placed some counters, but see
nothing when it wedges (via BUG() or printk()). I have not installed
the gdb or tracing patches, thinking they would not help with
interrupt problems.

Yi thought there may be some kernel initialization during boot
that is missing with restore...


Anyway I see no way to proceed without knowing where the wedge occurs.
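
The drill described above (boot, xend start, xm restore, some other activity, xm unpause) could be written down as a repeatable command sequence. This is only a sketch: it assumes xend has already been patched to skip the automatic unpause at the end of restore, and the checkpoint path, domain ID and "other activity" command are hypothetical placeholders. It only builds and prints the commands; it does not run them.

```python
import shlex


def restore_drill(checkpoint, domain, other_activity):
    """Return the delayed-unpause drill as a list of argv lists, in order.

    checkpoint, domain and other_activity are placeholders supplied by
    the caller; nothing here is executed.
    """
    return [
        ["xend", "start"],              # start the management daemon after boot
        ["xm", "restore", checkpoint],  # restore; patched xend leaves the domain paused
        shlex.split(other_activity),    # unrelated work before unpausing
        ["xm", "unpause", domain],      # finally let the guest run
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for argv in restore_drill("/tmp/guest.chk", "1", "make -C tools"):
        print(" ".join(argv))
```

Keeping the sequence in one place makes it easier to vary only the "other activity" step between runs and to note which variants wedge.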



Re: [XenPPC] [RFC] 'xm restore' following boot

2006-12-07 Thread Hollis Blanchard
On Thu, 2006-12-07 at 17:11 -0500, poff wrote:
>
> Also today there have been several runs similar to example 2.
> I modified the Python code to skip the 'unpause' at the end of
> domain restore. The drill: boot, xend start, xm restore,
> then another activity, e.g. rebuilding the tools or searching the
> kernel tree, and finally xm unpause. The guest domain often runs OK!
>
> If the 'other activity' is skipped and the restored domain
> is unpaused immediately, it almost always wedges.

Could this be an issue with flushing the icache?

-- 
Hollis Blanchard
IBM Linux Technology Center




[XenPPC] [RFC] 'xm restore' following boot

2006-12-06 Thread poff
'xm restore' immediately following boot usually wedges the CPU.
However, xm save followed by xm restore works fine (even when the
guest domain and htab are relocated to new memory areas).

^AAA shows the following, with .plpar_hcall_norets @ c003af78
and .HYPERVISOR_sched_op @ c004415c:
(XEN) *** Dumping CPU3 state: ***
(XEN) [ Xen-3.0-unstable ]
(XEN) CPU: 0003   DOMID: 0001
(XEN) pc c003af88 msr 80009032
(XEN) lr c0044210 ctr c0044238
(XEN) srr0  srr1 
(XEN) r00: 2448 c065bcb0 c0656630 
(XEN) r04: 0001  2442 c000fc24
(XEN) r08: ecf515a8 c0044238 00989680 c00441a4
(XEN) r12: 01a9f9f8 c052e300  
(XEN) r16:    
(XEN) r20:    
(XEN) r24:   4000 c000
(XEN) r28:  0010 c053d3c8 0001
(XEN) reprogram_timer[00] Timeout in the past 0x004332DBA479 0x0042C2424DF3


Here is typical console output with debug prints and exceptions.
If 'xm restore' is run several times, it will often start working,
though the exceptions still occur... (the user domain has a ramdisk and networking)
At the bottom, some code referenced by a couple of the exceptions...


1. 'xm restore' following xm save:

cso84:~ # xm console 6
mfdec: -12
TIMEBASE_FREQ: 71592390
Here we're resuming 
hid4: 0x62001242
arch_gnttab_map: grant table at d8008000
irq_resume() 
switch_idle_mm()
mfdec: 14315899
__sti()
xencons_resume() 
xenbus_resume()
smp_resume()
mfdec: 63024
returning
netfront: device eth0 has copying receive path.

[EMAIL PROTECTED] /]# 
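
The mfdec readings in the console output above can be put on a time scale. The sketch below assumes that each "mfdec:" line prints the PowerPC decrementer as a signed count of timebase ticks and that TIMEBASE_FREQ is in Hz; neither is stated in the log, and the helper name ticks_to_ms is made up for illustration.

```python
def ticks_to_ms(ticks, timebase_hz):
    """Convert a decrementer reading (timebase ticks) to milliseconds."""
    return ticks * 1000.0 / timebase_hz


TIMEBASE_FREQ = 71592390  # value printed in the console output

if __name__ == "__main__":
    # Second and third mfdec readings from example 1.
    for ticks in (14315899, 63024):
        print(f"{ticks} ticks = {ticks_to_ms(ticks, TIMEBASE_FREQ):.3f} ms")
```

Under those assumptions the second reading is roughly 200 ms until the next decrementer interrupt and the third is under 1 ms. A negative reading such as -12 would mean the decrementer had already expired. The same conversion can be applied to the readings in the other examples when comparing a working restore with a wedged one.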


2. reboot with 'xm restore' that worked 1st time:

cso84:~ # xm console 1
mfdec: -14
TIMEBASE_FREQ: 71592390
Here we're resuming 
hid4: 0x60001241
arch_gnttab_map: grant table at d8008000
irq_resume() 
switch_idle_mm()
mfdec: 14315924
__sti()
xencons_resume() 
xenbus_resume()
BUG: soft lockup detected on CPU#0!
Call Trace:
[C065B090] [C001062C] .show_stack+0x50/0x1cc (unreliable)
[C065B140] [C008956C] .softlockup_tick+0x100/0x128
[C065B200] [C0065BC0] .run_local_timers+0x1c/0x30
[C065B280] [C0023C60] .timer_interrupt+0x108/0x4f0
[C065B3B0] [C00034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .handle_IRQ_event+0x4c/0x13c
LR = .__do_IRQ+0x1ac/0x2b4
[C065B6A0] [C05AB7B0] 0xc05ab7b0 (unreliable)
[C065B740] [C0089FC8] .__do_IRQ+0x1ac/0x2b4
[C065B800] [C02B7134] .evtchn_do_upcall+0x128/0x1a4
[C065B8C0] [C0043664] .xen_get_irq+0x10/0x28
[C065B940] [C000BD7C] .do_IRQ+0x7c/0x100
[C065B9C0] [C00041EC] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .plpar_hcall_norets+0x10/0x1c
LR = .HYPERVISOR_sched_op+0xb4/0x10c
[C065BCB0] [C00BDA74] .kmem_cache_free+0xe4/0x2f4 (unreliable)
[C065BD60] [C00455CC] .xen_power_save+0x80/0x98
[C065BDE0] [C00120E4] .cpu_idle+0x14c/0x154
[C065BE70] [C0009174] .rest_init+0x44/0x5c
[C065BEF0] [C04E58D8] .start_kernel+0x2a0/0x308
[C065BF90] [C00084FC] .start_here_common+0x50/0x54
smp_resume()
mfdec: 90178
returning
netfront: device eth0 has copying receive path.

[EMAIL PROTECTED] /]# 


3. reboot with typical wedge:

cso84:~ # xm console 1
mfdec: -12
TIMEBASE_FREQ: 71592390
Here we're resuming 
hid4: 0x60001241
arch_gnttab_map: grant table at d8008000
irq_resume() 
switch_idle_mm()
mfdec: 14315903
__sti()
xencons_resume() 
xenbus_resume()
smp_resume()
mfdec: 14218880
returning
BUG: soft lockup detected on CPU#0!
Call Trace:
[C065B090] [C001062C] .show_stack+0x50/0x1cc (unreliable)
[C065B140] [C008956C] .softlockup_tick+0x100/0x128
[C065B200] [C0065BC0] .run_local_timers+0x1c/0x30
[C065B280] [C0023C60] .timer_interrupt+0x108/0x4f0
[C065B3B0] [C00034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .handle_IRQ_event+0x4c/0x13c
LR = .__do_IRQ+0x1ac/0x2b4
[C065B6A0] [C05AB7B0] 0xc05ab7b0 (unreliable)
[C065B740] [C0089FC8] .__do_IRQ+0x1ac/0x2b4
[C065B800] [C02B7134] .evtchn_do_upcall+0x128/0x1a4
[C065B8C0] [C0043664] .xen_get_irq+0x10/0x28
[C065B940] [C000BD7C] .do_IRQ+0x7c/0x100
[C065B9C0] [C00041EC] hardware_interrupt_entry+0xc/0x10
--- Exception: 501 at .plpar_hcall_norets+0x10/0x1c
LR = .HYPERVISOR_sched_op+0xb4/0x10c
[C065BCB0] [C00BDA74]