Re: [kvm-devel] still seeing network freezes with rtl8139 nic
Try adding the noapic option to your guest kernel. I re-ran that test on kvm-62 and my VM was able to run under load for more than 3-1/2 days (the network never locked up; I stopped the test to try other variations).

One side effect of the noapic option is that irq balancing is disabled -- all interrupts are delivered via CPU 0. I ran a few tests earlier this week without the noapic option (hence with the apic) but with irq balancing disabled and still had the lockups. It seems to be something specific to the apic.

david

Eckersid SIlapaswang wrote:
> david ahern <daahern at cisco.com> writes:
>> I know this issue has been discussed on this list before, but I am still experiencing network freezes in a guest that requires a restart to clear. When the network freezes in the guest I no longer see the network interrupts counter incrementing (i.e., the eth0 counter in /proc/interrupts in the guest). Using the crash utility, I verified that the interrupt is still enabled on the guest side and that no interrupts are pending. This suggests that the interrupts are not getting delivered to the VM.
>
> I just wanted to let the developers know that I'm having similar problems concerning interrupts with networking dying as well. Running a stress test of kvm using an EnGarde Secure Linux 1.5 guest OS. Under a heavy network email load, the guest OS networking gets knocked out - unable to ping, ssh, etc. Can only get things started again by going into vncviewer and restarting the networking services from there.
>
> CPUs: 8 x Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
> KVM 52-1
> Host Kernel: 2.6.25-rc2
> Kernel Arch: x86_64
> Guest OS: EnGarde Secure Linux 32bit i686, 2.4.31-1.5.60
> Command Line: /usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1
>
> Please let me know if you need any more information and if I could be of any assistance in providing information to have this issue resolved.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> Try adding the noapic option to your guest kernel. I re-ran that test on kvm-62 and my VM was able to run under load for more than 3-1/2 days (the network never locked up; I stopped the test to try other variations).
>
> One side effect of the noapic option is that irq balancing is disabled -- all interrupts are delivered via CPU 0. I ran a few tests earlier this week without the noapic option (hence with the apic) but with irq balancing disabled and still had the lockups. It seems to be something specific to the apic.

I got good results with apic and e1000. Can you try it? May be a guest driver bug.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
I did not have any better luck with the e1000 or pcnet nics when running kvm-61. I'll try again with kvm-63 and get back to you.

david

Avi Kivity wrote:
> david ahern wrote:
>> Try adding the noapic option to your guest kernel. I re-ran that test on kvm-62 and my VM was able to run under load for more than 3-1/2 days (the network never locked up; I stopped the test to try other variations).
>>
>> One side effect of the noapic option is that irq balancing is disabled -- all interrupts are delivered via CPU 0. I ran a few tests earlier this week without the noapic option (hence with the apic) but with irq balancing disabled and still had the lockups. It seems to be something specific to the apic.
>
> I got good results with apic and e1000. Can you try it? May be a guest driver bug.
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern <daahern at cisco.com> writes:
> I know this issue has been discussed on this list before, but I am still experiencing network freezes in a guest that requires a restart to clear. When the network freezes in the guest I no longer see the network interrupts counter incrementing (i.e., the eth0 counter in /proc/interrupts in the guest). Using the crash utility, I verified that the interrupt is still enabled on the guest side and that no interrupts are pending. This suggests that the interrupts are not getting delivered to the VM.

I just wanted to let the developers know that I'm having similar problems concerning interrupts with networking dying as well. Running a stress test of kvm using an EnGarde Secure Linux 1.5 guest OS. Under a heavy network email load, the guest OS networking gets knocked out - unable to ping, ssh, etc. Can only get things started again by going into vncviewer and restarting the networking services from there.

CPUs: 8 x Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
KVM 52-1
Host Kernel: 2.6.25-rc2
Kernel Arch: x86_64
Guest OS: EnGarde Secure Linux 32bit i686, 2.4.31-1.5.60
Command Line: /usr/bin/qemu-system -hda /root/images/bwimail01.img -boot c -m 384 -smp 4 -std-vga -net nic,vlan=0,macaddr=52:54:00:12:34:6F -net tap,ifname=tap1,script=/etc/qemu-ifup -vnc 192.168.1.57:1

Please let me know if you need any more information and if I could be of any assistance in providing information to have this issue resolved.
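For anyone who wants to automate the check described in this thread (watching whether the eth0 counter in the guest's /proc/interrupts keeps incrementing), here is a minimal sketch in C. The function names irq_total and counter_stalled are made up for illustration, and the parser assumes the usual layout of one line per irq with per-CPU counts followed by the controller and device name. It works on text already read from the file, so how and how often you sample is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on the /proc/interrupts line whose last field is
 * `name` (for example "eth0"). `text` holds the whole file content.
 * Returns the summed count, or -1 if no matching line is found.
 * Hypothetical helper, not part of kvm or qemu.
 */
long irq_total(const char *text, const char *name)
{
    size_t name_len = strlen(name);
    const char *line = text;

    while (*line != '\0') {
        const char *nl = strchr(line, '\n');
        size_t len = nl ? (size_t)(nl - line) : strlen(line);

        /* Ignore trailing blanks so the device name is the last field. */
        while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
            len--;

        const char *colon = memchr(line, ':', len);
        if (colon != NULL && len > name_len &&
            strncmp(line + len - name_len, name, name_len) == 0 &&
            (line[len - name_len - 1] == ' ' ||
             line[len - name_len - 1] == '\t')) {
            long total = 0;
            const char *p = colon + 1;
            const char *field_end = line + len - name_len;

            /* Add up the numeric columns; stop at the controller name. */
            while (p < field_end) {
                char *next;
                long count = strtol(p, &next, 10);
                if (next == p)
                    break;
                total += count;
                p = next;
            }
            return total;
        }

        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return -1;
}

/* A counter that was found and has not moved between two samples is stalled. */
int counter_stalled(long earlier, long later)
{
    return earlier >= 0 && later == earlier;
}
```

A caller would read /proc/interrupts twice, some seconds apart and under load, and pass both totals to counter_stalled. A stalled counter only says that the guest stopped seeing the interrupt; it does not say which side lost it.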
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> Almost 7 hours and the uniprocessor case is still chugging along.

How long does it usually take to hang?

How do I go about reproducing this? apachebench (host) against httpd (guest) doesn't seem to trigger it.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
Usually within a few hours, sometimes within 30 minutes.

Load averages as computed by sysstat in the nightly sar files:

          rxpck/s   txpck/s    rxbyt/s    txbyt/s
eth0       975.18   1188.34   82044.06  171655.38

Interrupts come in at 1830/sec for eth0, 250/sec for the timer and 20/sec for ide0.

Are you using a RHEL4 image?

david

Avi Kivity wrote:
> david ahern wrote:
>> Almost 7 hours and the uniprocessor case is still chugging along.
>
> How long does it usually take to hang?
>
> How do I go about reproducing this? apachebench (host) against httpd (guest) doesn't seem to trigger it.
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
Avi Kivity wrote:
> david ahern wrote:
>> Almost 7 hours and the uniprocessor case is still chugging along.
>
> How long does it usually take to hang?
>
> How do I go about reproducing this? apachebench (host) against httpd (guest) doesn't seem to trigger it.

ab (on host) against httpd (on guest) reproduces within a few minutes.

  uniprocessor guest: works
  smp guest: fails
  smp guest with one offlined cpu: works
  smp guest with httpd pinned to cpu 0: works

so it's either a guest driver smp problem, or an ioapic problem.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
I noted in my original post that if I increase the weight parameter in the driver, so that it pulls more packets on each poll before taking a break, then it takes longer to freeze.

I have looked at a newer version of the 8139cp driver. Very few changes to the poll function; most of them seem to be accommodating changes to the netdevice/napi api. I'll take a closer look at it today.

david

Avi Kivity wrote:
> Avi Kivity wrote:
>> david ahern wrote:
>>> Almost 7 hours and the uniprocessor case is still chugging along.
>>
>> How long does it usually take to hang?
>>
>> How do I go about reproducing this? apachebench (host) against httpd (guest) doesn't seem to trigger it.
>
> ab (on host) against httpd (on guest) reproduces within a few minutes.
>
>   uniprocessor guest: works
>   smp guest: fails
>   smp guest with one offlined cpu: works
>   smp guest with httpd pinned to cpu 0: works
>
> so it's either a guest driver smp problem, or an ioapic problem.
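For readers who have not met the weight parameter mentioned in this message: it caps how many packets one poll pass may take before the driver yields. Below is a toy model in C, with invented names (toy_ring, toy_poll, polls_until_drained), that only counts poll passes for a given backlog. It is not the 8139cp poll function and it does not try to explain the freeze; it just shows why a larger weight means fewer passes, and so fewer trips through the point where a NAPI-style driver would re-enable the receive interrupt.

```c
/* Hypothetical receive ring: only the number of waiting packets matters. */
struct toy_ring {
    unsigned int pending;
};

/*
 * One poll pass: take at most `weight` packets off the ring and return how
 * many were taken. A result below `weight` means the ring was drained, which
 * is where a NAPI-style driver would stop polling and re-enable the rx irq.
 */
unsigned int toy_poll(struct toy_ring *ring, unsigned int weight)
{
    unsigned int done = ring->pending < weight ? ring->pending : weight;

    ring->pending -= done;
    return done;
}

/* Count the poll passes needed until a pass reports the ring as drained. */
unsigned int polls_until_drained(struct toy_ring *ring, unsigned int weight)
{
    unsigned int polls = 0;
    unsigned int done;

    do {
        done = toy_poll(ring, weight);
        polls++;
    } while (weight > 0 && done == weight);

    return polls;
}
```

With a backlog of 100 packets, a weight of 16 needs seven passes in this model and a weight of 64 needs two. Whether that difference is what delays the freeze in the real guest is not something the model can show.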
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> Usually within a few hours, sometimes within 30 minutes.
>
> Load averages as computed by sysstat in the nightly sar files:
>
>           rxpck/s   txpck/s    rxbyt/s    txbyt/s
> eth0       975.18   1188.34   82044.06  171655.38
>
> Interrupts come in at 1830/sec for eth0, 250/sec for the timer and 20/sec for ide0.
>
> Are you using a RHEL4 image?

Nope, FC6 x86_64.

My apachebench command line is

    ab -c 150 -n 1000 -k http://guest/

12K eth interrupts/sec, 20 ide interrupts/sec, 250Hz timer. The served file is quite small, so many small packets.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> I've run a lot more tests:
>
> - if I remove the if (!change) return optimization from pci_set_irq the rtl8139 nic worked fine for 16+ hours. I'm not recommending this as a fix, just confirming that the problem goes away.

Interesting. What can cause this to happen?

- some non-pci device shares the same irq (unlikely)
- the pci link sharing is broken. Is the eth0 irq shared? Please post /proc/interrupts.
- the in-kernel ioapic is buggy and needs the extra kicking the optimization prevents. Can be checked by re-adding the optimization to kvm_ioapic_set_irq() (keeping it removed in qemu). If it works, the problem is in userspace. If it fails, the problem is in the kernel. Something like

    static int old_level[16];

    if (level == old_level[irq])
        return;
    old_level[irq] = level;

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
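To make the suggested experiment concrete, here is a small stand-alone model in C of the level filter quoted in this message. It is only a sketch: filtered_set_irq and the forwarded counter are invented for illustration, the counter stands in for whatever kvm_ioapic_set_irq() really does with a forwarded call, and nothing here is taken from the kvm or qemu sources beyond the three-line snippet itself.

```c
#define NUM_IRQS 16

static int old_level[NUM_IRQS];  /* last level seen per line, starts at 0 */
static int forwarded[NUM_IRQS];  /* calls that got past the filter */

/*
 * Model of the suggested filter: a call that repeats the level already
 * recorded for the line is dropped; only real transitions are forwarded.
 * Returns 1 if the call was forwarded, 0 if it was filtered out.
 */
int filtered_set_irq(int irq, int level)
{
    if (irq < 0 || irq >= NUM_IRQS)
        return 0;

    if (level == old_level[irq])
        return 0;                /* no change: this is the optimization */

    old_level[irq] = level;
    forwarded[irq]++;            /* stand-in for the real ioapic update */
    return 1;
}
```

In this model, raising irq 11 twice in a row forwards only the first call. That repeated call is the "extra kicking" the optimization prevents, so moving the filter from one layer to the other changes which layer still receives the repeated calls, which is what lets the test tell userspace from kernel.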
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
Avi Kivity wrote:
> david ahern wrote:
>> I've run a lot more tests:
>>
>> - if I remove the if (!change) return optimization from pci_set_irq the rtl8139 nic worked fine for 16+ hours. I'm not recommending this as a fix, just confirming that the problem goes away.
>
> Interesting. What can cause this to happen?
>
> - some non-pci device shares the same irq (unlikely)
> - the pci link sharing is broken. Is the eth0 irq shared?

interrupt is not shared.

> Please post /proc/interrupts.

# cat /proc/interrupts
           CPU0       CPU1
  0:      10566      46468    IO-APIC-edge  timer
  1:          5          5    IO-APIC-edge  i8042
  8:          0          1    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 11:     243118       5656   IO-APIC-level  eth0
 12:        180         45    IO-APIC-edge  i8042
 14:       2021      12592    IO-APIC-edge  ide0
 15:         14         10    IO-APIC-edge  ide1
NMI:          0          0
LOC:      56947      56946
ERR:          0
MIS:         31

> - the in-kernel ioapic is buggy and needs the extra kicking the optimization prevents. Can be checked by re-adding the optimization to kvm_ioapic_set_irq() (keeping it removed in qemu). If it works, the problem is in userspace. If it fails, the problem is in the kernel. Something like
>
>     static int old_level[16];
>
>     if (level == old_level[irq])
>         return;
>     old_level[irq] = level;

I'll give this a shot and let you know.

If you are interested, here's some more info on the -no-kvm-irqchip option: qemu ends up spinning with 1 thread consuming 100% cpu.

Output from top (literally the top 11 lines) with 'show threads' and individual cpu stats:

Tasks: 125 total,   2 running, 123 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4046804k total,  4013480k used,    33324k free,    42512k buffers
Swap:  2096472k total,      120k used,  2096352k free,  1159892k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4441 root      20   0 2675m 2.5g 9808 R  100 65.0 499:34.09 qemu-system-x86
 4426 root      20   0 2675m 2.5g 9808 S    1 65.0  16:24.50 qemu-system-x86
...

Hooking up gdb shows it cycling with the following backtrace:

(gdb) bt
#0  0x2ad97b5ee3e8 in do_sigtimedwait () from /lib64/libc.so.6
#1  0x2ad97b5ee4ae in sigtimedwait () from /lib64/libc.so.6
#2  0x004fb7df in kvm_eat_signal (env=0x2ade460, timeout=10) at /opt/kvm/kvm-61/qemu/qemu-kvm.c:156
#3  0x004fb9e4 in kvm_eat_signals (env=0x2ade460, timeout=10) at /opt/kvm/kvm-61/qemu/qemu-kvm.c:192
#4  0x004fba49 in kvm_main_loop_wait (env=0x2ade460, timeout=10) at /opt/kvm/kvm-61/qemu/qemu-kvm.c:211
#5  0x004fc278 in kvm_main_loop_cpu (env=0x2ade460) at /opt/kvm/kvm-61/qemu/qemu-kvm.c:299
#6  0x0040ff2d in main (argc=<value optimized out>, argv=0x7fff304607b8) at /opt/kvm/kvm-61/qemu/vl.c:7856

I have a dump of CPUX86State *env if you want to see it.

david
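The backtrace above bottoms out in sigtimedwait() called with a timeout. For readers unfamiliar with that call, here is a minimal sketch in C of waiting for one signal with a timeout. The function eat_one_signal is hypothetical and is not kvm_eat_signal from qemu-kvm.c. The millisecond unit is my assumption for the example, since the thread does not say what unit timeout=10 is in.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <time.h>

/*
 * Wait up to `timeout_ms` milliseconds for `signo`. The signal is blocked
 * first so that it stays pending instead of running a handler.
 * Returns the signal number if one was consumed, 0 on timeout, -1 on error.
 * Hypothetical helper for illustration only.
 */
int eat_one_signal(int signo, long timeout_ms)
{
    sigset_t set;
    siginfo_t info;
    struct timespec ts;
    int r;

    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    r = sigtimedwait(&set, &info, &ts);
    if (r == -1 && errno == EAGAIN)
        return 0;                /* timed out with nothing pending */
    return r;
}
```

A loop around a call like this wakes up at least once per timeout even when no signal arrives. That makes the timeout, and what the loop does between waits, the first places to look when such a thread is seen using a full cpu.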
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> Avi Kivity wrote:
>> - the in-kernel ioapic is buggy and needs the extra kicking the optimization prevents. Can be checked by re-adding the optimization to kvm_ioapic_set_irq() (keeping it removed in qemu). If it works, the problem is in userspace. If it fails, the problem is in the kernel. Something like
>>
>>     static int old_level[16];
>>
>>     if (level == old_level[irq])
>>         return;
>>     old_level[irq] = level;

With the if (!change) return; taken out of pci_set_irq() and the above code added to kvm_ioapic_set_irq() networking froze.

david
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> david ahern wrote:
>> Avi Kivity wrote:
>>> - the in-kernel ioapic is buggy and needs the extra kicking the optimization prevents. Can be checked by re-adding the optimization to kvm_ioapic_set_irq() (keeping it removed in qemu). If it works, the problem is in userspace. If it fails, the problem is in the kernel. Something like
>>>
>>>     static int old_level[16];
>>>
>>>     if (level == old_level[irq])
>>>         return;
>>>     old_level[irq] = level;
>
> With the if (!change) return; taken out of pci_set_irq() and the above code added to kvm_ioapic_set_irq() networking froze.

That points the finger at the kernel ioapic.

I saw from the /proc/interrupts dump that it's an smp guest. Does it freeze on uniprocessor as well? Maybe it's bad locking in the kernel.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
Almost 7 hours and the uniprocessor case is still chugging along.

david

Avi Kivity wrote:
> david ahern wrote:
>> david ahern wrote:
>>> Avi Kivity wrote:
>>>> - the in-kernel ioapic is buggy and needs the extra kicking the optimization prevents. Can be checked by re-adding the optimization to kvm_ioapic_set_irq() (keeping it removed in qemu). If it works, the problem is in userspace. If it fails, the problem is in the kernel. Something like
>>>>
>>>>     static int old_level[16];
>>>>
>>>>     if (level == old_level[irq])
>>>>         return;
>>>>     old_level[irq] = level;
>>
>> With the if (!change) return; taken out of pci_set_irq() and the above code added to kvm_ioapic_set_irq() networking froze.
>
> That points the finger at the kernel ioapic.
>
> I saw from the /proc/interrupts dump that it's an smp guest. Does it freeze on uniprocessor as well? Maybe it's bad locking in the kernel.
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
david ahern wrote:
> I know this issue has been discussed on this list before, but I am still experiencing network freezes in a guest that requires a restart to clear. When the network freezes in the guest I no longer see the network interrupts counter incrementing (i.e., the eth0 counter in /proc/interrupts in the guest). Using the crash utility, I verified that the interrupt is still enabled on the guest side and that no interrupts are pending. This suggests that the interrupts are not getting delivered to the VM.
>
> [...]
>
> I am continuing to look into the irq processing on the kvm/qemu side. I'd like to know if anyone has suggestions on what to look at. This is my first foray into the kvm and qemu code, and it's a lot to take in all at once.

Standard procedure is to run with -no-kvm and -no-kvm-irqchip, to see if the problem is in qemu proper, the in-kernel irq handling, or the rest of kvm.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] still seeing network freezes with rtl8139 nic
I've run a lot more tests:

- with the -no-kvm-irqchip option the vm eventually stops responding to network or console,
- with the -no-kvm option the performance is so bad I cannot get our ap up and running so the results are inconclusive,
- I've tried the e1000 and pcnet nic models and both showed the network lockup; with the ne2k_pci nic I did not see dropped packets and the network never locked up in 12+ hours, but system CPU time was 10% higher than when the rtl8139 nic was working
- if I remove the if (!change) return optimization from pci_set_irq the rtl8139 nic worked fine for 16+ hours. I'm not recommending this as a fix, just confirming that the problem goes away.
- I tried adding a thread mutex to the rtl8139 device model around accesses to the RTL8139State data, but the network still locked up.

david

Avi Kivity wrote:
> david ahern wrote:
>> I know this issue has been discussed on this list before, but I am still experiencing network freezes in a guest that requires a restart to clear. When the network freezes in the guest I no longer see the network interrupts counter incrementing (i.e., the eth0 counter in /proc/interrupts in the guest). Using the crash utility, I verified that the interrupt is still enabled on the guest side and that no interrupts are pending. This suggests that the interrupts are not getting delivered to the VM.
>>
>> [...]
>>
>> I am continuing to look into the irq processing on the kvm/qemu side. I'd like to know if anyone has suggestions on what to look at. This is my first foray into the kvm and qemu code, and it's a lot to take in all at once.
>
> Standard procedure is to run with -no-kvm and -no-kvm-irqchip, to see if the problem is in qemu proper, the in-kernel irq handling, or the rest of kvm.