Bug#651914: linux-image-2.6.32-5-xen-amd64: Poor IPv6 performance from Xen guest; GSO-related?

2011-12-12 Thread Andy Smith
Package: linux-2.6
Version: 2.6.32-38
Severity: important
Tags: ipv6

I have three squeeze servers running:

ii  linux-image-2.6.32-5-xen-amd64  2.6.32-38  Linux 2.6.32 for 64-bit PCs, Xen dom0 support

All three servers have Intel gigabit NICs, but one server uses the
e1000e driver and the other two use the igb driver.

They've been in production for around 6 months now and, somewhat
embarrassingly, we've only just discovered a problem with IPv6
performance on the two servers with the igb driver.

The problem manifests itself as awful TCP performance to a Xen domU,
on the order of 15-30KB/sec data transfer. Doing the same data
transfer from the server dom0 itself does not show the same issue,
and the expected tens of MB/sec data transfer is achieved.

Here's an example tcpdump from the dom0 host when the problem is
occurring:

# tcpdump -vpni bond0 'host 2a00:801:0:11::2'
[...]
23:59:00.672905 IP6 (hlim 55, next-header TCP (6) payload length: 4316) 2a00:801:0:11::2.80 > 2001:db8:1f1:f240::2.35241: Flags [P.], cksum 0x62d3 (incorrect - 0x1c84), seq 15709:19993, ack 127, win 9, options [nop,nop,TS val 1771553020 ecr 1086205224], length 4284
23:59:00.672987 IP6 (hlim 64, next-header ICMPv6 (58) payload length: 1240) 2001:db8:0:1f1::8 > 2a00:801:0:11::2: [icmp6 sum ok] ICMP6, packet too big, length 1240, mtu 1500
23:59:00.673161 IP6 (hlim 63, next-header TCP (6) payload length: 32) 2001:db8:1f1:f240::2.35241 > 2a00:801:0:11::2.80: Flags [.], cksum 0x24e4 (correct), ack 17137, win 716, options [nop,nop,TS val 1086205237 ecr 1771553020], length 0
23:59:00.725659 IP6 (hlim 55, next-header TCP (6) payload length: 1460) 2a00:801:0:11::2.80 > 2001:db8:1f1:f240::2.35241: Flags [.], cksum 0x16de (correct), seq 19993:21421, ack 127, win 9, options [nop,nop,TS val 1771553033 ecr 1086205237], length 1428
23:59:00.725940 IP6 (hlim 63, next-header TCP (6) payload length: 44) 2001:db8:1f1:f240::2.35241 > 2a00:801:0:11::2.80: Flags [.], cksum 0x25f5 (correct), ack 17137, win 716, options [nop,nop,TS val 1086205250 ecr 1771553020,nop,nop,sack 1 {19993:21421}], length 0
[...]
23:59:01.188463 IP6 (hlim 63, next-header TCP (6) payload length: 32) 2001:db8:1f1:f240::2.35241 > 2a00:801:0:11::2.80: Flags [.], cksum 0x0105 (correct), ack 25705, win 1073, options [nop,nop,TS val 1086205366 ecr 1771553149], length 0
23:59:01.240946 IP6 (hlim 55, next-header TCP (6) payload length: 2888) 2a00:801:0:11::2.80 > 2001:db8:1f1:f240::2.35241: Flags [P.], cksum 0x5d3f (incorrect - 0xf9ef), seq 25705:28561, ack 127, win 9, options [nop,nop,TS val 1771553162 ecr 1086205366], length 2856
23:59:01.241040 IP6 (hlim 64, next-header ICMPv6 (58) payload length: 1240) 2001:db8:0:1f1::8 > 2a00:801:0:11::2: [icmp6 sum ok] ICMP6, packet too big, length 1240, mtu 1500

2a00:801:0:11::2 is speedtest.tele2.net which helpfully hosts files
like http://speedtest.tele2.net/100MB.zip for testing purposes. The
above is the result of me using wget to download that file from a
domU on this server. The domU is at 2001:db8:1f1:f240::2 and the
dom0 is at 2001:db8:0:1f1::8.
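
A capture like this can also be scanned mechanically rather than by eye.
The following is only a sketch in Python, not something that was used for
this report: the function names split_records and suspicious are made up
for illustration, and the default threshold of 1428 bytes is an assumption
taken from the largest segment length seen in the capture's normal packets.
It re-joins wrapped tcpdump lines into one record per packet, then flags
TCP segments longer than the threshold and ICMPv6 "packet too big" messages.

```python
import re

# A record starts with a tcpdump timestamp such as "23:59:00.672905 ".
RECORD_START = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+ ", re.MULTILINE)
# Matches the trailing "length N" of a TCP record, but not "payload length: N".
SEGMENT_LENGTH = re.compile(r"\blength (\d+)")


def split_records(text):
    """Split `tcpdump -v` text into one whitespace-normalised string per packet."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [" ".join(text[s:e].split()) for s, e in zip(starts, ends)]


def suspicious(text, mss=1428):
    """Return (timestamp, reason) pairs for packets worth a closer look.

    `mss` is an assumed threshold, not something read from the capture.
    """
    hits = []
    for record in split_records(text):
        stamp = record.split(" ", 1)[0]
        if "packet too big" in record:
            hits.append((stamp, "packet too big"))
            continue
        match = SEGMENT_LENGTH.search(record)
        if match and int(match.group(1)) > mss:
            hits.append((stamp, "oversized segment, length " + match.group(1)))
    return hits
```

Feeding it the saved capture text returns a list of timestamps with a
short reason for each, which makes it easier to compare hosts.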

What I'm noticing is the occasional incorrect checksum and ICMPv6
packet too big messages seen above around 23:59:00.672905 and
23:59:01.240946 after a packet of length 2856.  These do not occur
on the server with the e1000e driver, where all the packets top out
at 1428. They always occur on the two servers with the igb driver
where the poor throughput is observed.

This also does not occur when the IPv6 traffic is sourced on the dom0.
It's only when it's coming from a Xen domU.
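
The sizes involved can be checked with a little arithmetic. This is a
sketch under stated assumptions: a 40-byte IPv6 header with no extension
headers, and a 32-byte TCP header (20 bytes plus the 12 bytes of
nop,nop,TS options visible in the capture). The names mss_for and
describe are illustrative only.

```python
MTU = 1500            # from the ICMPv6 "packet too big" messages in the capture
IPV6_HEADER = 40      # assumption: fixed IPv6 header, no extension headers
TCP_HEADER = 20 + 12  # assumption: base TCP header plus nop,nop,timestamp options


def mss_for(mtu=MTU):
    """Largest TCP data length that fits in a single packet of size `mtu`."""
    return mtu - IPV6_HEADER - TCP_HEADER


def describe(tcp_length, mtu=MTU):
    """Relate a TCP data length from tcpdump to the MTU."""
    full_segments, remainder = divmod(tcp_length, mss_for(mtu))
    packet_size = IPV6_HEADER + TCP_HEADER + tcp_length
    return {
        "full_segments": full_segments,                   # how many MSS-sized pieces
        "remainder": remainder,                           # leftover bytes, if any
        "ipv6_payload_length": TCP_HEADER + tcp_length,   # tcpdump's "payload length"
        "exceeds_mtu": packet_size > mtu,
    }


if __name__ == "__main__":
    for length in (1428, 2856, 4284):
        print(length, describe(length))
```

With an MTU of 1500 this gives 1428 bytes of TCP data per packet, which
is the size the normal packets top out at. The 2856- and 4284-byte
segments are exactly two and three times that, which would be consistent
with several received segments having been merged into one packet before
being forwarded towards the domU. The capture alone does not show which
layer did the merging.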

I'm wondering if I am hitting something like this:

http://amailbox.org/mailarchive/linux-kvm/2010/2/2/6257539/thread

I have played with disabling and enabling GSO and checksums on every
interface I can (using ethtool), both in dom0 and domUs, and that makes
no difference.
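
For reference, this kind of toggling can be scripted. A minimal sketch,
assuming a Python 3 interpreter and an ethtool binary on the host; the
helper names offload_commands and apply_commands, and the interface name
vif1.0 in the usage note, are hypothetical, and which features a given
driver allows you to change varies. By default it only prints the
commands instead of running them.

```python
import subprocess

# Offload keywords accepted by `ethtool -K`; not every driver supports all of them.
FEATURES = ("rx", "tx", "sg", "tso", "gso", "gro")


def offload_commands(interfaces, state="off", features=FEATURES):
    """Build one `ethtool -K` argument list per interface."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    commands = []
    for iface in interfaces:
        command = ["ethtool", "-K", iface]
        for feature in features:
            command += [feature, state]
        commands.append(command)
    return commands


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=False: a driver may refuse some features; keep going.
            subprocess.run(command, check=False)
```

For example, apply_commands(offload_commands(["bond0", "vif1.0"], "off"))
prints one ethtool -K line per interface. The resulting settings can be
inspected afterwards with ethtool -k on each interface.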

Looking at linux-source-2.6.32 on squeeze, it does not have this
patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8e1e8a4779cb23c1d9f51e9223795e07ec54d77a

although I notice that this commit also touches e1000e where I am
not currently having any problems, even without this commit.

(severity: important seemed correct because this renders IPv6
effectively unusable for large transfers to/from domUs)

-- Package-specific info:
** Version:
Linux version 2.6.32-5-xen-amd64 (Debian 2.6.32-34squeeze1) (da...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu May 19 01:16:47 UTC 2011

** Command line:
placeholder root=UUID=2586c7d6-dea0-43db-a809-3a131bbfb128 ro console=tty0 console=hvc0

** Not tainted

** Kernel log:

** Model information
sys_vendor: Supermicro
product_name: X8DTN+-F
product_version: 1234567890
chassis_vendor: Supermicro
chassis_version: 1234567890
bios_vendor: American Megatrends Inc.
bios_version: 080016 
board_vendor: Supermicro
board_name: X8DTN+-F
board_version: 1234567890

** Loaded modules:
Module  Size  Used by
dm_snapshot  

2011-12-12 Thread Andy Smith
On Tue, Dec 13, 2011 at 05:23:23AM +, Andy Smith wrote:
> What I'm noticing is the occasional incorrect checksum and ICMPv6
> packet too big messages seen above around 23:59:00.672905 and
> 23:59:01.240946 after a packet of length 2856.  These do not occur
> on the server with the e1000e driver, where all the packets top out
> at 1428. They always occur on the two servers with the igb driver
> where the poor throughput is observed.

Ah, could this be a duplicate of #630730? That was fixed in
2.6.32-39 while the machine in question is running 2.6.32-34.

#630730 seems to suggest that the bug would also have affected
e1000e, though. Another machine running 2.6.32-34 and using the
e1000e driver is fine.

Thanks,
Andy


