Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
On Mon, Nov 20, 2017 at 3:35 PM, Sarah Newman  wrote:
> On 11/20/2017 02:56 PM, Alexander Duyck wrote:
>> On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman  
>> wrote:
>>> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>>>> Hi Sarah,
>>>>
>>>> I am adding the netdev mailing list as I am not certain this is an
>>>> i350 specific issue. The traces themselves aren't anything I recognize
>>>> as an existing issue. From what I can tell it looks like you are
>>>> running Xen, so would I be correct in assuming you are bridging
>>>> between VMs? If so, are you using any sort of tunnels on your network,
>>>> and if so, what type? This information would be useful as we may be looking
>>>> at a bug in a tunnel offload for GRO.
>>>
>>> Yes, there's bridging. The traffic on the physical device is tagged with 
>>> VLANs and the bridges use untagged traffic. There are no tunnels. I do not
>>> own the VMs' traffic.
>>>
>>> Because I have only seen this on a single server with unique hardware, I 
>>> think it's most likely related to the hardware or to a particular VM on that
>>> server.
>>
>> So I would suspect traffic coming from the VM if anything. The i350 is
>> a pretty common device. If we were seeing issues specific to it I would
>> expect we would have more reports than just the one so far.
>
> My confusion was primarily related to the release notes for an older version 
> of a different Intel driver.
>
> But regarding traffic coming from a VM, the backtraces both include igb_poll. 
> Doesn't that mean the problem is related to inbound traffic on the igb
> device and not traffic directly from a local VM?
>
> --Sarah

All the igb driver is doing is taking the data off of the network,
populating sk_buff structures, and then handing them off to the stack.
The format of the sk_buffs has been pretty consistent for the last
several years, so I am not really suspecting a driver issue.

The issue with network traffic is that it is usually symmetric, meaning
if the VM sends something it will get some sort of reply. The actual
traffic itself and how the kernel handles it have changed quite a bit
over the years, and a VM could be setting up a tunnel, or a stack of
VLANs, or some other type of traffic that the kernel might have
recognized and tried to do GRO for but didn't fully support. If
turning off GRO solves the problem then the issue is likely in the GRO
code, not in the igb driver.
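
As a rough sketch of that test (eth0 below is just a placeholder for
whichever igb port carries the traffic, and the commands need root):

$ ethtool -k eth0 | grep generic-receive-offload   # confirm GRO is currently on
$ ethtool -K eth0 gro off                          # disable GRO for the test
$ ethtool -K eth0 gro on                           # re-enable afterwards (the setting is not persistent across reboots)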

- Alex


Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Sarah Newman
On 11/20/2017 02:56 PM, Alexander Duyck wrote:
> On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman  
> wrote:
>> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>>> Hi Sarah,
>>>
>>> I am adding the netdev mailing list as I am not certain this is an
>>> i350 specific issue. The traces themselves aren't anything I recognize
>>> as an existing issue. From what I can tell it looks like you are
>>> running Xen, so would I be correct in assuming you are bridging
>>> between VMs? If so, are you using any sort of tunnels on your network,
>>> and if so, what type? This information would be useful as we may be looking
>>> at a bug in a tunnel offload for GRO.
>>
>> Yes, there's bridging. The traffic on the physical device is tagged with 
>> VLANs and the bridges use untagged traffic. There are no tunnels. I do not
>> own the VMs' traffic.
>>
>> Because I have only seen this on a single server with unique hardware, I 
>> think it's most likely related to the hardware or to a particular VM on that
>> server.
> 
> So I would suspect traffic coming from the VM if anything. The i350 is
> a pretty common device. If we were seeing issues specific to it I would
> expect we would have more reports than just the one so far.

My confusion was primarily related to the release notes for an older version of 
a different Intel driver.

But regarding traffic coming from a VM, the backtraces both include igb_poll. 
Doesn't that mean the problem is related to inbound traffic on the igb
device and not traffic directly from a local VM?

--Sarah


Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman  wrote:
> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>> Hi Sarah,
>>
>> I am adding the netdev mailing list as I am not certain this is an
>> i350 specific issue. The traces themselves aren't anything I recognize
>> as an existing issue. From what I can tell it looks like you are
>> running Xen, so would I be correct in assuming you are bridging
>> between VMs? If so, are you using any sort of tunnels on your network,
>> and if so, what type? This information would be useful as we may be looking
>> at a bug in a tunnel offload for GRO.
>
> Yes, there's bridging. The traffic on the physical device is tagged with 
> VLANs and the bridges use untagged traffic. There are no tunnels. I do not
> own the VMs' traffic.
>
> Because I have only seen this on a single server with unique hardware, I 
> think it's most likely related to the hardware or to a particular VM on that
> server.

So I would suspect traffic coming from the VM if anything. The i350 is
a pretty common device. If we were seeing issues specific to it I
would expect we would have more reports than just the one so far.

>>
>> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  
>> wrote:
>>> Hi,
>>>
>>> I have an X10 Supermicro with two I350s that has crashed twice now under
>>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>>
>> What was the last kernel you tested before v4.9.39? Just wondering as
>> it will help to rule out certain patches as possibly being the issue.
>
> 4.9.31.
>
> If the problem is related to a particular VM, then I don't think the last 
> known good kernel is necessarily pertinent, as the problematic traffic could
> have started at any time.
>
>>> I see in the release notes 
>>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
>>> Routing Packets."
>>>
>>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>>
>>> Is it possible there are problems with GRO for bridging in the igb driver 
>>> now? If I disable GRO can I have some confidence it will fix the issue?
>>
>> As far as LRO not being used when routing goes, just so you know LRO and
>> GRO are two very different things. One of the issues with LRO is that
>> it wasn't reversible in some cases and so could lead to packets
>> being changed if they were rerouted. With GRO that shouldn't be the
>> case, as we should be able to get back out the original packets that
>> were put into a frame. So there shouldn't be any issues using GRO with
>> bridging or routing.
>
> In some very old release notes for the ixgbe driver,
> https://downloadmirror.intel.com/22919/eng/README.txt, it said to disable GRO
> for bridging/routing, and it wasn't clear that this was not specific to that
> driver. I didn't originally notice how old the release notes were or that
> the notice had been removed in newer versions; I apologize.
>
>>> First crash:
>>>
>>> [4083386.299221] [ cut here ]
>>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
>>> inet_gro_complete+0xbb/0xd0
>>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>>> ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gnt
>>> alloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 
>>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
>>> async_raid6_recov async_pq
>>>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev 
>>> shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core 
>>> mlx4_core mpt3sas
>>>  scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>>> 2.0a 09/16/2016
>>> [4083386.301109]  880306603d90 813f5935  
>>> 
>>> [4083386.301221]  880306603dd0 810a7e01 05c18174578a 
>>> 8802f94a9a00
>>> [4083386.301333]  8802f0824450  0040 
>>> 0040
>>> [4083386.301445] Call Trace:
>>> [4083386.301483]   [4083386.301519]   dump_stack+0x63/0x8e
>>> [4083386.301596]   __warn+0xd1/0xf0
>>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.302349]   net_rx_action+0x158/0x360
>>> [4083386.302430]   

Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Sarah Newman
On 11/20/2017 08:36 AM, Alexander Duyck wrote:
> Hi Sarah,
> 
> I am adding the netdev mailing list as I am not certain this is an
> i350 specific issue. The traces themselves aren't anything I recognize
> as an existing issue. From what I can tell it looks like you are
> running Xen, so would I be correct in assuming you are bridging
> between VMs? If so, are you using any sort of tunnels on your network,
> and if so, what type? This information would be useful as we may be looking
> at a bug in a tunnel offload for GRO.

Yes, there's bridging. The traffic on the physical device is tagged with VLANs
and the bridges use untagged traffic. There are no tunnels. I do not
own the VMs' traffic.

Because I have only seen this on a single server with unique hardware, I think 
it's most likely related to the hardware or to a particular VM on that
server.

> 
> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  
> wrote:
>> Hi,
>>
>> I have an X10 Supermicro with two I350s that has crashed twice now under
>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
> 
> What was the last kernel you tested before v4.9.39? Just wondering as
> it will help to rule out certain patches as possibly being the issue.

4.9.31.

If the problem is related to a particular VM, then I don't think the last known 
good kernel is necessarily pertinent, as the problematic traffic could
have started at any time.

>> I see in the release notes 
>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
>> Routing Packets."
>>
>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>
>> Is it possible there are problems with GRO for bridging in the igb driver 
>> now? If I disable GRO can I have some confidence it will fix the issue?
> 
> As far as LRO not being used when routing goes, just so you know LRO and
> GRO are two very different things. One of the issues with LRO is that
> it wasn't reversible in some cases and so could lead to packets
> being changed if they were rerouted. With GRO that shouldn't be the
> case, as we should be able to get back out the original packets that
> were put into a frame. So there shouldn't be any issues using GRO with
> bridging or routing.

In some very old release notes for the ixgbe driver,
https://downloadmirror.intel.com/22919/eng/README.txt, it said to disable GRO
for bridging/routing, and it wasn't clear that this was not specific to that
driver. I didn't originally notice how old the release notes were or that
the notice had been removed in newer versions; I apologize.

>> First crash:
>>
>> [4083386.299221] [ cut here ]
>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
>> inet_gro_complete+0xbb/0xd0
>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gnt
>> alloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 
>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
>> async_raid6_recov async_pq
>>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp 
>> i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core 
>> mlx4_core mpt3sas
>>  scsi_transport_sas raid_class wmi ast ttm
>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>> 2.0a 09/16/2016
>> [4083386.301109]  880306603d90 813f5935  
>> 
>> [4083386.301221]  880306603dd0 810a7e01 05c18174578a 
>> 8802f94a9a00
>> [4083386.301333]  8802f0824450  0040 
>> 0040
>> [4083386.301445] Call Trace:
>> [4083386.301483]   [4083386.301519]   dump_stack+0x63/0x8e
>> [4083386.301596]   __warn+0xd1/0xf0
>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.302349]   net_rx_action+0x158/0x360
>> [4083386.302430]   __do_softirq+0xd1/0x283
>> [4083386.302507]   irq_exit+0xe9/0x100
>> [4083386.302580]   xen_evtchn_do_upcall+0x35/0x50
>> [4083386.302665]   xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.302754]   [4083386.302787]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302876]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302965]   ? xen_safe_halt+0x10/0x20
>> [4083386.303043]   ? 

Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
Hi Sarah,

I am adding the netdev mailing list as I am not certain this is an
i350 specific issue. The traces themselves aren't anything I recognize
as an existing issue. From what I can tell it looks like you are
running Xen, so would I be correct in assuming you are bridging
between VMs? If so, are you using any sort of tunnels on your network,
and if so, what type? This information would be useful as we may be looking
at a bug in a tunnel offload for GRO.

On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  wrote:
> Hi,
>
> I have an X10 Supermicro with two I350s that has crashed twice now under
> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:

What was the last kernel you tested before v4.9.39? Just wondering as
it will help to rule out certain patches as possibly being the issue.
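
One way to narrow that down, assuming a clone of the stable kernel tree with
the relevant tags (4.9.31 is the last kernel reported as good elsewhere in
this thread), is to list what changed in the igb driver and the GRO-related
paths between the two versions:

$ git log --oneline v4.9.31..v4.9.39 -- \
      drivers/net/ethernet/intel/igb net/core/dev.c net/ipv4/af_inet.c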

> $ /sbin/lspci | grep -i ethernet
> 02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
>
> And some X9 Supermicros that have not crashed, with a single I350 I believe:
> $ /sbin/lspci | grep -i ethernet
> 06:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 06:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
> 06:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network 
> Connection (rev 01)
>
> I see in the release notes 
> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
> Routing Packets."
>
> We are bridging traffic, not routing, and the crashes are in the GRO code.
>
> Is it possible there are problems with GRO for bridging in the igb driver 
> now? If I disable GRO can I have some confidence it will fix the issue?

As far as LRO not being used when routing goes, just so you know LRO and
GRO are two very different things. One of the issues with LRO is that
it wasn't reversible in some cases and so could lead to packets
being changed if they were rerouted. With GRO that shouldn't be the
case, as we should be able to get back out the original packets that
were put into a frame. So there shouldn't be any issues using GRO with
bridging or routing.

GRO isn't in the driver. It is in the network stack of the kernel
itself. The only responsibility of igb is to provide the frames in the
correct format so that they can be assembled by GRO if it is enabled.
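
As an illustration, the two offloads show up as separate per-device flags in
ethtool (eth0 is assumed here; the values correspond to the feature list
quoted below, where LRO is off and fixed while GRO is on):

$ ethtool -k eth0 | grep receive-offload
generic-receive-offload: on
large-receive-offload: off [fixed]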

> Here are my offload settings:
> Features for eth0:
> rx-checksumming: on
> tx-checksumming: on
> tx-checksum-ipv4: off [fixed]
> tx-checksum-ip-generic: on
> tx-checksum-ipv6: off [fixed]
> tx-checksum-fcoe-crc: off [fixed]
> tx-checksum-sctp: on
> scatter-gather: on
> tx-scatter-gather: on
> tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
> tx-tcp-segmentation: on
> tx-tcp-ecn-segmentation: off [fixed]
> tx-tcp-mangleid-segmentation: off
> tx-tcp6-segmentation: on
> udp-fragmentation-offload: off [fixed]
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off [fixed]
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> tx-gre-segmentation: on
> tx-gre-csum-segmentation: on
> tx-ipxip4-segmentation: on
> tx-ipxip6-segmentation: on
> tx-udp_tnl-segmentation: on
> tx-udp_tnl-csum-segmentation: on
> tx-gso-partial: on
> tx-sctp-segmentation: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: off
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off [fixed]
> busy-poll: off [fixed]
> hw-tc-offload: off [fixed]
>
> First crash:
>
> [4083386.299221] [ cut here ]
> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
> inet_gro_complete+0xbb/0xd0
> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
> ip6table_filter
> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gnt
> alloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark 
> ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
> async_raid6_recov async_pq
>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c