Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Hi all,

I backported the MPS commits and asked the customer to pass pci=pcie_bus_peer2peer to the kernel to limit MPS to 128 bytes, and the issue disappeared. It sounds like this is a BIOS bug.

Thanks for all of your help.

Best Regards,
Joe

--
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing

___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Hi Yijing,

Thanks for the reference. The patch looks good to me, but I have no chance to test it in the customer's environment.

Best Regards,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Hi Joe,

I found a similar problem when doing PCI hotplug; the discussion is here:
http://marc.info/?l=linux-pci&m=134810569924220&w=2

We are trying to improve the Linux kernel so that this problem is easier to debug, based on Bjorn's suggestion. Jon sent out the first version of the patch:
http://marc.info/?l=linux-pci&m=135002016005274&w=2

I think we can go further here:
http://marc.info/?l=linux-pci&m=135115581307869&w=2

I hope this information helps.

Thanks!
Yijing
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Someone else pointed this out to me locally. If you have a non-client BIOS, you should be able to set the MaxPayloadSize using setpci. You have to make sure that you're being consistent throughout all the associated links.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/28/12 02:10, Ben Hutchings wrote:
> [...] You can use the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy that overrides this, but no policy will allow setting MPS above the device's MaxPayloadSizeSupported (MPSS).

Ben,

Unfortunately I'm using a 3.0.x kernel, and this parameter is not included in that kernel. So I'm trying to use ethtool to modify it in the EEPROM to see whether that helps.

Todd,

I'll review MaxPayload for all devices. But if there is a mismatch, the customer cannot change it from the BIOS because there is no entry for it there. To test this, we have to find a way to verify whether it is the root cause, so I still need to find the offset in the EEPROM.

Thanks in advance,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Joe,

Possibly your customer is running a kernel without source code on a platform whose vendor is unwilling to fix the BIOS issue (is that an HP/Dell server?).

Anyway, to see whether it is a payload issue or not, you could change the payload size on those devices with the setpci tool and set the link retrain bit to trigger link retraining, in order to debug the issue and identify the root cause. I think that is much easier than modifying the BIOS or the NIC's EEPROM.

e.g.

Set the device control register to 0f 00 (128 bytes payload size):

# setpci -v -s 00:02.0 98.w=000f

Set the device link control register to 60h (retrain the link):

# setpci -v -s 00:02.0 a0.b=60

Hope it works. Just my 2 cents.

ethan.z...@oracle.com
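Before writing a raw value such as 000f to the Device Control register, it can help to see which fields that value touches. The following is a minimal sketch, assuming the standard PCIe Device Control layout (Max_Payload_Size in bits 7:5, Max_Read_Request_Size in bits 14:12, sizes encoded as 128 << code). The helper names and the value 0x581F are illustrative and not taken from the thread. Note also that offsets such as 98 and a0 depend on where the PCI Express capability sits in a given device's configuration space, so they will differ between devices.

```python
# Field positions in the PCIe Device Control register (assumed standard layout).
MPS_SHIFT, MPS_MASK = 5, 0x7      # bits 7:5, Max_Payload_Size
MRRS_SHIFT, MRRS_MASK = 12, 0x7   # bits 14:12, Max_Read_Request_Size


def decode_size(code: int) -> int:
    """Translate a 3-bit size encoding into bytes (0 -> 128, 1 -> 256, ...)."""
    return 128 << code


def decode_devctl(value: int) -> dict:
    """Return the payload-related fields of a Device Control value."""
    return {
        "mps_bytes": decode_size((value >> MPS_SHIFT) & MPS_MASK),
        "mrrs_bytes": decode_size((value >> MRRS_SHIFT) & MRRS_MASK),
    }


def with_mps(value: int, mps_bytes: int) -> int:
    """Return `value` with only the Max_Payload_Size field changed."""
    code = (mps_bytes // 128).bit_length() - 1   # 128 -> 0, 256 -> 1, ...
    if not 0 <= code <= 5 or mps_bytes != 128 << code:
        raise ValueError("MPS must be a power of two between 128 and 4096")
    return (value & ~(MPS_MASK << MPS_SHIFT)) | (code << MPS_SHIFT)


if __name__ == "__main__":
    # 0x000F selects a 128-byte payload, but it also sets the read request
    # size to 128 bytes because bits 14:12 are written as zero.
    print(decode_devctl(0x000F))
    # A read-modify-write keeps the other fields (hypothetical start value).
    print(hex(with_mps(0x583F, 128)))
```

Under these assumptions, a read-modify-write of only bits 7:5 is the more conservative way to test a smaller payload size, since a fixed constant overwrites every other field in the register.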
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Forgive me if I'm being too repetitious as I think some of this has been mentioned in the past.

We (and by we I mean the Ethernet part and driver) can only change the advertised availability of a larger MaxPayloadSize. The size is negotiated by both sides of the link when the link is established. The driver should not change the size of the link as it would be poking at registers outside of its scope and is controlled by the upstream bridge (not us).

You also need to check all the PCIe links to get to the device. There can be several to get from the root complex, through bridges, to the endpoint Ethernet controller. The Ethernet part and driver have no control over any other links. You'll have to talk to the motherboard manufacturer about those links.

Your original problem appears to be hangs, and Tushar asked you to check the entire path of PCIe connections from the root complex to the endpoint. Any mismatches in payload can cause hangs and I believe you have had the problem in the past.

I'm sure you remember all the lspci commands to list the tree view and to dump all the details from each of the links, and I would suggest you do that to check that the payload sizes match. What I do is "lspci -tvvv" to see what's connected, then "lspci -s xx:xx.x -vvv" to check the devices on the link.

Thanks.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565

-----Original Message-----
From: Mary Mcgrath [mailto:mary.mcgr...@oracle.com]
Sent: Monday, November 26, 2012 6:07 PM
To: Joe Jin
Cc: net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-ker...@vger.kernel.org
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe,

Thank you for working this. I would love to find out how they expect a customer to make the modification to word 0x1A, check whether the 8th bit is 0 or 1, and change it to 0. I have in turn asked the customer for the lspci output on eth3; maybe the incorrect setting is upstream.

Again, thank you.

Regards,
Mary
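The check described above (walk every link from the root complex to the endpoint and compare payload sizes) can be sketched in a few lines. This is an illustration only: the `Port` record, the device labels, and the numbers are hypothetical, and the values would in practice be read by hand from the DevCap and DevCtl lines of "lspci -vvv" for each device on the path.

```python
from typing import List, NamedTuple


class Port(NamedTuple):
    name: str   # hypothetical label, e.g. a bus address
    mpss: int   # DevCap MaxPayload (supported), in bytes
    mps: int    # DevCtl MaxPayload (configured), in bytes


def find_mismatches(path: List[Port]) -> List[str]:
    """Report payload problems on a path ordered from root to endpoint."""
    problems = []
    # A device must never be configured above what it advertises.
    for dev in path:
        if dev.mps > dev.mpss:
            problems.append(
                f"{dev.name}: configured {dev.mps} exceeds supported {dev.mpss}")
    # Following the rule of thumb in this thread, flag any link whose
    # two ends are configured with different payload sizes.
    for upstream, downstream in zip(path, path[1:]):
        if upstream.mps != downstream.mps:
            problems.append(
                f"{upstream.name} ({upstream.mps}) != "
                f"{downstream.name} ({downstream.mps})")
    return problems


if __name__ == "__main__":
    # Made-up example path: two bridges at 256 bytes, endpoint at 128 bytes.
    path = [
        Port("root port", 256, 256),
        Port("switch port", 256, 256),
        Port("82571EB", 256, 128),
    ]
    for line in find_mismatches(path):
        print(line)
```

A configured value below the supported value is not reported, since that alone is normal; only a difference between the two ends of a link, or a setting above the device's capability, is flagged.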
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Thanks for the clarification. I was just going by the PCIe spec, which says the lowest value of both ends is used, and I figured SOMETHING had to be looking at that and doing some sort of negotiation. I'm no BIOS guy, so I'm not sure what's actually going on, whether something walks the PCIe tree or if the BIOS just sets all the values to the minimum.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Tue, 2012-11-27 at 17:32, Fujinaka, Todd wrote:
> Forgive me if I'm being too repetitious as I think some of this has been mentioned in the past.
>
> We (and by we I mean the Ethernet part and driver) can only change the advertised availability of a larger MaxPayloadSize. The size is negotiated by both sides of the link when the link is established. The driver should not change the size of the link as it would be poking at registers outside of its scope and is controlled by the upstream bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed by the system firmware (at least for devices present at boot - the kernel may be responsible in case of hotplug). You can use the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy that overrides this, but no policy will allow setting MPS above the device's MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in Documentation/kernel-parameters.txt! Someone ought to fix that.)

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
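To make the effect of the MPS policies concrete, here is a simplified model of a single root-to-endpoint path. It is a sketch based on the short descriptions of the pcie_bus_tune_off, pcie_bus_safe, pcie_bus_perf and pcie_bus_peer2peer options in the kernel's parameter documentation, not a copy of the kernel's logic; the function name and list-based layout are assumptions, and MaxReadReq handling is left out entirely.

```python
from typing import List


def apply_policy(policy: str, mpss_path: List[int], bios_mps: List[int]) -> List[int]:
    """Return the MPS each device ends up with, ordered from root to endpoint.

    mpss_path: MaxPayloadSizeSupported of each device on the path, in bytes.
    bios_mps:  the values left behind by the firmware, in bytes.
    """
    if policy == "pcie_bus_tune_off":
        # Leave the firmware settings alone.
        return list(bios_mps)
    if policy == "pcie_bus_peer2peer":
        # 128 bytes is the one size every device must support.
        return [128] * len(mpss_path)
    if policy == "pcie_bus_safe":
        # Largest value that every device on the path supports.
        return [min(mpss_path)] * len(mpss_path)
    if policy == "pcie_bus_perf":
        # Each device gets the largest size allowed by itself and its parent.
        result, limit = [], None
        for mpss in mpss_path:
            limit = mpss if limit is None else min(limit, mpss)
            result.append(limit)
        return result
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    mpss = [512, 256, 256]   # hypothetical capabilities, root first
    bios = [256, 256, 128]   # hypothetical firmware settings
    for name in ("pcie_bus_tune_off", "pcie_bus_safe",
                 "pcie_bus_perf", "pcie_bus_peer2peer"):
        print(name, apply_policy(name, mpss, bios))
```

In this model none of the tuning policies produces a value above a device's MPSS, which is the constraint stated above, and only the tune-off case preserves a mismatch left by the firmware.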
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Tue, 20 Nov 2012, Joe Jin wrote:
> On 11/20/12 16:59, Dave, Tushar N wrote:
>> Have you powered off the system completely after modifying the EEPROM? If not, please do so.
>
> Hi Tushar,
>
> That does not seem to work for me. Would you please help check what is wrong with my steps?
...
> # lspci -s :52:00.1 -vvv
> 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> --snip--
>   Capabilities: [e0] Express (v1) Endpoint, MSI 00
>     DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
>       ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>     DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>       RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>       MaxPayload 128 bytes, MaxReadReq 4096 bytes
>       ^
> --snip--

If you look at the previous section, DevCap, you'll see that it's correctly advertising 256 bytes but the system is negotiating 128 for the link to the Ethernet controller. Things on the other side of the link are controlled outside of the e1000 driver.

Tushar's first suggestion was to check the PCIe payload settings in the entire chain. Have you done that? Mismatches will cause hangs.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565
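Checking the payload settings "in the entire chain" means comparing the DevCtl MaxPayload value of every device, not only the NIC. The following is a rough sketch of one way to pull those values out of `lspci -vv` text. It assumes the usual lspci layout, where a device header starts in column 0 and the DevCtl MaxPayload value may sit on a continuation line; `mps_report` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print "<device> DevCap=<bytes> DevCtl=<bytes>" for every PCIe
# function found in `lspci -vv` text on stdin. mps_report is hypothetical.
mps_report() {
    awk '
        # A new device header starts in column 0 with a hex digit
        /^[0-9a-fA-F]/ { emit(); dev = $1; cap = ""; ctl = ""; sect = "" }
        /DevCap:/      { sect = "cap" }
        /DevCtl:/      { sect = "ctl" }
        /DevSta:/      { sect = "" }
        {
            # MaxPayload appears in both DevCap and DevCtl; sect tells which
            for (i = 1; i < NF; i++)
                if ($i == "MaxPayload") {
                    if (sect == "cap") cap = $(i + 1)
                    if (sect == "ctl") ctl = $(i + 1)
                }
        }
        END { emit() }
        function emit() {
            if (dev != "" && ctl != "")
                printf "%s DevCap=%s DevCtl=%s\n", dev, cap, ctl
        }
    '
}
```

A typical invocation would be `lspci -vv | mps_report`, run as root so that the capability registers are readable. Devices on the same path whose DevCtl values differ are the ones worth a closer look.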
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/27/12 00:23, Fujinaka, Todd wrote:
> If you look at the previous section, DevCap, you'll see that it's correctly advertising 256 bytes but the system is negotiating 128 for the link to the Ethernet controller. Things on the other side of the link are controlled outside of the e1000 driver.
>
> Tushar's first suggestion was to check the PCIe payload settings in the entire chain. Have you done that? Mismatches will cause hangs.

Hi Todd,

I still need to find out how to modify the max payload size. The BIOS has no entry for changing it, so I have to use ethtool, and for that I need the offset of MaxPayload size in the EEPROM. I tried to find it in Intel's online documentation but failed. Any ideas?

Thanks in advance,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Joe,

Thank you for working on this. I would love to find out how they expect a customer to make the modification to word 0x1A: check whether bit 8 is 0 or 1, and change it to 0.

I have in turn asked the customer for the lspci output for eth3; maybe the incorrect setting is upstream.

Again, thank you.

Regards,
Mary

> -----Original Message-----
> From: Joe Jin
> Sent: Monday, November 26, 2012 8:00 PM
> To: Fujinaka, Todd
> Cc: Dave, Tushar N; net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-ker...@vger.kernel.org; Mary Mcgrath
> Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
>
> On 11/27/12 00:23, Fujinaka, Todd wrote:
>> If you look at the previous section, DevCap, you'll see that it's correctly advertising 256 bytes but the system is negotiating 128 for the link to the Ethernet controller. Things on the other side of the link are controlled outside of the e1000 driver.
>>
>> Tushar's first suggestion was to check the PCIe payload settings in the entire chain. Have you done that? Mismatches will cause hangs.
>
> Hi Todd,
>
> I still need to find out how to modify the max payload size. The BIOS has no entry for changing it, so I have to use ethtool, and for that I need the offset of MaxPayload size in the EEPROM. I tried to find it in Intel's online documentation but failed. Any ideas?
>
> Thanks in advance,
> Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Sunday, November 18, 2012 9:38 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 11/16/12 04:26, Dave, Tushar N wrote:
>>> Would you please help to find the offset of max payload size in the EEPROM? I'd like to try modifying it with ethtool.
>>
>> It is defined using bit 8 of word 0x1A. Bit value 0 = 128B, bit value 1 = 256B.
>
> Hi Tushar,
>
> I checked one of my servers whose Max Payload Size is 128:
>
> # lspci -vvv -s 52:00.1
> 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
>   Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
>   Latency: 0, Cache Line Size: 64 bytes
>   Interrupt: pin B routed to IRQ 266
>   Region 0: Memory at dfea (32-bit, non-prefetchable) [size=128K]
>   Region 1: Memory at dfe8 (32-bit, non-prefetchable) [size=128K]
>   Region 2: I/O ports at 6020 [size=32]
>   [virtual] Expansion ROM at d812 [disabled] [size=128K]
>   Capabilities: [c8] Power Management version 2
>     Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
>     Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>   Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Address: fee0 Data: 409a
>   Capabilities: [e0] Express (v1) Endpoint, MSI 00
>     DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
>       ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>     DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>       RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>       MaxPayload 128 bytes, MaxReadReq 4096 bytes
>     DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
>     LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
>       ClockPM- Surprise- LLActRep- BwNot-
>     LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>     LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>   Capabilities: [100 v1] Advanced Error Reporting
>     UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>     UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>     UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
>     CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>     CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
>     AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
>   Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-86
>   Kernel driver in use: e1000e
>   Kernel modules: e1000e
>
> And the EEPROM dump is below:
>
> Offset    Values
> ------    ------
> 0x        00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
> 0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
> 0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
> 0x0030    f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
> 0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
> 0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x0060    00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff
>
> If I understand correctly, the value at offset 0x1a is 0x07a6, so bit 8 is 1, yet my NIC's MPS is 128 bytes. Am I getting something wrong?

Have you powered off the system completely after modifying the EEPROM? If not, please do so.

-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/20/12 16:59, Dave, Tushar N wrote:
> Have you powered off the system completely after modifying the EEPROM? If not, please do so.

Hi Tushar,

That does not seem to work for me. Would you please help check what is wrong with my steps?

Original EEPROM dump:

# ethtool -e eth3 | head -8
Offset    Values
------    ------
0x        00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030    f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
                      ^
0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s :52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
--snip--
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
      ^
--snip--

# ethtool eth3
Settings for eth3:
  Supported ports: [ TP ]
  Supported link modes: 10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
  Supports auto-negotiation: Yes
  Advertised link modes: 10baseT/Half 10baseT/Full
                         100baseT/Half 100baseT/Full
                         1000baseT/Full
  Advertised pause frame use: No
  Advertised auto-negotiation: Yes
  Speed: 1000Mb/s
  Duplex: Full
  Port: Twisted Pair
  PHYAD: 1
  Transceiver: internal
  Auto-negotiation: on
  MDI-X: off
  Supports Wake-on: d
  Wake-on: d
  Current message level: 0x0007 (7)
  Link detected: yes

# ethtool -E eth3 magic 0x10a48086 offset 0x34 value 0xa7
# ethtool -e eth3 | head -8
Offset    Values
------    ------
0x        00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030    f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
                      ^ == a6 -- a7
0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset    Values
------    ------
0x        00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030    f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s :52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
--snip--
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
      ^
    DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
      ClockPM- Surprise- LLActRep- BwNot-
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
      ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
--snip--

# ethtool -E eth3 magic 0x10a48086 offset 0x35 value 0x17
# ethtool -e eth3 | head -8
Offset    Values
------    ------
0x        00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030    f6 6c b0 37 a6 17 03 84 83 07 00 00 03 c3 02 06
                         ^== 07 - 17
0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset    Values
------    ------
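It can help to check which bits of word 0x1A the two single-byte writes in this message actually change. The sketch below assumes the `ethtool -e` dump is byte-addressed and little-endian, so byte 0x34 is the low half and byte 0x35 the high half of word 0x1A (the same reading that gives 0x07a6 elsewhere in the thread). `changed_bits` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the bit positions (0-15) that differ between two 16-bit
# EEPROM word values. changed_bits is a hypothetical helper.
changed_bits() {
    # $1: old word, $2: new word (hex constants such as 0x07a6 are fine)
    diff=$(( $1 ^ $2 ))
    bit=0
    out=""
    while [ "$bit" -lt 16 ]; do
        if [ $(( (diff >> bit) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$bit"
        fi
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

changed_bits 0x07a6 0x07a7   # byte 0x34: a6 -> a7, prints 0
changed_bits 0x07a6 0x17a6   # byte 0x35: 07 -> 17, prints 12
```

Under that assumption the first write flips bit 0 and the second flips bit 12, while bit 8 stays set in every dump shown. This is only arithmetic on the posted values, not a statement about what the hardware does with those bits.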
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/16/12 04:26, Dave, Tushar N wrote:
>> Would you please help to find the offset of max payload size in the EEPROM? I'd like to try modifying it with ethtool.
>
> It is defined using bit 8 of word 0x1A. Bit value 0 = 128B, bit value 1 = 256B.

Hi Tushar,

I checked one of my servers whose Max Payload Size is 128:

# lspci -vvv -s 52:00.1
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
  Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
  Latency: 0, Cache Line Size: 64 bytes
  Interrupt: pin B routed to IRQ 266
  Region 0: Memory at dfea (32-bit, non-prefetchable) [size=128K]
  Region 1: Memory at dfe8 (32-bit, non-prefetchable) [size=128K]
  Region 2: I/O ports at 6020 [size=32]
  [virtual] Expansion ROM at d812 [disabled] [size=128K]
  Capabilities: [c8] Power Management version 2
    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
  Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Address: fee0 Data: 409a
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
    DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
      ClockPM- Surprise- LLActRep- BwNot-
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
      ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
  Capabilities: [100 v1] Advanced Error Reporting
    UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
    CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
    AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
  Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-86
  Kernel driver in use: e1000e
  Kernel modules: e1000e

And the EEPROM dump is below:

Offset    Values
------    ------
0x        00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010    57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020    08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030    f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
0x0040    08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0060    00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff

If I understand correctly, the value at offset 0x1a is 0x07a6, so bit 8 is 1, yet my NIC's MPS is 128 bytes. Am I getting something wrong?

Thanks,
Joe
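The reading of word 0x1A in this message can be reproduced with shell arithmetic. This is a minimal sketch that assumes the `ethtool -e` dump is byte-addressed and little-endian, so word N occupies bytes 2*N (low) and 2*N+1 (high); `eeprom_word_bit` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: extract one bit of a 16-bit EEPROM word from the two bytes shown
# in an `ethtool -e` dump. eeprom_word_bit is a hypothetical helper.
eeprom_word_bit() {
    # $1: low byte, $2: high byte, $3: bit number (0-15)
    word=$(( ($2 << 8) | $1 ))
    echo $(( (word >> $3) & 1 ))
}

# Word 0x1A starts at byte offset 0x34; the dump shows "a6 07" there.
printf 'byte offset: 0x%x\n' $(( 0x1A * 2 ))              # byte offset: 0x34
printf 'word 0x1A:   0x%04x\n' $(( (0x07 << 8) | 0xa6 ))  # word 0x1A:   0x07a6
printf 'bit 8:       %s\n' "$(eeprom_word_bit 0xa6 0x07 8)"  # bit 8:       1
```

Under that assumption the arithmetic agrees with the message: bit 8 is already 1 (the 256-byte setting, per Tushar's description), while DevCtl reports 128 bytes.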
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/09/12 04:35, Dave, Tushar N wrote:
> All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang. Can you double check this?

Hi Tushar,

I checked with the hardware vendor and they said there is no way to modify the max payload size from the BIOS. Can I modify it from the driver side?

Thanks,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 2012-11-09 04:35, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
>> Sent: Wednesday, November 07, 2012 10:25 PM
>> To: e1000-de...@lists.sf.net
>> Cc: net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
>> Subject: 82571EB: Detected Hardware Unit Hang
>>
>> Hi list,
>>
>> I have a customer who reported an 82571EB "Detected Hardware Unit Hang" on an HP ProLiant DL360 G6; the server has to be rebooted to recover:
>>
>> e1000e :06:00.1: eth3: Detected Hardware Unit Hang:
>>   TDH                  1a
>>   TDT                  1a
>>   next_to_use          1a
>>   next_to_clean        18
>> buffer_info[next_to_clean]:
>>   time_stamp           10047a74e
>>   next_to_watch        18
>>   jiffies              10047a88c
>>   next_to_watch.status 1
>> MAC Status             80383
>> PHY Status             792d
>> PHY 1000BASE-T Status  3800
>> PHY Extended Status    3000
>> PCI Status             10
>>
>> With newer kernel 2.0.0.1 the issue is still reproducible.
>>
>> Device info:
>> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
>> 06:00.1 0200: 8086:10bc (rev 06)
>>
>> I compared the lspci output before and after the issue; the difference is below:
>>
>>  06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
>>    Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter
>>    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
>> -  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
>> +  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx+
>
> Are you sure this is not a similar issue to the one you reported before? I.e.:
>
>> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2: just copying a big file (500M) from another server will hit it at once.
>
> All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang. Can you double check this?

We also saw this kind of hang on the 82599EB (ixgbe driver) with the RHEL 6.3 kernel. We tried upgrading to the latest version (3.8.21 or 3.10.17), but it still happens. Could it also be due to a wrong max payload size set in the BIOS?

Thanks,
Yu

> -Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Tuesday, November 13, 2012 6:48 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 11/09/12 04:35, Dave, Tushar N wrote:
>> All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang. Can you double check this?
>
> Hi Tushar,
>
> I checked with the hardware vendor and they said there is no way to modify the max payload size from the BIOS. Can I modify it from the driver side?

If you want to change the value for the 82571 device you can do it from the EEPROM, but for the other upstream devices I am not sure. I will check with my team.

-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
> Sent: Wednesday, November 07, 2012 10:25 PM
> To: e1000-de...@lists.sf.net
> Cc: net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
> Subject: 82571EB: Detected Hardware Unit Hang
>
> Hi list,
>
> I have a customer who reported an 82571EB "Detected Hardware Unit Hang" on an HP ProLiant DL360 G6; the server has to be rebooted to recover:
>
> e1000e :06:00.1: eth3: Detected Hardware Unit Hang:
>   TDH                  1a
>   TDT                  1a
>   next_to_use          1a
>   next_to_clean        18
> buffer_info[next_to_clean]:
>   time_stamp           10047a74e
>   next_to_watch        18
>   jiffies              10047a88c
>   next_to_watch.status 1
> MAC Status             80383
> PHY Status             792d
> PHY 1000BASE-T Status  3800
> PHY Extended Status    3000
> PCI Status             10
>
> With newer kernel 2.0.0.1 the issue is still reproducible.
>
> Device info:
> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
> 06:00.1 0200: 8086:10bc (rev 06)
>
> I compared the lspci output before and after the issue; the difference is below:
>
>  06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
>    Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter
>    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
> -  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
> +  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx+

Are you sure this is not a similar issue to the one you reported before? I.e.:

> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2: just copying a big file (500M) from another server will hit it at once.

All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang. Can you double check this?

-Tushar
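To compare payload sizes along the path from the root complex to the NIC, the set of devices on that path is needed first. One possible way, sketched here, is to take the device's resolved sysfs path and keep only the components that look like PCI addresses. `pci_chain` is a hypothetical helper name and the path in the example is made up for illustration.

```shell
#!/bin/sh
# Sketch: list every PCI function on the path from the root complex down
# to a device, given the device's resolved sysfs path. pci_chain is a
# hypothetical helper; it only does string processing.
pci_chain() {
    # $1: e.g. the output of `readlink -f /sys/bus/pci/devices/<address>`
    echo "$1" | tr '/' '\n' |
        grep -E '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
}

# Illustrative path only; a real one comes from readlink on the target host.
pci_chain /sys/devices/pci0000:00/0000:00:1c.0/0000:06:00.1
```

Each address printed can then be passed to `lspci -vv -s <address>` to read its DevCtl MaxPayload value and compare it with the others on the path.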
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 11/09/12 04:35, Dave, Tushar N wrote:
> Are you sure this is not a similar issue to the one you reported before? I.e.:

Tushar,

Thanks for your quick response. I'll check with the customer whether they can modify the max payload size from the BIOS; this time the issue hit an HP server.

Thanks again,
Joe

>> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2: just copying a big file (500M) from another server will hit it at once.
>
> All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang. Can you double check this?

--
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
This is the output:

~$ sudo ethtool -S eth1 | grep tx_timeout_count
     tx_timeout_count: 0
~$

I will try the new driver, but this is a production server. I don't have any actual problems with the NIC, but I do keep seeing the hardware hang message pop up in the logs. When I can take the server down for routine maintenance I will get the new driver in and report back.

Thank you all for the help.

--Andrew

On Fri, Aug 24, 2012 at 2:39 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
> You are right that the driver only dumps the HW ring if the adapter resets. However, in the case of a *true* tx hang, the driver should hit tx_timeout, which resets the adapter and, if msglvl is set correctly, dumps the HW ring. If you're not seeing tx_timeout I believe it's a false tx hang. Check with 'ethtool -S ethx | grep tx_timeout_count'.
>
> -Tushar
>
> PS: I would suggest trying the latest e1000e driver.
>
> From: Andrew Peng [mailto:peng...@gmail.com]
> Sent: Friday, August 24, 2012 10:29 AM
> To: Dave, Tushar N
> Cc: e1000-devel@lists.sourceforge.net
> Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
>
> Hi, in regards to the ring dump, this is the response I received from the Debian kernel team:
>
> **
> The ring dump is only shown in case the driver resets the chip, and it doesn't do that in the case of Hardware Unit Hang. So I think whichever developer told you this was confused.
> **
>
> I haven't gotten to using the new driver, but when I do I'll report back.
>
> --Andrew
>
> On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
> In that case, you can use our e1000e outbox driver from Sourceforge (which should have the patches mentioned by Flavio).
>
> -Tushar
>
> -----Original Message-----
> From: Flavio Leitner [mailto:f...@redhat.com]
> Sent: Thursday, July 19, 2012 6:39 PM
> To: Andrew Peng
> Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
> Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
>
> On Thu, 19 Jul 2012 20:17:14 -0500 Andrew Peng peng...@gmail.com wrote:
>> Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel.
>
> Well, I don't have the Debian kernel sources handy to check, but based on the version 2.6.32-5-amd64, it sounds like you don't have it. I pointed to that patch because your card supports the write-back feature and TDT and TDH are close to each other, less than 4, which is a signature of the bug fixed by the first patch.
>
> fbl
>
>> Tushar; I've double checked that the message level is set correctly:
>>
>>   Current message level: 0x2c01 (11265)
>>   Link detected: yes
>>
>> However, I just checked all of the logs on the server and I do not see a HW ring dump.
>>
>> Thanks all again for the help
>> --Andrew
>>
>> On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
>> Andrew, I don't think the current message level is set correctly. Have you run 'ethtool -s ethx msglvl 0x2c01'? I don't see a HW ring dump in the log. Please confirm that msglvl is set correctly by running 'ethtool ethx'.
>>
>> -Tushar
>>
>> -----Original Message-----
>> From: Andrew Peng [mailto:peng...@gmail.com]
>> Sent: Thursday, July 19, 2012 4:42 PM
>> To: Dave, Tushar N
>> Cc: e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
>>
>> Attached is the dmesg output. Please let me know if this looks right. There are two instances of the error here:
>>
>> [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
>> [361106.726604]   TDH                  c5
>> [361106.726606]   TDT                  c7
>> [361106.726607]   next_to_use          c7
>> [361106.726608]   next_to_clean        c5
>> [361106.726609] buffer_info[next_to_clean]:
>> [361106.726610]   time_stamp           105605cd5
>> [361106.726611]   next_to_watch        c5
>> [361106.726612]   jiffies              105605e51
>> [361106.726614]   next_to_watch.status 0
>> [361106.726615] MAC Status             80383
>> [361106.726616] PHY Status             792d
>> [361106.726617] PHY 1000BASE-T Status  3800
>> [361106.726618] PHY Extended Status    3000
>> [361106.726619] PCI Status             10
>> [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
>> [411932.038651]   TDH                  3d
>> [411932.038652]   TDT                  3f
>> [411932.038653]   next_to_use          3f
>> [411932.038654]   next_to_clean        3d
>> [411932.038655] buffer_info[next_to_clean]:
>> [411932.038657]   time_stamp           106223f55
>> [411932.038658]   next_to_watch        3d
>> [411932.038659]   jiffies              106224069
>> [411932.038660]   next_to_watch.status 0
>> [411932.038661] MAC Status             80383
>> [411932.038662] PHY Status             792d
>> [411932.038663] PHY 1000BASE-T Status  3800
>> [411932.038664] PHY Extended Status    3000
>> [411932.038665] PCI Status             10
>> [422584.120473
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Andrew, There is no tx_timeout. So, as I mentioned in my previous email, this is a false hang. If the issue persists with the latest driver, let me know and I will look into it. -Tushar From: Andrew Peng [mailto:peng...@gmail.com] Sent: Wednesday, August 29, 2012 11:41 AM To: Dave, Tushar N Cc: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang This is the output: ~$ sudo ethtool -S eth1 | grep tx_timeout_count tx_timeout_count: 0 ~$ I will try the new driver, but this is a production server. I don't have any actual problems with the NIC, but I do keep seeing the hardware hang message pop up in the logs. When I can take the server down for routine maintenance I will get the new driver in and report back. Thank you all for the help. --Andrew On Fri, Aug 24, 2012 at 2:39 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: You are right that the driver only dumps the HW ring if the adapter resets. However, in case of a *true* tx hang, the driver should hit tx_timeout, which will reset the adapter, and if msglvl is set correctly it will dump the HW ring. If you're not seeing tx_timeout I believe it's a false tx hang. Check with 'ethtool -S ethx | grep tx_timeout_count' -Tushar PS: I would suggest trying the latest e1000e driver From: Andrew Peng [mailto:peng...@gmail.com] Sent: Friday, August 24, 2012 10:29 AM To: Dave, Tushar N Cc: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Hi, in regards to the ring dump, this is the response I received from the Debian kernel team: ** The ring dump is only shown in case the driver resets the chip, and it doesn't do that in the case of Hardware Unit Hang. So I think whichever developer told you this was confused. ** I haven't gotten to using the new driver, but when I do I'll report back.
--Andrew On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: In that case, you can use our e1000e outbox driver from Sourceforge (which should have patches mentioned by Flavio). -Tushar -Original Message- From: Flavio Leitner [mailto:f...@redhat.com] Sent: Thursday, July 19, 2012 6:39 PM To: Andrew Peng Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang On Thu, 19 Jul 2012 20:17:14 -0500 Andrew Peng peng...@gmail.com wrote: Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel. Well, I don't have the Debian kernel sources handy to check, but based on the version 2.6.32-5-amd64, it sounds like you don't have them. I pointed to that patch because your card supports the write-back feature and TDT and TDH are close to each other, less than 4, which is a signature of the bug fixed by the first patch. fbl Tushar; I've double checked that the message level is set correctly: Current message level: 0x2c01 (11265) Link detected: yes However, I just checked all of the logs on the server and I do not see a HW ring dump. Thanks all again for help --Andrew On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: Andrew, I don't think the current message level is set correctly. Have you run 'ethtool -s ethx msglvl 0x2c01'? I don't see a HW ring dump in the log. Please confirm that msglvl is set correctly by running 'ethtool ethx' -Tushar -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Thursday, July 19, 2012 4:42 PM To: Dave, Tushar N Cc: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Attached is the dmesg output. Please let me know if this looks right.
There are two instances of the error here: [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [361106.726604] TDH c5 [361106.726606] TDT c7 [361106.726607] next_to_use c7 [361106.726608] next_to_cleanc5 [361106.726609] buffer_info[next_to_clean]: [361106.726610] time_stamp 105605cd5 [361106.726611] next_to_watchc5 [361106.726612] jiffies 105605e51 [361106.726614] next_to_watch.status 0 [361106.726615] MAC Status 80383 [361106.726616] PHY Status 792d [361106.726617] PHY 1000BASE-T Status 3800 [361106.726618] PHY Extended Status3000 [361106.726619] PCI Status 10 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [411932.038651] TDH 3d [411932.038652] TDT 3f [411932.038653] next_to_use 3f
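Editor's sketch: the true-versus-false hang check described in this message can be scripted. The snippet below is a minimal illustration in Python, assuming the text of `ethtool -S <iface>` has already been captured into a string; the function names `tx_timeout_count` and `classify_hang` are hypothetical helpers, not part of ethtool or the e1000e driver.

```python
import re
from typing import Optional


def tx_timeout_count(ethtool_stats: str) -> Optional[int]:
    """Return the tx_timeout_count value from `ethtool -S` text, or None if absent."""
    match = re.search(r"^\s*tx_timeout_count:\s*(\d+)\s*$", ethtool_stats, re.MULTILINE)
    return int(match.group(1)) if match else None


def classify_hang(ethtool_stats: str) -> str:
    """Apply the rule of thumb from the thread: no tx_timeout suggests a false hang."""
    count = tx_timeout_count(ethtool_stats)
    if count is None:
        return "unknown"  # the counter was not present in the captured text
    return "tx_timeout seen" if count > 0 else "no tx_timeout"


# Counter line as posted in the thread.
print(classify_hang("     tx_timeout_count: 0\n"))
```

A result of "no tx_timeout" only matches the "false hang" reading given in the thread; it does not by itself rule out other causes.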
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
29.08.2012 6:29, Dave, Tushar N wrote: Thanks for the info. For both 82571 and 80003ES2LAN, I see UnsuppReq+ and UncorrErr+ in lspci (DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend+) Have you tried disabling tso (ethtool -K tso off)? Yes, this doesn't help. Was this working okay before with old driver or old kernel? At least on 3.3.6 I don't see these warning messages in syslog. Regards, Nikolay -- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ ___ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
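Editor's sketch: the DevSta line quoted above uses the `lspci -vvv` convention of a trailing `+` for a set bit and `-` for a clear bit. A small Python helper (the name `parse_devsta` is hypothetical) can pull out the set flags so that UncorrErr and UnsuppReq stand out when comparing several devices:

```python
from typing import Dict


def parse_devsta(line: str) -> Dict[str, bool]:
    """Map each DevSta flag name to True ('+') or False ('-')."""
    _, _, flags = line.partition("DevSta:")
    result = {}
    for token in flags.split():
        if token[-1] in "+-":
            result[token[:-1]] = token.endswith("+")
    return result


# DevSta line as quoted in the thread.
devsta = "DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend+"
flags = parse_devsta(devsta)
print([name for name, is_set in flags.items() if is_set])
```

This only reformats what lspci already printed; it does not read PCI configuration space itself.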
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
-Original Message- From: Nikolay Popov [mailto:niko...@popoff.net.ua] Sent: Tuesday, August 28, 2012 9:00 PM To: Dave, Tushar N Cc: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang 29.08.2012 6:29, Dave, Tushar N wrote: Have you tried disabling tso (ethtool -K tso off)? I also tried recompiling the driver with DISABLE_PM, disabling gro and other offload types, booting the kernel with acpi_aspm=off, increasing ring buffers to 4096, playing around with flow control - nothing helped. Okay, thanks for the info. I will check the changes that went into the e1000e driver since 3.3.6 then. Also, would you please run 'ethtool -s ethx msglvl 0x2c01' so that the next time a tx hang occurs it will log hw desc ring info. Send me the full dmesg log once the issue occurs. -Tushar
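Editor's sketch: the value 0x2c01 passed to `ethtool -s ethx msglvl` is a bitmask. The Python snippet below decodes it using the NETIF_MSG_* bit assignments as the editor recalls them from the kernel's include/linux/netdevice.h; treat the table as an assumption and check it against the kernel source in use. `decode_msglvl` is a hypothetical helper name.

```python
from typing import List

# Bit values as recalled from NETIF_MSG_* in include/linux/netdevice.h (assumption).
NETIF_MSG_BITS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(value: int) -> List[str]:
    """Return the names of the message-level bits set in value, lowest bit first."""
    return [name for bit, name in NETIF_MSG_BITS.items() if value & bit]


print(decode_msglvl(0x2C01))
```

Under that table, 0x2c01 (decimal 11265, matching the "Current message level" line quoted elsewhere in the thread) sets the drv, tx_done, rx_status and hw bits.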
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Hi, Dave! Ok, I have set msglevel as you requested, let's wait for some logs. Also, about versions - we are using 1.11.3-NAPI on both 3.3.6 and 3.5.2 hosts. We were forced to do that because with the default kernel driver (at least 2.0.0 on 3.5.2) we see some mysterious drops and delays (~1-2%, and delays up to 2000ms) that appear once every few minutes. Downgrading the driver to 1.11.3-NAPI solves this issue (which we'll discuss in a separate topic, I suppose), but with this driver version we're running into the TX hang trouble we're trying to track down now. I can't test whether this problem appears in 2.x.x driver versions because the hosts are in production and such delays/losses aren't acceptable at all. Regards, Nikolay
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
-Original Message- From: Nikolay Popov [mailto:niko...@popoff.net.ua] Sent: Saturday, August 25, 2012 1:29 AM To: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Hi, All It seems that I'm getting same problems with 3.5.2 kernel - 80003ES2LAN onboard NIC is going to reset from time to time under load Aug 25 10:27:53 bras2 kernel: [134612.808590] e1000e :05:00.0: eth2: Detected Hardware Unit Hang: Aug 25 10:27:53 bras2 kernel: [134612.808590] TDH cd Aug 25 10:27:53 bras2 kernel: [134612.808590] TDT b9 Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_use b9 Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_clean cc Aug 25 10:27:53 bras2 kernel: [134612.808590] buffer_info[next_to_clean]: Aug 25 10:27:53 bras2 kernel: [134612.808590] time_stamp 1020057ff Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_watch cf Aug 25 10:27:53 bras2 kernel: [134612.808590] jiffies 102005cda Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_watch.status 0 Aug 25 10:27:53 bras2 kernel: [134612.808590] MAC Status 2080783 Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Status 792d Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY 1000BASE-T Status 7800 Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Extended Status 3000 Aug 25 10:27:53 bras2 kernel: [134612.808590] PCI Status 10 Aug 25 10:27:55 bras2 kernel: [134614.816086] e1000e :05:00.0: eth2: Reset adapter Aug 25 10:27:58 bras2 kernel: [134617.654599] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx Please send full dmesg log and 'ethtool -S ethx' output after issue occurs. 
-Tushar root@bras2:~# ethtool -i eth2 driver: e1000e version: 1.11.3-NAPI firmware-version: 1.0-0 bus-info: :05:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: no root@bras2:~# lspci | grep 05:00.0 05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01) Mainboard: Intel S5000PAL I had to fall back to driver version 1.11.3-NAPI because with the in-kernel driver 2.0.0 (and also with 2.0.0.1 from sf.net) there were a lot of random packet drops and latency spikes, so 1.11.3 is more acceptable for production. During a reset traffic stops, iowait increases up to 100%, then the link flaps and everything returns to normal until the next reset, which could happen in 1 hour or in 1 day. I also noticed that the resets don't correlate with traffic load. It can happen even when the NIC is almost idle, transferring ~30-40 Mbps. Is there anything we can do to fix this issue? Regards, Nikolay
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Hi, All It seems that I'm getting same problems with 3.5.2 kernel - 80003ES2LAN onboard NIC is going to reset from time to time under load Aug 25 10:27:53 bras2 kernel: [134612.808590] e1000e :05:00.0: eth2: Detected Hardware Unit Hang: Aug 25 10:27:53 bras2 kernel: [134612.808590] TDH cd Aug 25 10:27:53 bras2 kernel: [134612.808590] TDT b9 Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_use b9 Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_clean cc Aug 25 10:27:53 bras2 kernel: [134612.808590] buffer_info[next_to_clean]: Aug 25 10:27:53 bras2 kernel: [134612.808590] time_stamp 1020057ff Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_watch cf Aug 25 10:27:53 bras2 kernel: [134612.808590] jiffies 102005cda Aug 25 10:27:53 bras2 kernel: [134612.808590] next_to_watch.status 0 Aug 25 10:27:53 bras2 kernel: [134612.808590] MAC Status 2080783 Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Status 792d Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY 1000BASE-T Status 7800 Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Extended Status 3000 Aug 25 10:27:53 bras2 kernel: [134612.808590] PCI Status 10 Aug 25 10:27:55 bras2 kernel: [134614.816086] e1000e :05:00.0: eth2: Reset adapter Aug 25 10:27:58 bras2 kernel: [134617.654599] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx root@bras2:~# ethtool -i eth2 driver: e1000e version: 1.11.3-NAPI firmware-version: 1.0-0 bus-info: :05:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: no root@bras2:~# lspci | grep 05:00.0 05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01) Mainboard: Intel S5000PAL I used to fall back to 1.11.3-NAPI driver version because with kernel 2.0.0 (and also with 2.0.0.1 from sf.net) there were a lot of random packet drops and latency spikes, so 1.11.3 is more acceptable to production. 
During a reset traffic stops, iowait increases up to 100%, then the link flaps and everything returns to normal until the next reset, which could happen in 1 hour or in 1 day. I also noticed that the resets don't correlate with traffic load. It can happen even when the NIC is almost idle, transferring ~30-40 Mbps. Is there anything we can do to fix this issue? Regards, Nikolay
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Hi, in regards to the ring dump, this is the response I received from the Debian kernel team: ** The ring dump is only shown in case the driver resets the chip, and it doesn't do that in the case of Hardware Unit Hang. So I think whichever developer told you this was confused. ** I haven't gotten to using the new driver, but when I do I'll report back. --Andrew On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: In that case, you can use our e1000e outbox driver from Sourceforge (which should have patches mentioned by Flavio). -Tushar -Original Message- From: Flavio Leitner [mailto:f...@redhat.com] Sent: Thursday, July 19, 2012 6:39 PM To: Andrew Peng Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang On Thu, 19 Jul 2012 20:17:14 -0500 Andrew Peng peng...@gmail.com wrote: Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel. Well, I don't have the Debian kernel sources handy to check, but based on the version 2.6.32-5-amd64, it sounds like you don't have them. I pointed to that patch because your card supports the write-back feature and TDT and TDH are close to each other, less than 4, which is a signature of the bug fixed by the first patch. fbl Tushar; I've double checked that the message level is set correctly: Current message level: 0x2c01 (11265) Link detected: yes However, I just checked all of the logs on the server and I do not see a HW ring dump. Thanks all again for help --Andrew On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: Andrew, I don't think the current message level is set correctly. Have you run 'ethtool -s ethx msglvl 0x2c01'? I don't see a HW ring dump in the log.
Please confirm that msglvl is set correctly by running 'ethtool ethx' -Tushar -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Thursday, July 19, 2012 4:42 PM To: Dave, Tushar N Cc: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Attached is the dmesg output. Please let me know if this looks right. There are two instances of the error here: [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [361106.726604] TDH c5 [361106.726606] TDT c7 [361106.726607] next_to_use c7 [361106.726608] next_to_cleanc5 [361106.726609] buffer_info[next_to_clean]: [361106.726610] time_stamp 105605cd5 [361106.726611] next_to_watchc5 [361106.726612] jiffies 105605e51 [361106.726614] next_to_watch.status 0 [361106.726615] MAC Status 80383 [361106.726616] PHY Status 792d [361106.726617] PHY 1000BASE-T Status 3800 [361106.726618] PHY Extended Status3000 [361106.726619] PCI Status 10 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [411932.038651] TDH 3d [411932.038652] TDT 3f [411932.038653] next_to_use 3f [411932.038654] next_to_clean3d [411932.038655] buffer_info[next_to_clean]: [411932.038657] time_stamp 106223f55 [411932.038658] next_to_watch3d [411932.038659] jiffies 106224069 [411932.038660] next_to_watch.status 0 [411932.038661] MAC Status 80383 [411932.038662] PHY Status 792d [411932.038663] PHY 1000BASE-T Status 3800 [411932.038664] PHY Extended Status3000 [411932.038665] PCI Status 10 [422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [422584.120475] TDH 15 [422584.120477] TDT 16 [422584.120478] next_to_use 16 [422584.120479] next_to_clean15 [422584.120480] buffer_info[next_to_clean]: [422584.120481] time_stamp 1064ae19c [422584.120483] next_to_watch15 [422584.120484] jiffies 1064ae2d6 [422584.120485] next_to_watch.status 0 [422584.120486] MAC Status 80383 [422584.120487] PHY Status 792d [422584.120488] PHY 1000BASE-T Status 3800 [422584.120489] PHY Extended 
Status3000 [422584.120491] PCI Status 10 Thank you again for all the help --Andrew On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N tushar.n.d...@intel.com wrote: We can find the reason now. Please enable TSO back. Then run ethtool -s ethx msglvl 0x2c01. This will enable debug code that logs HW ring data (into dmesg log) when Tx hang occurs. When issue occur next time please send me the full dmesg log. -Tushar -Original Message- From: Andrew Peng
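Editor's sketch: Flavio's heuristic quoted in this message (TDT and TDH "close to each other, less than 4") can be checked mechanically against the hang dumps. The Python below is an illustration only. It assumes the dump text is available as a string, that TDH and TDT are printed in hex as in the logs above, and that the TX ring has 256 descriptors; the ring size is the editor's assumption, adjustable through the hypothetical `ring_size` parameter, and the function names are likewise hypothetical.

```python
import re


def ring_gap(dump: str, ring_size: int = 256) -> int:
    """Descriptors between hardware head (TDH) and software tail (TDT), with wraparound."""
    fields = dict(re.findall(r"\b(TDH|TDT)\s+<?([0-9a-fA-F]+)>?", dump))
    tdh = int(fields["TDH"], 16)
    tdt = int(fields["TDT"], 16)
    return (tdt - tdh) % ring_size


def matches_signature(dump: str, ring_size: int = 256) -> bool:
    """True when a non-empty gap is smaller than 4, the pattern described in the thread."""
    return 0 < ring_gap(dump, ring_size) < 4


# First hang instance from the dmesg output in this thread.
first = "[361106.726604] TDH c5 [361106.726606] TDT c7"
print(ring_gap(first), matches_signature(first))
```

With the three instances from the 82571EB host (c5/c7, 3d/3f, 15/16) the gap is 2, 2 and 1. The 80003ES2LAN dump elsewhere in the thread (TDH cd, TDT b9) gives a much larger gap under the same ring-size assumption, so the heuristic would not flag it.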
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Attached is the dmesg output. Please let me know if this looks right. There are two instances of the error here: [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [361106.726604] TDH c5 [361106.726606] TDT c7 [361106.726607] next_to_use c7 [361106.726608] next_to_cleanc5 [361106.726609] buffer_info[next_to_clean]: [361106.726610] time_stamp 105605cd5 [361106.726611] next_to_watchc5 [361106.726612] jiffies 105605e51 [361106.726614] next_to_watch.status 0 [361106.726615] MAC Status 80383 [361106.726616] PHY Status 792d [361106.726617] PHY 1000BASE-T Status 3800 [361106.726618] PHY Extended Status3000 [361106.726619] PCI Status 10 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [411932.038651] TDH 3d [411932.038652] TDT 3f [411932.038653] next_to_use 3f [411932.038654] next_to_clean3d [411932.038655] buffer_info[next_to_clean]: [411932.038657] time_stamp 106223f55 [411932.038658] next_to_watch3d [411932.038659] jiffies 106224069 [411932.038660] next_to_watch.status 0 [411932.038661] MAC Status 80383 [411932.038662] PHY Status 792d [411932.038663] PHY 1000BASE-T Status 3800 [411932.038664] PHY Extended Status3000 [411932.038665] PCI Status 10 [422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [422584.120475] TDH 15 [422584.120477] TDT 16 [422584.120478] next_to_use 16 [422584.120479] next_to_clean15 [422584.120480] buffer_info[next_to_clean]: [422584.120481] time_stamp 1064ae19c [422584.120483] next_to_watch15 [422584.120484] jiffies 1064ae2d6 [422584.120485] next_to_watch.status 0 [422584.120486] MAC Status 80383 [422584.120487] PHY Status 792d [422584.120488] PHY 1000BASE-T Status 3800 [422584.120489] PHY Extended Status3000 [422584.120491] PCI Status 10 Thank you again for all the help --Andrew On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N tushar.n.d...@intel.com wrote: We can find the reason now. Please enable TSO back. Then run ethtool -s ethx msglvl 0x2c01. 
This will enable debug code that logs HW ring data (into dmesg log) when Tx hang occurs. When issue occur next time please send me the full dmesg log. -Tushar -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Wednesday, July 18, 2012 6:24 AM To: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Thus far disabling TSO via ethtool has seemed to work - can anyone explain the technical reason why this appears to have fixed the issue? --Andrew On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote: Sorry folks, but I just realized that I hadn't been replying to the list properly and instead I was mistakenly emailing Dave directly. I'm consolidating and re-sending the information to the list. BIOS on the HP N40L does not specify any options for AER or PCIe error management, or packet size (referenced in another thread) I have also tried to disable PCIe power management to no success. I did see one options in the BIOS relating to ACPI functionality, and referencing a document that Dave sent me saying the AER kernel driver may not be loaded if certain ACPI modules are loaded, I will disable this and check for errors. I don't have convenient physical access to the server so this will take a few days. I am attaching the dmesg and lspci -vvv (as root) output to this message. Thanks for all the help folks. --Andrew On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Wednesday, July 11, 2012 8:50 AM To: e1000-devel@lists.sourceforge.net Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang Folks, I've been getting some strange error messages in my home server / router that I've been having trouble debugging. I'm decently proficient in Linux, but I fear I'm in over my head with this one. 
The hardware is a HP N40L Microserver - here are the hardware details - http://n40l.wikia.com/wiki/Base_Hardware I am running Debian Squeeze 6.0: pengc99@gaia:/$ sudo uname -a Linux gaia 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux I also subscribe to Ksplice's Uptrack system but since I have the newest kernel installed (as released by Debian) there have been no hot-patches yet. This is the message I've been getting in /var/log/kern.log: Jul 11 08:55:38 gaia kernel: [402056.009687] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: Jul 11 08:55:38 gaia kernel: [402056.009690] TDH fc Jul 11 08:55:38 gaia kernel: [402056.009692
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Those messages reminds me this bug: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=09357b00255c233705b1cf6d76a8d147340545b8 and then: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf03085f85112eac2d19036ea3003071220285bb Can you check if you have those patches applied? fbl On Thu, 19 Jul 2012 18:42:16 -0500 Andrew Peng peng...@gmail.com wrote: Attached is the dmesg output. Please let me know if this looks right. There are two instances of the error here: [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [361106.726604] TDH c5 [361106.726606] TDT c7 [361106.726607] next_to_use c7 [361106.726608] next_to_cleanc5 [361106.726609] buffer_info[next_to_clean]: [361106.726610] time_stamp 105605cd5 [361106.726611] next_to_watchc5 [361106.726612] jiffies 105605e51 [361106.726614] next_to_watch.status 0 [361106.726615] MAC Status 80383 [361106.726616] PHY Status 792d [361106.726617] PHY 1000BASE-T Status 3800 [361106.726618] PHY Extended Status3000 [361106.726619] PCI Status 10 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [411932.038651] TDH 3d [411932.038652] TDT 3f [411932.038653] next_to_use 3f [411932.038654] next_to_clean3d [411932.038655] buffer_info[next_to_clean]: [411932.038657] time_stamp 106223f55 [411932.038658] next_to_watch3d [411932.038659] jiffies 106224069 [411932.038660] next_to_watch.status 0 [411932.038661] MAC Status 80383 [411932.038662] PHY Status 792d [411932.038663] PHY 1000BASE-T Status 3800 [411932.038664] PHY Extended Status3000 [411932.038665] PCI Status 10 [422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang: [422584.120475] TDH 15 [422584.120477] TDT 16 [422584.120478] next_to_use 16 [422584.120479] next_to_clean15 [422584.120480] buffer_info[next_to_clean]: [422584.120481] time_stamp 1064ae19c [422584.120483] next_to_watch15 [422584.120484] jiffies 1064ae2d6 [422584.120485] next_to_watch.status 0 [422584.120486] MAC Status 
80383 [422584.120487] PHY Status 792d [422584.120488] PHY 1000BASE-T Status 3800 [422584.120489] PHY Extended Status3000 [422584.120491] PCI Status 10 Thank you again for all the help --Andrew On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N tushar.n.d...@intel.com wrote: We can find the reason now. Please enable TSO back. Then run ethtool -s ethx msglvl 0x2c01. This will enable debug code that logs HW ring data (into dmesg log) when Tx hang occurs. When issue occur next time please send me the full dmesg log. -Tushar -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Wednesday, July 18, 2012 6:24 AM To: e1000-devel@lists.sourceforge.net Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang Thus far disabling TSO via ethtool has seemed to work - can anyone explain the technical reason why this appears to have fixed the issue? --Andrew On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote: Sorry folks, but I just realized that I hadn't been replying to the list properly and instead I was mistakenly emailing Dave directly. I'm consolidating and re-sending the information to the list. BIOS on the HP N40L does not specify any options for AER or PCIe error management, or packet size (referenced in another thread) I have also tried to disable PCIe power management to no success. I did see one options in the BIOS relating to ACPI functionality, and referencing a document that Dave sent me saying the AER kernel driver may not be loaded if certain ACPI modules are loaded, I will disable this and check for errors. I don't have convenient physical access to the server so this will take a few days. I am attaching the dmesg and lspci -vvv (as root) output to this message. Thanks for all the help folks. 
--Andrew On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N tushar.n.d...@intel.com wrote: -Original Message- From: Andrew Peng [mailto:peng...@gmail.com] Sent: Wednesday, July 11, 2012 8:50 AM To: e1000-devel@lists.sourceforge.net Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang Folks, I've been getting some strange error messages in my home server / router that I've been having trouble debugging. I'm decently proficient in Linux, but I fear I'm in over my head with this one. The hardware is a HP N40L Microserver - here are the hardware details - http://n40l.wikia.com/wiki/Base_Hardware I am running
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Andrew,

I don't think current message level set correctly. Have you ran 'ethtool -s ethx msglvl 0x2c01' I don't see HW ring dump in the log. Please confirm that msglvl is set correctly by running 'ethtool ethx'

-Tushar

-----Original Message-----
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Thursday, July 19, 2012 4:42 PM
To: Dave, Tushar N
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Attached is the dmesg output. Please let me know if this looks right. There are two instances of the error here:

[361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[361106.726604] TDH c5
[361106.726606] TDT c7
[361106.726607] next_to_use c7
[361106.726608] next_to_clean c5
[361106.726609] buffer_info[next_to_clean]:
[361106.726610] time_stamp 105605cd5
[361106.726611] next_to_watch c5
[361106.726612] jiffies 105605e51
[361106.726614] next_to_watch.status 0
[361106.726615] MAC Status 80383
[361106.726616] PHY Status 792d
[361106.726617] PHY 1000BASE-T Status 3800
[361106.726618] PHY Extended Status 3000
[361106.726619] PCI Status 10

[411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[411932.038651] TDH 3d
[411932.038652] TDT 3f
[411932.038653] next_to_use 3f
[411932.038654] next_to_clean 3d
[411932.038655] buffer_info[next_to_clean]:
[411932.038657] time_stamp 106223f55
[411932.038658] next_to_watch 3d
[411932.038659] jiffies 106224069
[411932.038660] next_to_watch.status 0
[411932.038661] MAC Status 80383
[411932.038662] PHY Status 792d
[411932.038663] PHY 1000BASE-T Status 3800
[411932.038664] PHY Extended Status 3000
[411932.038665] PCI Status 10

[422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[422584.120475] TDH 15
[422584.120477] TDT 16
[422584.120478] next_to_use 16
[422584.120479] next_to_clean 15
[422584.120480] buffer_info[next_to_clean]:
[422584.120481] time_stamp 1064ae19c
[422584.120483] next_to_watch 15
[422584.120484] jiffies 1064ae2d6
[422584.120485] next_to_watch.status 0
[422584.120486] MAC Status 80383
[422584.120487] PHY Status 792d
[422584.120488] PHY 1000BASE-T Status 3800
[422584.120489] PHY Extended Status 3000
[422584.120491] PCI Status 10

Thank you again for all the help
--Andrew

On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N tushar.n.d...@intel.com wrote:
> We can find the reason now. Please enable TSO back. Then run ethtool -s ethx msglvl 0x2c01. This will enable debug code that logs HW ring data (into dmesg log) when Tx hang occurs. When issue occur next time please send me the full dmesg log.
> -Tushar
>
> -----Original Message-----
> From: Andrew Peng [mailto:peng...@gmail.com]
> Sent: Wednesday, July 18, 2012 6:24 AM
> To: e1000-devel@lists.sourceforge.net
> Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
>
> Thus far disabling TSO via ethtool has seemed to work - can anyone explain the technical reason why this appears to have fixed the issue?
> --Andrew
>
> On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote:
>> Sorry folks, but I just realized that I hadn't been replying to the list properly and instead I was mistakenly emailing Dave directly. I'm consolidating and re-sending the information to the list.
>>
>> BIOS on the HP N40L does not specify any options for AER or PCIe error management, or packet size (referenced in another thread)
>>
>> I have also tried to disable PCIe power management to no success. I did see one options in the BIOS relating to ACPI functionality, and referencing a document that Dave sent me saying the AER kernel driver may not be loaded if certain ACPI modules are loaded, I will disable this and check for errors. I don't have convenient physical access to the server so this will take a few days.
>>
>> I am attaching the dmesg and lspci -vvv (as root) output to this message.
>>
>> Thanks for all the help folks.
>> --Andrew
>>
>> On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
>>> -----Original Message-----
>>> From: Andrew Peng [mailto:peng...@gmail.com]
>>> Sent: Wednesday, July 11, 2012 8:50 AM
>>> To: e1000-devel@lists.sourceforge.net
>>> Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang
>>>
>>> Folks, I've been getting some strange error messages in my home server / router that I've been having trouble debugging. I'm decently proficient in Linux, but I fear I'm in over my head with this one.
>>>
>>> The hardware is a HP N40L Microserver - here are the hardware details - http://n40l.wikia.com/wiki/Base_Hardware
>>>
>>> I am running Debian Squeeze 6.0:
>>> pengc99@gaia:/$ sudo uname -a
>>> Linux gaia 2.6.32-5-amd64 #1 SMP
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel.

Tushar; I've double checked that the message level is set correctly:

Current message level: 0x2c01 (11265)
Link detected: yes

However, I just checked all of the logs on the server and I do not see a HW ring dump.

Thanks all again for help
--Andrew

On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
> Andrew, I don't think current message level set correctly. Have you ran 'ethtool -s ethx msglvl 0x2c01' I don't see HW ring dump in the log. Please confirm that msglvl is set correctly by running 'ethtool ethx'
> -Tushar
>
> [...]
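For readers following the msglvl exchange, the value 0x2c01 (11265) reported by ethtool can be split into individual message categories. The sketch below is a minimal illustration, assuming the NETIF_MSG_* bit values from the Linux kernel's include/linux/netdevice.h; the table is written from memory and should be checked against the kernel tree in use. The function name `decode_msglvl` is hypothetical.

```python
# Bit values assumed to match the NETIF_MSG_* enum in include/linux/netdevice.h.
# Verify against your own kernel sources before relying on them.
NETIF_MSG_FLAGS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(level):
    """Return the names of the message categories enabled in a msglvl bitmask."""
    return [name for bit, name in sorted(NETIF_MSG_FLAGS.items()) if level & bit]


# The level requested in the thread: 0x2c01 == 11265
print(decode_msglvl(0x2c01))
```

Under these assumptions, 0x2c01 enables the drv, tx_done, rx_status and hw categories. Whether a given driver version actually prints a ring dump for the hw category is a separate question that this sketch does not answer.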
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
On Thu, 19 Jul 2012 20:17:14 -0500 Andrew Peng peng...@gmail.com wrote:
> Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel.

Well, I don't have the debian kernel sources handy to check, but based on the version 2.6.32-5-amd64, It sounds like you don't have. I pointed that patch because your card supports the write-back feature and TDT and TDH are close to each other, less than 4, which is a signature of the bug fixed by the first patch.

fbl

> Tushar; I've double checked that the message level is set correctly:
>
> Current message level: 0x2c01 (11265)
> Link detected: yes
>
> However, I just checked all of the logs on the server and I do not see a HW ring dump.
>
> Thanks all again for help
> --Andrew
>
> [...]
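Flavio's observation that TDT and TDH are fewer than 4 descriptors apart can be checked mechanically against the hang dumps posted in this thread. The following is a rough sketch, not driver code: the helper names are hypothetical, and the ring size of 256 descriptors is an assumption (the thread does not state the configured Tx ring size). The values are copied from the first hang dump.

```python
import re

# Fields copied from the first "Detected Hardware Unit Hang" dump in the thread.
HANG_DUMP = """\
TDH c5
TDT c7
time_stamp 105605cd5
jiffies 105605e51
"""


def parse_hang_dump(text):
    """Parse 'name hexvalue' lines into a dict of integers."""
    fields = {}
    for name, value in re.findall(r"^(\w+)\s+([0-9a-fA-F]+)$", text, re.MULTILINE):
        fields[name] = int(value, 16)
    return fields


def pending_descriptors(tdh, tdt, ring_size=256):
    """Distance from head to tail on the ring, allowing for wrap-around.

    ring_size=256 is an assumption, not a value taken from the thread.
    """
    return (tdt - tdh) % ring_size


fields = parse_hang_dump(HANG_DUMP)
gap = pending_descriptors(fields["TDH"], fields["TDT"])
stalled_jiffies = fields["jiffies"] - fields["time_stamp"]
print(gap, stalled_jiffies)
```

For the three dumps in the thread the head-to-tail distance comes out as 2, 2 and 1, which is consistent with the "less than 4" signature described above. Converting the jiffies difference into seconds requires the kernel's HZ value, which is not given in the thread.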
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
In that case, you can use our e1000e outbox driver from Sourceforge (which should have patches mentioned by Flavio).

-Tushar

-----Original Message-----
From: Flavio Leitner [mailto:f...@redhat.com]
Sent: Thursday, July 19, 2012 6:39 PM
To: Andrew Peng
Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

On Thu, 19 Jul 2012 20:17:14 -0500 Andrew Peng peng...@gmail.com wrote:
> Flavio; I am using the stock kernel driver with the stock Debian Squeeze kernel.

Well, I don't have the debian kernel sources handy to check, but based on the version 2.6.32-5-amd64, It sounds like you don't have. I pointed that patch because your card supports the write-back feature and TDT and TDH are close to each other, less than 4, which is a signature of the bug fixed by the first patch.

fbl

[...]
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>> Somehow setting max payload to 256 from BIOS does not set this value for all devices. I believe this is a BIOS bug.
>
> And preferably, Linux should complain about it. Since we know it is going to cause problems, and since we know it does happen, we should be raising a ruckus about it in the kernel log (and probably fixing it to min(path) while at it)...
>
> Is this something that should be raised as a feature request with the PCI/PCIe subsystem?

The feature is there, but we ended up with:

commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
Author: Jon Mason ma...@myri.com
Date: Mon Oct 3 09:50:20 2011 -0500

    PCI: Disable MPS configuration by default

But you are welcome to share use of the fixup_mpss_256() quirk.

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats.
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, 16 Jul 2012, Ben Hutchings wrote:
> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>>> Somehow setting max payload to 256 from BIOS does not set this value for all devices. I believe this is a BIOS bug.
>>
>> And preferably, Linux should complain about it. Since we know it is going to cause problems, and since we know it does happen, we should be raising a ruckus about it in the kernel log (and probably fixing it to min(path) while at it)...
>>
>> Is this something that should be raised as a feature request with the PCI/PCIe subsystem?
>
> The feature is there, but we ended up with:
>
> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
> Author: Jon Mason ma...@myri.com
> Date: Mon Oct 3 09:50:20 2011 -0500
>
>     PCI: Disable MPS configuration by default
>
> But you are welcome to share use of the fixup_mpss_256() quirk.

Meh. I'd be happy with a warning if MPSS decreases when walking up to the tree root... i.e. a warning if any child has a MPSS larger than the parent.

--
"One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
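The warning Henrique asks for (flag any child whose payload size exceeds its parent's) amounts to a simple walk over the PCIe tree. The sketch below is only an illustration of that check in user space, not kernel code. The topology, device addresses and payload values are hypothetical, and the function name is invented for this example.

```python
def find_mps_mismatches(parent_of, mps):
    """Return (child, parent) pairs where the child's value exceeds the parent's.

    parent_of maps each device to its upstream device (None for a root port);
    mps maps each device to its payload size in bytes.
    """
    mismatches = []
    for child, parent in sorted(parent_of.items()):
        if parent is not None and mps[child] > mps[parent]:
            mismatches.append((child, parent))
    return mismatches


# Hypothetical topology: root port 00:01.0 -> switch port 04:00.0 -> NIC 05:00.0
parent_of = {"00:01.0": None, "04:00.0": "00:01.0", "05:00.0": "04:00.0"}
mps = {"00:01.0": 128, "04:00.0": 256, "05:00.0": 256}

for child, parent in find_mps_mismatches(parent_of, mps):
    print(f"warning: {child} MPS {mps[child]} > parent {parent} MPS {mps[parent]}")
```

The same function works for either the configured MPS or the supported MPSS, depending on which values are fed in; the message above says MPSS, while the failure discussed elsewhere in the thread concerns the configured MPS.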
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, Jul 16, 2012 at 9:08 AM, Henrique de Moraes Holschuh h...@hmh.eng.br wrote:
> On Mon, 16 Jul 2012, Ben Hutchings wrote:
>> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>>> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>>>> Somehow setting max payload to 256 from BIOS does not set this value for all devices. I believe this is a BIOS bug.
>>>
>>> And preferably, Linux should complain about it. Since we know it is going to cause problems, and since we know it does happen, we should be raising a ruckus about it in the kernel log (and probably fixing it to min(path) while at it)...
>>>
>>> Is this something that should be raised as a feature request with the PCI/PCIe subsystem?
>>
>> The feature is there, but we ended up with:
>>
>> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
>> Author: Jon Mason ma...@myri.com
>> Date: Mon Oct 3 09:50:20 2011 -0500
>>
>>     PCI: Disable MPS configuration by default
>>
>> But you are welcome to share use of the fixup_mpss_256() quirk.
>
> Meh. I'd be happy with a warning if MPSS decreases when walking up to the tree root... i.e. a warning if any child has a MPSS larger than the parent.

You can add pci=pcie_bus_safe to the kernel params and it should resolve your issue.

> --
> "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot
> Henrique Holschuh
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Sun, 15 Jul 2012, Dave, Tushar N wrote:
> Somehow setting max payload to 256 from BIOS does not set this value for all devices. I believe this is a BIOS bug.

And preferably, Linux should complain about it. Since we know it is going to cause problems, and since we know it does happen, we should be raising a ruckus about it in the kernel log (and probably fixing it to min(path) while at it)...

Is this something that should be raised as a feature request with the PCI/PCIe subsystem?

--
"One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Thursday, July 12, 2012 9:34 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/13/12 12:10, Dave, Tushar N wrote:
>>> -----Original Message-----
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Thursday, July 12, 2012 4:46 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> Thanks for sending full dmesg log. I am still investigating. I think this issue can occur if two PCIe link partners (i.e. the PCIe bridge and the PCIe device) do not have the same max payload size.
>> I need two more pieces of information:
>> 1) PBA number of the card.
>
> This is a remote server and I could not get this.
>
>> 2) full lspci -vvv output of entire system 'after you have changed max payload size to 128'.

Somehow setting max payload to 256 from the BIOS does not set this value for all devices. I believe this is a BIOS bug. All devices in the path from the root complex to the 82571 should have the same max payload size; otherwise it can cause a hang.
When you set max payload to 128 from the BIOS, all devices in the path from the root complex to the 82571 were assigned the same max payload size. This resolves the issue.

I hope this helps.

-Tushar
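The rule stated above (every device on the path from the root complex to the 82571 should carry the same max payload size) can be checked from saved `lspci -vvv` output. The sketch below is a rough illustration under several assumptions: it relies on the DevCtl/DevSta layout seen in the lspci output posted in this thread, the helper names are hypothetical, and the 00:07.0 bridge stanza in the sample is invented for the example. Only the 05:00.0 values (DevCap 256, DevCtl 128) come from the thread.

```python
import re

# Device address at the start of a stanza, with an optional PCI domain prefix.
ADDR = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
MPS = re.compile(r"MaxPayload (\d+) bytes")


def devctl_mps(lspci_text):
    """Map each device address to the MaxPayload value inside its DevCtl block."""
    result, device, in_devctl = {}, None, False
    for line in lspci_text.splitlines():
        head = ADDR.match(line)
        if head:
            device, in_devctl = head.group(1), False
            continue
        field = line.strip()
        if field.startswith("DevCtl:"):
            in_devctl = True
        elif field.startswith("DevSta:"):
            in_devctl = False
        if in_devctl and device is not None:
            found = MPS.search(field)
            if found:
                result[device] = int(found.group(1))
    return result


def check_path(mps_by_device, path):
    """Return (all_equal, smallest_value) for the devices listed in path."""
    values = [mps_by_device[dev] for dev in path]
    return len(set(values)) == 1, min(values)


# The 00:07.0 stanza is hypothetical; the 05:00.0 values are from the thread.
SAMPLE = """\
00:07.0 PCI bridge: hypothetical upstream port
\tCapabilities: [90] Express (v2) Root Port (Slot+), MSI 00
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 256 bytes, MaxReadReq 128 bytes
\t\tDevSta:\tCorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
\tCapabilities: [e0] Express (v1) Endpoint, MSI 00
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 128 bytes, MaxReadReq 512 bytes
\t\tDevSta:\tCorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
"""

mps = devctl_mps(SAMPLE)
consistent, smallest = check_path(mps, ["00:07.0", "05:00.0"])
print(mps, consistent, smallest)
```

The path itself has to be supplied by hand here (for example from `lspci -t`); the smallest value on the path is the one a "safe" policy would presumably settle on, in the spirit of the min(path) suggestion made elsewhere in this thread.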
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/15/12 11:42, Dave, Tushar N wrote:
> [...]
> Somehow setting max payload to 256 from BIOS does not set this value for all devices. I believe this is a BIOS bug. All devices in path from root complex to 82571, should have same max payload size otherwise it can cause hang.
> When you set max payload to 128 from BIOS, all device in path from root complex to 82571 got assigned same max payload size. This resolves the issue.
>
> I hope this helps.

Tushar,

Thanks a lot for your help, will send this to hardware engineer.

Regards,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/12/12 13:57, Dave, Tushar N wrote:
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Wednesday, July 11, 2012 8:13 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/12/12 11:07, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Wednesday, July 11, 2012 7:58 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>> What is the exact error messages in BIOS log?
>>
>> Error message from BIOS event log:
>> 07/12/12 05:54:00 PCI Express Non-Fatal Error
>>
>> Thanks,
>> Joe
>
> Hi Tushar,
>
> Please find eeprom from attachment.
>
> Do you have lspci -vvv dump of entire system before and after issue occurs? If you have can you send it to me?

Sorry but I meant the full lspci -vvv of *entire system* before and after issue occurs and not of 82571 only.
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Thursday, July 12, 2012 12:11 AM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/12/12 14:41, Dave, Tushar N wrote:
>> [...]
>> Sorry but I meant the full lspci -vvv of *entire system* before and after issue occurs and not of 82571 only.
>
> Before:
> ===
> 00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
>         Subsystem: Oracle Corporation Device 5352

Joe, thanks for all the data.
You said you have changed max payload size and issue stop occurring. How did you change it? Where did you make that change in BIOS or EEPROM or in PCIe config space?
Also please send me the full dmesg of entire system after you change max payload size.

Thanks.
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Thursday, July 12, 2012 4:46 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang

Thanks for sending the full dmesg log. I am still investigating. I think this issue can occur if two PCIe link partners (i.e. the PCIe bridge and the PCIe device) do not have the same max payload size.
I need two more pieces of information:
1) PBA number of the card.
2) full lspci -vvv output of entire system 'after you have changed max payload size to 128'.

Thanks.

-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Tuesday, July 10, 2012 10:03 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/11/12 12:05, Dave, Tushar N wrote:
>> When you said you had this issue with RHEL5 and RHEL6 drivers, have you install RHEl5/6 kernel and reproduced it? If so I think I should install RHEL6 and try reproduce it locally!
>
> Yes I reproduced this on both RHEL5 and RHEL6.
> So far I tried to scp big file (~1GB) will hit it at once.
>
> Thanks,
> Joe

Joe,

Can you please send lspci -vvv output for failing port before issue occurs.

Thanks.

-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/11/12 15:11, Dave, Tushar N wrote:
> [...]
> Joe,
>
> Can you please send lspci -vvv output for failing port before issue occurs.
>
> Thanks.

# lspci -s 05:00.0 -vvv
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
	Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP Low Profile Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin B routed to IRQ 80
	Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at dc00 [size=32]
	Expansion ROM at fbda [disabled] [size=128K]
	Capabilities: [c8] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: fee21000 Data: 40cb
	Capabilities: [e0] Express (v1) Endpoint, MSI 00
		DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
	Kernel driver in use: e1000e
	Kernel modules: e1000e

Thanks,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/11/12 15:37, Dave, Tushar N wrote:

--- snip ---

# lspci -s 05:00.0 -vvv
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

--- snip ---

Was this lspci output taken on a freshly booted system?

Yes. Did you find any issue?

Thanks,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
-----Original Message-----
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 12:39 AM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 15:37, Dave, Tushar N wrote:

--- snip ---

Was this lspci output taken on a freshly booted system?

Yes. Did you find any issue?

Thanks,
Joe

The device status and AER sections show some errors that look a little suspicious to me, but I'm not too sure. I will get back to you tomorrow.

-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/11/12 15:50, Dave, Tushar N wrote:

Device status and AER sections show some errors that looks little suspicious to me but I'm not too sure. I will get back tomorrow.

Thanks a lot, Tushar!

Joe

--
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
-----Original Message-----
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 10:03 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

--- snip ---

Joe,

I see a couple of errors in the lspci output. The device status register (DevSta) shows an uncorrectable PCIe error, which means something certainly went wrong. The only way to recover from uncorrectable errors is a reset.

DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-

The AER section of the lspci output also shows a PCIe completion timeout.

Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-

I suggest you load the AER driver and check for any error messages in the log. Also please check for any error messages reported by the system in the BIOS log. Are there any machine check errors? When did you notice this issue? Has the 82571 ever worked on this server before?

One more thing: a cache line size of 256 is a little unusual (I have never seen this value before; mostly it's 64). Have the BIOS settings been changed? Are you using the default BIOS settings?

Thanks.
-Tushar
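To make the reading of the DevSta and UESta lines above easier to repeat, here is a minimal Python sketch that lists which flags lspci marks with a trailing `+`. The function name `set_flags` is made up for this illustration, and the two input lines are copied from the lspci -vvv dump Joe posted earlier in the thread; this is only a convenience for eyeballing the output, not part of any tool mentioned here.

```python
import re
from typing import List

def set_flags(lspci_line: str) -> List[str]:
    """Return the names of flags that lspci marks as set ('+') on one status line."""
    # Drop the register label ("DevSta:", "UESta:") and scan the remainder.
    _, _, body = lspci_line.partition(":")
    return [m.group(1) for m in re.finditer(r"(\w+)\+", body)]

# Lines copied from the lspci -vvv dump posted for 05:00.0.
devsta = "DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-"
uesta = ("UESta: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- "
         "MalfTLP+ ECRC- UnsupReq+ ACSViol-")

print(set_flags(devsta))  # flags set in Device Status
print(set_flags(uesta))   # flags set in the AER uncorrectable error status
```

On the posted dump this picks out UncorrErr and UnsuppReq in DevSta, and CmpltTO, MalfTLP and UnsupReq in UESta, which are the bits discussed above.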
Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
-----Original Message-----
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 11, 2012 8:50 AM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Folks,

I've been getting some strange error messages on my home server / router that I've been having trouble debugging. I'm decently proficient in Linux, but I fear I'm in over my head with this one.

The hardware is an HP N40L Microserver. Here are the hardware details: http://n40l.wikia.com/wiki/Base_Hardware

I am running Debian Squeeze 6.0:

pengc99@gaia:/$ sudo uname -a
Linux gaia 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux

I also subscribe to Ksplice's Uptrack system, but since I have the newest kernel installed (as released by Debian) there have been no hot-patches yet.

This is the message I've been getting in /var/log/kern.log:

Jul 11 08:55:38 gaia kernel: [402056.009687] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
Jul 11 08:55:38 gaia kernel: [402056.009690] TDH fc
Jul 11 08:55:38 gaia kernel: [402056.009692] TDT fd
Jul 11 08:55:38 gaia kernel: [402056.009693] next_to_use fd
Jul 11 08:55:38 gaia kernel: [402056.009694] next_to_clean fc
Jul 11 08:55:38 gaia kernel: [402056.009695] buffer_info[next_to_clean]:
Jul 11 08:55:38 gaia kernel: [402056.009697] time_stamp 105fc92b2
Jul 11 08:55:38 gaia kernel: [402056.009698] next_to_watch fc
Jul 11 08:55:38 gaia kernel: [402056.009699] jiffies 105fc93da
Jul 11 08:55:38 gaia kernel: [402056.009700] next_to_watch.status 0
Jul 11 08:55:38 gaia kernel: [402056.009701] MAC Status 80383
Jul 11 08:55:38 gaia kernel: [402056.009702] PHY Status 792d
Jul 11 08:55:38 gaia kernel: [402056.009703] PHY 1000BASE-T Status 3800
Jul 11 08:55:38 gaia kernel: [402056.009705] PHY Extended Status 3000
Jul 11 08:55:38 gaia kernel: [402056.009706] PCI Status 10

Complete output of lspci:

pengc99@gaia:/$ lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] RS880 Host Bridge
00:01.0 PCI bridge: Hewlett-Packard Company Device 9602
00:02.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (ext gfx port 0)
00:06.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge (PCIE port 2)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] (rev 40)
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 42)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller (rev 40)
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
00:16.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:16.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc M880G [Mobility Radeon HD 4200]
02:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
02:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5723 Gigabit Ethernet PCIe (rev 10)

Output of lspci -vvv (as root, network adapter section):

02:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    Subsystem: Hewlett-Packard Company NC360T PCI Express Dual Port Gigabit Server Adapter
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 26
    Region 0: Memory at fe8e (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fe8c (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at e800 [size=32]
    Expansion ROM at fe8a [disabled] [size=128K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable-
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/12/12 02:51, Dave, Tushar N wrote:

--- snip ---

One more thing, Cache line size 256 is little unusual( I never seen this value before, mostly it's 64). Does BIOS settings have been changed? Are you using default BIOS setting?

I checked the BIOS log and found the fault reported from the device. I changed the PCI-E Payload Size from 256 (the default) to 128, and now the device works.

Comparing the lspci output, I found that the address and data in the MSI capability have changed:

Old:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Address: fee21000 Data: 40cb

New:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Address: fee24000 Data: 405c

Most likely it's a BIOS bug? Please comment.

Thanks,
Joe
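For readers following the payload-size change, the sketch below shows how the Max_Payload_Size field is encoded in the PCIe Device Control register (bits 7:5, where a field value n means 128 << n bytes). It is a hedged illustration only: the helper names `mps_bytes` and `with_mps` and the starting register value 0x002f are hypothetical and are not taken from Joe's system.

```python
MPS_SHIFT = 5                 # Max_Payload_Size occupies bits 7:5 of Device Control
MPS_MASK = 0x7 << MPS_SHIFT   # 0x00e0

# Field encodings defined by the PCIe specification.
MPS_CODES = {128: 0, 256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}

def mps_bytes(devctl: int) -> int:
    """Decode the Max_Payload_Size field of a 16-bit Device Control value."""
    return 128 << ((devctl & MPS_MASK) >> MPS_SHIFT)

def with_mps(devctl: int, size: int) -> int:
    """Return devctl with only the Max_Payload_Size field replaced."""
    return (devctl & ~MPS_MASK & 0xFFFF) | (MPS_CODES[size] << MPS_SHIFT)

before = 0x002F               # hypothetical value: MPS field = 001b (256 bytes)
after = with_mps(before, 128)
print(mps_bytes(before), mps_bytes(after), hex(after))
```

The point of the read-modify-write in `with_mps` is that the other Device Control bits (error reporting enables, relaxed ordering, and so on) stay untouched; only the payload field moves from the 256-byte encoding to the 128-byte one.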
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/12/12 10:52, Dave, Tushar N wrote:

What is the exact error messages in BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00 PCI Express Non-Fatal Error

Thanks,
Joe
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
-----Original Message-----
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 7:58 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 10:52, Dave, Tushar N wrote:

What is the exact error messages in BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00 PCI Express Non-Fatal Error

Thanks,
Joe

Thanks. I will check with the team tomorrow whether this (max payload size) can be treated as the solution to this issue. We can learn more about exactly which non-fatal error occurred if we capture a bus trace.

We should also check the EEPROM on this device to make sure it is up to date. Send me the full EEPROM dump in a file and I will confirm with the team that it is up to date.

Thanks for your work.
-Tushar
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
-----Original Message-----
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 8:13 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 11:07, Dave, Tushar N wrote:

--- snip ---

Hi Tushar,
Please find the eeprom dump in the attachment.

Do you have an lspci -vvv dump of the entire system from before and after the issue occurs? If you have, can you send it to me?
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
When I debugged the driver, I found that before the hardware hang is detected, the driver is unable to clean and reclaim the resources:

    while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&    <== here upper.data is always 0x300
           (count < tx_ring->count)) {
    --- snip ---
    }

I checked all of the driver code and did not find anywhere that sets upper.data with E1000_TXD_STAT_DD. I guess upper.data is set by hardware? If the OS is a 32-bit system, what would happen?

Thanks in advance,
Joe

On 07/09/12 16:51, Joe Jin wrote:

Hi list,

I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2: just copying a big file (500M) from another server hits it at once.

Would you please help with this?

device info:

# lspci -s 05:00.0
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)

# lspci -s 05:00.0 -n
05:00.0 0200: 8086:10bc (rev 06)

# ethtool -i eth0
driver: e1000e
version: 2.0.0-NAPI
firmware-version: 5.10-2
bus-info: :05:00.0

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on

kernel log:
---
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH 6c
  TDT 81
  next_to_use 81
  next_to_clean 6b
buffer_info[next_to_clean]:
  time_stamp fffc7a23
  next_to_watch 71
  jiffies fffc8c0c
  next_to_watch.status 0
MAC Status 80387
PHY Status 792d
PHY 1000BASE-T Status 3c00
PHY Extended Status 3000
PCI Status 10
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH 6c
  TDT 81
  next_to_use 81
  next_to_clean 6b
buffer_info[next_to_clean]:
  time_stamp fffc7a23
  next_to_watch 71
  jiffies fffc9bac
  next_to_watch.status 0
MAC Status 80387
PHY Status 792d
PHY 1000BASE-T Status 3c00
PHY Extended Status 3000
PCI Status 10
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
Hardware name: SUN FIRE X2270 M2
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U) snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device igb snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr i7core_edac iTCO_vendor_support ioatdma ghes dca edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage sd_mod crc_t10dif sg ahci libahci ext3 jbd mbcache [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
Call Trace:
 [c07d9ac5] ? dev_watchdog+0x225/0x230
 [c045ba61] warn_slowpath_common+0x81/0xa0
 [c07d9ac5] ? dev_watchdog+0x225/0x230
 [c045bb23] warn_slowpath_fmt+0x33/0x40
 [c07d9ac5] dev_watchdog+0x225/0x230
 [c07d98a0] ? dev_activate+0xb0/0xb0
 [c0468e82] call_timer_fn+0x32/0xf0
 [c04bceb0] ? rcu_check_callbacks+0x80/0x80
 [c046a76d] run_timer_softirq+0xed/0x1b0
 [c07d98a0] ? dev_activate+0xb0/0xb0
 [c0461a81] __do_softirq+0x91/0x1a0
 [c04619f0] ? local_bh_enable+0x80/0x80
 IRQ
 [c0462295] ? irq_exit+0x95/0xa0
 [c087f8b8] ? smp_apic_timer_interrupt+0x38/0x42
 [c08784f5] ? apic_timer_interrupt+0x31/0x38
 [c046007b] ? do_exit+0x11b/0x370
 [c065eae4] ? intel_idle+0xa4/0x100
 [c078d9b9] ? cpuidle_idle_call+0xb9/0x1e0
 [c0411d77] ? cpu_idle+0x97/0xd0
 [c085cbbd] ? rest_init+0x5d/0x70
 [c0b07a7a] ? start_kernel+0x28a/0x340
 [c0b074b0] ? obsolete_checksetup+0xb0/0xb0
 [c0b070a4] ? i386_start_kernel+0x64/0xb0
---[ end trace 5502b55cd4d4e5cb ]---
e1000e :05:00.0: eth0: Reset adapter
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Thanks,
Joe

--
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing
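As a reading aid for the two hang reports above, here is a small Python sketch that works out how long the stuck buffer had been waiting (jiffies minus time_stamp) and how many descriptors sit between TDH and TDT. The hex values come from the kernel log in this message; the helper names, the 32-bit counter width and the ring size of 256 are assumptions made for the illustration, not values stated in the thread.

```python
def jiffies_elapsed(now: int, then: int, bits: int = 32) -> int:
    """Difference between two jiffies samples, tolerating counter wraparound."""
    return (now - then) & ((1 << bits) - 1)

def ring_pending(tdh: int, tdt: int, ring_size: int) -> int:
    """Descriptors handed to hardware (head..tail) that are not yet processed."""
    return (tdt - tdh) % ring_size

TIME_STAMP = 0xFFFC7A23        # buffer_info[next_to_clean].time_stamp
FIRST_REPORT = 0xFFFC8C0C      # jiffies in the first hang report
SECOND_REPORT = 0xFFFC9BAC     # jiffies in the second hang report
RING_SIZE = 256                # assumed Tx ring size (not given in the log)

print(jiffies_elapsed(FIRST_REPORT, TIME_STAMP))   # jiffies waited at first report
print(jiffies_elapsed(SECOND_REPORT, TIME_STAMP))  # jiffies waited at second report
print(ring_pending(0x6C, 0x81, RING_SIZE))         # descriptors between TDH and TDT
```

Dividing the jiffies figures by the kernel's HZ value (not stated in the thread) would give the wait in seconds. The unchanged time_stamp, TDH and TDT across both reports are consistent with the transmit unit making no progress in between.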
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
-----Original Message-----
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
Sent: Tuesday, July 10, 2012 12:40 AM
To: Joe Jin
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

When I debug the driver I found before Detected HW hang, driver unable to clean and reclaim the resources:

    while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&    <== here upper.data is always 0x300
           (count < tx_ring->count)) {
    --- snip ---
    }

I checked all driver codes I did not found anywhere will set the upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?

Yes, upper.data (part of it is the STATUS byte) is set by the hardware. Basically, the driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set, the hardware has processed that descriptor and the driver can now clean it. With the value 0x300, the DD bit is not set, which means the hardware has not processed that descriptor.

How quickly does the tx hang reproduce? I suggest you enable the debug code in the driver so that when the tx hang occurs it dumps the hardware descriptor ring info into the kernel log. You can run ethtool -s ethx msglvl 0x2c00 to enable debug. Once the tx hang occurs, please send me the full dmesg log.

Does the tx hang occur with the in-kernel e1000e driver too?

Thanks.
-Tushar

If OS is 32bit system, what which happen?

Thanks in advance,
Joe

On 07/09/12 16:51, Joe Jin wrote:

--- snip ---
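The Descriptor Done check described in the reply above can be illustrated with a few lines of Python. This is only a sketch of the bit test, not driver code: E1000_TXD_STAT_DD is bit 0 (0x00000001) in the e1000e headers, the descriptor field is little-endian (hence cpu_to_le32 in the driver), and the function name `descriptor_done` and the sample value 0x301 are made up for the example.

```python
import struct

E1000_TXD_STAT_DD = 0x00000001  # Descriptor Done bit in the Tx descriptor status

def descriptor_done(upper_data_le: bytes) -> bool:
    """Return True if DD is set in a little-endian 32-bit upper.data field."""
    (upper,) = struct.unpack("<I", upper_data_le)
    return bool(upper & E1000_TXD_STAT_DD)

stuck = struct.pack("<I", 0x300)      # value Joe observed in upper.data
finished = struct.pack("<I", 0x301)   # hypothetical value after hardware write-back

print(descriptor_done(stuck))     # False: hardware has not completed this descriptor
print(descriptor_done(finished))  # True: the driver may clean and reclaim it
```

Because 0x300 has bit 0 clear, the cleanup loop's condition is false on its first test, so nothing is reclaimed and the watchdog eventually reports the hang.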
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Dave, Tushar N
> Sent: Tuesday, July 10, 2012 12:02 PM
> To: Joe Jin
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org; Dave, Tushar N
> Subject: RE: 82571EB: Detected Hardware Unit Hang
>
>> -----Original Message-----
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
>> Sent: Tuesday, July 10, 2012 12:40 AM
>> To: Joe Jin
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> When I debug the driver I found before Detected HW hang, driver unable to
>> clean and reclaim the resources:
>>
>>     while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&   <== at here upper.data always is 0x300
>>            (count < tx_ring->count)) {
>>         --- snip ---
>>     }
>>
>> I checked all driver codes I did not found anywhere will set the upper.data
>> with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?
>
> Yes upper.data (part of it is STATUS byte) is set by HW.
> Basically driver checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit
> is set that means HW has processed that descriptor and driver can now clean
> that descriptor.
> With value 0x300, DD bit is not set. That means HW has not processed that
> descriptor.
>
> How fast does tx hang reproduce? I suggest you to enable debug code in driver
> so when tx hang occurs it will dump the HW desc ring info into kernel log.
> You can run
>   ethtool -s ethx msglvl 0x2c00
> to enable debug. Once tx hang occurs please send me the full dmesg log.
>
> Does tx hang occur with in-kernel e1000e driver too?
>
> Thanks.
>
> -Tushar

One change, please use
  ethtool -s ethx msglvl 0x2c01
so to keep default 'drv' msglvl enabled.
Confirm the message level is set correctly by running command 'ethtool ethx'.
The last few lines will be:
  Current message level: 0x2c01 (11265)
                         drv tx_done rx_status hw

>> If OS is 32bit system, what which happen?
>>
>> Thanks in advance,
>> Joe
>>
>> On 07/09/12 16:51, Joe Jin wrote:
>>> Hi list,
>>>
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
>>> scp test. this issue is easy to reproduce on SUN FIRE X2270 M2, just copy
>>> a big file (500M) from another server will hit it at once.
>>>
>>> Would you please help on this?
>>>
>>> [...]
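To see why 0x2c01 yields "drv tx_done rx_status hw" while 0x2c00 drops 'drv', the mask can be decoded bit by bit. This is a small sketch; the bit values are the NETIF_MSG_* constants as I understand them from the kernel's include/linux/netdevice.h, and the short names are the ones ethtool prints. Treat the table as an assumption to check against your kernel headers; the function name is made up.

```python
# NETIF_MSG_* bits (assumed from include/linux/netdevice.h), lowest bit first.
NETIF_MSG_BITS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(level: int) -> list[str]:
    """Return the names of all message classes enabled in `level`."""
    return [name for bit, name in sorted(NETIF_MSG_BITS.items()) if level & bit]


for level in (0x2C00, 0x2C01):
    print(f"{level:#06x} ({level}): {' '.join(decode_msglvl(level))}")
```

The only difference between the two values is bit 0, which is the default 'drv' class mentioned in the message.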
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/11/12 03:02, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
>> Sent: Tuesday, July 10, 2012 12:40 AM
>> To: Joe Jin
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> When I debug the driver I found before Detected HW hang, driver unable to
>> clean and reclaim the resources:
>>
>>     while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&   <== at here upper.data always is 0x300
>>            (count < tx_ring->count)) {
>>         --- snip ---
>>     }
>>
>> I checked all driver codes I did not found anywhere will set the upper.data
>> with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?
>
> Yes upper.data (part of it is STATUS byte) is set by HW.
> Basically driver checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit
> is set that means HW has processed that descriptor and driver can now clean
> that descriptor.
> With value 0x300, DD bit is not set. That means HW has not processed that
> descriptor.

Thanks for the clarify, might be firmware issue?

> How fast does tx hang reproduce? I suggest you to enable debug code in driver
> so when tx hang occurs it will dump the HW desc ring info into kernel log.

Once I copy a file from other server, issue to be reproduced at once.
I'll enable the debug to get more debug info.

> You can run
>   ethtool -s ethx msglvl 0x2c00
> to enable debug. Once tx hang occurs please send me the full dmesg log.
>
> Does tx hang occur with in-kernel e1000e driver too?

I tried several drivers included rhel5 the latest, Intel the latest, rhel6
the latest, issue see on all those drivers.

Thanks,
Joe

> Thanks.
>
> -Tushar
>
>> If OS is 32bit system, what which happen?
>>
>> Thanks in advance,
>> Joe
>>
>> On 07/09/12 16:51, Joe Jin wrote:
>>> Hi list,
>>>
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
>>> scp test. this issue is easy to reproduce on SUN FIRE X2270 M2, just copy
>>> a big file (500M) from another server will hit it at once.
>>>
>>> Would you please help on this?
>>>
>>> [...]
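The Descriptor Done test discussed in this exchange comes down to a single bit mask. The sketch below assumes E1000_TXD_STAT_DD is bit 0 (0x00000001, as I recall it from the e1000e headers), that the status byte is the low byte of upper.data, and that the value has already been converted to CPU byte order. The helper names are made up for illustration.

```python
E1000_TXD_STAT_DD = 0x00000001  # Descriptor Done bit (assumed value)


def descriptor_done(upper_data: int) -> bool:
    """True if hardware has written back the descriptor (DD bit set)."""
    return bool(upper_data & E1000_TXD_STAT_DD)


def status_byte(upper_data: int) -> int:
    """Low byte of upper.data, assumed here to be the STATUS field."""
    return upper_data & 0xFF


observed = 0x300  # the value reported in the thread
print(f"status byte: {status_byte(observed):#04x}")
print(f"descriptor done: {descriptor_done(observed)}")
```

With 0x300 the low byte is zero, so the cleanup loop never advances past that descriptor, which matches the explanation given in the reply.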
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Tuesday, July 10, 2012 5:35 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/11/12 03:02, Dave, Tushar N wrote:
>>> -----Original Message-----
>>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Joe Jin
>>> Sent: Tuesday, July 10, 2012 12:40 AM
>>> To: Joe Jin
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> When I debug the driver I found before Detected HW hang, driver unable to
>>> clean and reclaim the resources:
>>>
>>>     while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&   <== at here upper.data always is 0x300
>>>            (count < tx_ring->count)) {
>>>         --- snip ---
>>>     }
>>>
>>> I checked all driver codes I did not found anywhere will set the upper.data
>>> with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?
>>
>> Yes upper.data (part of it is STATUS byte) is set by HW.
>> Basically driver checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit
>> is set that means HW has processed that descriptor and driver can now clean
>> that descriptor.
>> With value 0x300, DD bit is not set. That means HW has not processed that
>> descriptor.
>
> Thanks for the clarify, might be firmware issue?
>
>> How fast does tx hang reproduce? I suggest you to enable debug code in driver
>> so when tx hang occurs it will dump the HW desc ring info into kernel log.
>
> Once I copy a file from other server, issue to be reproduced at once.
> I'll enable the debug to get more debug info.
>
>> You can run
>>   ethtool -s ethx msglvl 0x2c00
>> to enable debug. Once tx hang occurs please send me the full dmesg log.
>>
>> Does tx hang occur with in-kernel e1000e driver too?
>
> I tried several drivers included rhel5 the latest, Intel the latest, rhel6
> the latest, issue see on all those drivers.

Also after issue occurs please capture lspci -vvv (run as root)

> Thanks,
> Joe
>
>> Thanks.
>>
>> -Tushar
>>
>>> If OS is 32bit system, what which happen?
>>>
>>> Thanks in advance,
>>> Joe
>>>
>>> On 07/09/12 16:51, Joe Jin wrote:
>>>> Hi list,
>>>>
>>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
>>>> scp test. this issue is easy to reproduce on SUN FIRE X2270 M2, just copy
>>>> a big file (500M) from another server will hit it at once.
>>>>
>>>> Would you please help on this?
>>>>
>>>> [...]
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 07/11/12 11:22, Dave, Tushar N wrote:
> Thanks for info. I see that hang occurs right when HW processing first TX
> descriptor with TSO.
> Would you be able to reproduce issue with TSO off?
> Disable TSO by 'ethtool -K ethx tso off'
> Let all debug enabled as it is, that will help us debug further if issue
> occurs with TSO off.

Hi Tushar,

Thanks for your quick reply but disabled tso no help for this issue:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on

kernel log after disable tso:

e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  1
  TDT                  4
  next_to_use          4
  next_to_clean        1
buffer_info[next_to_clean]:
  time_stamp           103ae0aba
  next_to_watch        1
  jiffies              103ae16a0
  next_to_watch.status 0
MAC Status             80387
PHY Status             792d
PHY 1000BASE-T Status  3c00
PHY Extended Status    3000
PCI Status             10
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  1
  TDT                  4
  next_to_use          4
  next_to_clean        1
buffer_info[next_to_clean]:
  time_stamp           103ae0aba
  next_to_watch        1
  jiffies              103ae2640
  next_to_watch.status 0
MAC Status             80387
PHY Status             792d
PHY 1000BASE-T Status  3c00
PHY Extended Status    3000
PCI Status             10
e1000e :05:00.0: Net device Info
e1000e: Device Name state trans_start last_rx
e1000e: eth00003 000103AE128A
e1000e :05:00.0: Register Dump
e1000e: Register Name  Value
e1000e: CTRL           180c0241
e1000e: STATUS         00080387
e1000e: CTRL_EXT       181400c0
e1000e: ICR            0040
e1000e: RCTL           04048002
e1000e: RDLEN          1000
e1000e: RDH            0090
e1000e: RDT            0080
e1000e: RDTR           0020
e1000e: RXDCTL[0-1]    01040420 01040420
e1000e: ERT
e1000e: RDBAL          23852000
e1000e: RDBAH          000c
e1000e: RDFH           075a
e1000e: RDFT           0752
e1000e: RDFHS          0758
e1000e: RDFTS          0752
e1000e: RDFPC          01b4
e1000e: TCTL           3003f00a
e1000e: TDBAL          1210c000
e1000e: TDBAH          000c
e1000e: TDLEN          1000
e1000e: TDH            0001
e1000e: TDT            0004
e1000e: TIDV           0008
e1000e: TXDCTL[0-1]    0145011f 0145011f
e1000e: TADV           0020
e1000e: TARC[0-1]      07a00403 07400403
e1000e: TDFH           1308
e1000e: TDFT           1308
e1000e: TDFHS          1308
e1000e: TDFTS          1308
e1000e: TDFPC
e1000e :05:00.0: Tx Ring Summary
e1000e: Queue [NTU] [NTC] [bi(ntc)-dma ] leng ntw timestamp
e1000e: 0 4 1 000620800C02 002A 1 000103AE0ABA
e1000e :05:00.0: Tx Ring Dump
e1000e: Tl[desc] [address 63:0 ] [SpeCssSCmCsLen] [bi-dma ] leng ntw timestamp bi-skb -- Legacy format
e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi-dma ] leng ntw timestamp bi-skb -- Ext Context format
e1000e: Td[desc] [address 63:0 ] [VlaPoRSCm1Dlen] [bi-dma ] leng ntw timestamp bi-skb -- Ext Data format
e1000e: Tl[0x000] 000C1AA0F002 8B2A 002A 0 (null)
e1000e: Tl[0x001] 000620800C02 8B2A 000620800C02 002A 1 000103AE0ABA 88061c6b6980 NTC
e1000e: Tl[0x002] 00061E6DBC02 8B2A 00061E6DBC02 002A 2 000103AE0EA2 88061c6b6880
e1000e: Tl[0x003] 000620A6C402 8B2A 000620A6C402 002A 3 000103AE128A 8806230b4080
e1000e: Tl[0x004] 0 (null) NTU
e1000e: Tl[0x005] 0 (null)
e1000e: Tl[0x006] 0 (null)
e1000e: Tl[0x007] 0 (null)
e1000e: Tl[0x008] 0 (null)
e1000e: Tl[0x009] 0 (null)
e1000e: Tl[0x00A] 0 (null)
e1000e: Tl[0x00B] 0 (null)
e1000e: Tl[0x00C] 0 (null)
e1000e: Tl[0x00D]
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> -----Original Message-----
> From: Joe Jin [mailto:joe@oracle.com]
> Sent: Tuesday, July 10, 2012 8:29 PM
> To: Dave, Tushar N
> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: 82571EB: Detected Hardware Unit Hang
>
> On 07/11/12 11:22, Dave, Tushar N wrote:
>> Thanks for info. I see that hang occurs right when HW processing first TX
>> descriptor with TSO.
>> Would you be able to reproduce issue with TSO off?
>> Disable TSO by 'ethtool -K ethx tso off'
>> Let all debug enabled as it is, that will help us debug further if issue
>> occurs with TSO off.
>
> Hi Tushar,
>
> Thanks for your quick reply but disabled tso no help for this issue:

Thanks for running a quick test. I don't find anything obvious wrong in
descriptor dump.
When you said you had this issue with RHEL5 and RHEL6 drivers, have you
install RHEL5/6 kernel and reproduced it? If so I think I should install
RHEL6 and try reproduce it locally!

-Tushar

> # ethtool -k eth0
> [...]
>
> kernel log after disable tso:
>
> [...]
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
> Hi list,
>
> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
> scp test. this issue is easy to reproduce on SUN FIRE X2270 M2, just copy
> a big file (500M) from another server will hit it at once.
>
> Would you please help on this?

It's a known problem.

But apparently Intel guys are not very responsive, as they have another
patch than the following:

http://permalink.gmane.org/gmane.linux.network/232669

We only have to wait they push their alternative patch, eventually.

In the mean time, you can use Hiroaki SHIMODA patch, it works.

___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
(moving the discussion back to the list)

Hi,

I am sorry, I didn't receive your patch as we discussed in private and
ended up writing one patch myself which essentially does the same thing.

The patch is available at:
https://bugzilla.redhat.com/show_bug.cgi?id=746272#c13

It schedules a workqueue to flush the descriptors 500ms after sent the
first packet. This ensures that there will be a write-back and enough time
before the watchdog detects it as an old entry.

Time:  0 ms      - x ms  - y ms  -...- 500ms -
Pkts:  pkt#1     - pkt#2 - pkt#3 -...- pkt#n - pkt(n+1)
Event: schedule  -       -       -     flush - schedule
       workqueue                               workqueue

Customer reported that it works, so IMHO, the root cause is confirmed.
There is no enough packets to cause the write-back and writing to FPD
fixes it.

That patch will flush every 500ms with high traffic too which isn't good
for performance, though it would be a flush of up to 4 descriptors as far
as I understand.

I like Michael's approach to let the watchdog detects the hang first, then
try to flush. Michael told me that we could flush and use the interrupt
raised when the write-back ends to clean up.

I think if there is a real TX hang (i.e. no interrupt event), it will take
another watchdog cycle to detect that. It seems to me too much time
without taking any action.

Maybe something like this would work:
1) watchdog detects the hang
2) check for FLAG2_DMA_BURST flag
3) if yes, force flush, set a bit flag in the TX ring and schedule
   watchdog with a short period
4) if the TXDW interrupt happens, cleans up and reset the bit flag.
5) if not, the watchdog will expire, that bit flag will remain set then
   it will take any action assuming a real hang has occurred.

thanks,
fbl

On Wed, 26 Oct 2011 17:27:04 +0800
Michael Wang wang...@linux.vnet.ibm.com wrote:

> Hi, Flavio, Jesse
>
> I have sent out the patch, which I hope can do some help.
>
> Because this is my first time to send a patch, I am sorry if I have done
> some silly thing. And please tell me if there are some problem about it.
>
> Thanks
>
> Best regards,
> Michael Wang
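The five-step proposal in this message can be sketched as a tiny state machine. This is only an illustration of the control flow under stated assumptions: it is not driver code, and every name in it (TxRing, flush_pending, the returned strings) is hypothetical. The `dma_burst` field stands in for FLAG2_DMA_BURST and `flush_pending` for the "bit flag in the TX ring".

```python
from dataclasses import dataclass


@dataclass
class TxRing:
    dma_burst: bool              # stands in for FLAG2_DMA_BURST (step 2)
    flush_pending: bool = False  # the per-ring bit flag from step 3


def watchdog_sees_stuck_entry(ring: TxRing) -> str:
    """Steps 1-3 and 5: what the watchdog does when an entry looks stuck."""
    if ring.dma_burst and not ring.flush_pending:
        # First detection with bursting enabled: force a flush, remember
        # that we did, and re-arm the watchdog with a short period.
        ring.flush_pending = True
        return "flush_and_rearm"
    # Either bursting is off, or a flush was already forced and no TXDW
    # interrupt cleared the flag: treat it as a real hang.
    return "real_hang"


def txdw_interrupt(ring: TxRing) -> str:
    """Step 4: the write-back interrupt arrived; clean up and clear the flag."""
    ring.flush_pending = False
    return "cleaned"


# Missing write-back: the forced flush is followed by a TXDW interrupt.
ring = TxRing(dma_burst=True)
print(watchdog_sees_stuck_entry(ring), txdw_interrupt(ring))

# Real hang: no interrupt arrives between two watchdog runs.
ring = TxRing(dma_burst=True)
print(watchdog_sees_stuck_entry(ring), watchdog_sees_stuck_entry(ring))
```

The design point is that a forced flush costs one short watchdog period, so a genuine hang is still reported after one extra short interval rather than a full cycle.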
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Hi, Flavio, Jesse

I have sent out the patch, which I hope can do some help.

Because this is my first time to send a patch, I am sorry if I have done
some silly thing. And please tell me if there are some problem about it.

Thanks

Best regards,
Michael Wang
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 10/25/2011 12:26 AM, Flavio Leitner wrote:
> On Mon, 24 Oct 2011 16:26:28 +0800 Michael Wang wang...@linux.vnet.ibm.com wrote:
>> On 10/21/2011 10:03 PM, Flavio Leitner wrote:
>>> On Fri, 21 Oct 2011 14:15:12 +0800 Michael Wang wang...@linux.vnet.ibm.com wrote:
>>>> On 10/19/2011 08:16 PM, Flavio Leitner wrote:
>>>>> On Wed, 19 Oct 2011 12:49:48 +0800 wangyun wang...@linux.vnet.ibm.com wrote:
>>>>>> Hi, Flavio
>>>>>>
>>>>>> I am new to join the community, work on e1000e driver currently,
>>>>>> And I found a thing strange in this issue, please check below.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael Wang
>>>>>>
>>>>>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
>>>>>>> On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburg jesse.brandeb...@intel.com wrote:
>>>>>>>> On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitner f...@redhat.com wrote:
>>>>>>>>
>>>>>>>> TDH is probably not moving due to the writeback threshold settings in
>>>>>>>> TXDCTL. netperf UDP_RR test is likely a good way to test this.
>>>>>>>
>>>>>>> Yeah, makes sense. I haven't heard about new events after had removed
>>>>>>> the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact
>>>>>>> same hardware and I haven't reproduced the issue in-house yet with
>>>>>>> another 82571EB. See below about interface statistics from sar.
>>>>>>
>>>>>> Currently, if FLAG2_DMA_BURST set, the device will pre-fetch the tx
>>>>>> descriptor only when:
>>>>>> 1. the descriptor device cached is lower then 32.
>>>>>> 2. The descriptor host prepared is at least one.
>>>>>>
>>>>>> I don't think this will cause that issue, but another thing it done is
>>>>>> to set the device to write-back the processed descriptor only when the
>>>>>> amount reach 5(or 4).
>>>>>>
>>>>>> So may be when the device get a descriptor and processed, but the amount
>>>>>> not reached 5, so it don't write-back it, but actually already
>>>>>> transmitted. That could explain the issue and the fact that sometimes
>>>>>> the hang info printed shows empty ring (write-back happened in the
>>>>>> middle).
>>>>>>
>>>>>> But this will happen only when the transmit suddenly stopped for one
>>>>>> second or more, I don't know whether this is the real traffic situation
>>>>>> or not.
>>>>>
>>>>> At least for one customer the interface had almost no traffic.
>>>>> I will go over all the data again checking if this happens every time.
>>>>>
>>>>>> And may be I am wrong about this, but also I think this may be the only
>>>>>> reason cause this issue.
>>>>>
>>>>> I am seeing this based on the debugging output:
>>>>>
>>>>> This is the full output with debugging patch applied:
>>>>> Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
>>>>> Oct 11 02:03:52 kernel: TDH 25
>>>>> Oct 11 02:03:52 kernel: TDT 26
>>>>> Oct 11 02:03:52 kernel: next_to_use 26
>>>>> Oct 11 02:03:52 kernel: next_to_clean 25
>>>>> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
>>>>> Oct 11 02:03:52 kernel: time_stamp 100b2aa22
>>>>> Oct 11 02:03:52 kernel: next_to_watch 25
>>>>> Oct 11 02:03:52 kernel: jiffies 100b2ab25
>>>>> Oct 11 02:03:52 kernel: next_to_watch.status 0
>>>>> Oct 11 02:03:52 kernel: stored_i = 25
>>>>> Oct 11 02:03:52 kernel: stored_first = 25
>>>>> Oct 11 02:03:52 kernel: stamp = 100b2aa22
>>>>> Oct 11 02:03:52 kernel: factor = fa
>>>>> Oct 11 02:03:52 kernel: last_clean = 100b2aa1a
>>>>> Oct 11 02:03:52 kernel: last_tx = 100b2aa22
>>>>> Oct 11 02:03:52 kernel: count = 0/100
>>>>>
>>>>> Notice above that buffer_info time_stamp is the same as in last_tx (last
>>>>> time the xmit function was called), also that last_clean (last time the
>>>>> clean function was called) is before that. Therefore, the system sent
>>>>> just one descriptor in about 1 second confirming your idea.
>>>>
>>>> So have you try to use the Red Hat 6, is this problem still exist?
>>>
>>> Actually, I received few other reports that looks like to be same issue
>>> but with 6.2. As far as I can tell, hardware that was working just fine
>>> started to show it after the kernel upgrade (coincidentally 5.7 and 6.2
>>> introduces FLAG2_DMA_BURST). However, I haven't heard anything back since
>>> I had provided the instrumented kernel to confirm to you. I will follow
>>> up as soon as I hear something.
>>>
>>> Assuming that your idea is true, the hang detection is broken because
>>> it's possible to have a descriptor apparently stuck that is just missing
>>> the write-back. So, is it possible to set a timer to write-back? If yes,
>>> it could expire and run before the hang detection period expires. Or
>>> perhaps force the write-back to happen before hang detection execution.
>>
>> According to code ew32(TIDV, adapter->tx_int_delay);, I think such timer
>> has been already set, but I don't know if the tx_int_delay is the default
>> value which is 8(units of 1.024 μs).
>>
>> TIDV means if the time expire, it will flush the write-back, enforced.
>> The default value is very less than 1sec, it can not caused this issue.
>
> Customer has a test system reproducing this with 5.7, we can test patches
> there if you like. Just let me know.
> thank you!
> fbl

May be you can just search macro E1000_TXDCTL_DMA_BURST_ENABLE in
drivers/net/e1000e/e1000.h, change it to:

#define E1000_TXDCTL_DMA_BURST_ENABLE \
        (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
         E1000_TXDCTL_COUNT_DESC | \
         (0 << 16) | /* wthresh must be +1 more than desired */\
         (1 << 8) |
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, 24 Oct 2011 23:29:34 -0700
Michael Wang wang...@linux.vnet.ibm.com wrote:

> May be you can just search macro E1000_TXDCTL_DMA_BURST_ENABLE in
> drivers/net/e1000e/e1000.h, change it to:
>
> #define E1000_TXDCTL_DMA_BURST_ENABLE \
>         (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
>          E1000_TXDCTL_COUNT_DESC | \
>          (0 << 16) | /* wthresh must be +1 more than desired */\
>          (1 << 8) | /* hthresh */ \
>          0x1f) /* pthresh */
>
> this will do the write-back even only one has been done, if the problem
> solved, we can think about a good solution.

I can already tell you that this will fix the problem, but wthresh=1 is
more like the hardware default after reset I think. Doing this will
prevent the bursting behavior that got us the performance improvement this
patch was made for, which is bad.

That is why we are looking at a solution that likely involves two flush
writes via the flush partial descriptors bits. Just do the bit 31 set in
TIDV and RDTR twice in a row and then make sure it is write flushed.

If you wish to implement that and give it a try that would be useful
information. We haven't had time yet to get a full repro going.
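To make the threshold discussion concrete, here is a small sketch that composes and decodes a TXDCTL value the way the macro does. The constants for E1000_TXDCTL_GRAN and E1000_TXDCTL_COUNT_DESC, and the 6-bit field widths, are my assumptions from the e1000e headers and the controller documentation; the function names are made up. With wthresh=5 the result matches the `TXDCTL[0-1] 0145011f` value in the register dump posted earlier in this thread.

```python
E1000_TXDCTL_GRAN = 0x01000000        # descriptor granularity (assumed value)
E1000_TXDCTL_COUNT_DESC = 0x00400000  # assumed value


def txdctl(wthresh: int, hthresh: int = 1, pthresh: int = 0x1F) -> int:
    """Compose TXDCTL the same way the burst-enable macro does."""
    return (E1000_TXDCTL_GRAN | E1000_TXDCTL_COUNT_DESC |
            (wthresh << 16) | (hthresh << 8) | pthresh)


def decode_txdctl(value: int) -> dict:
    """Split TXDCTL into its threshold fields (6-bit widths assumed)."""
    return {
        "pthresh": value & 0x3F,
        "hthresh": (value >> 8) & 0x3F,
        "wthresh": (value >> 16) & 0x3F,
        "count_desc": bool(value & E1000_TXDCTL_COUNT_DESC),
        "gran": bool(value & E1000_TXDCTL_GRAN),
    }


print(f"wthresh=5: {txdctl(5):08x}")  # value seen in the register dump
print(f"wthresh=0: {txdctl(0):08x}")  # value the proposed macro would program
print(decode_txdctl(0x0145011F))
```

The only field that differs between the two values is wthresh, which is the write-back batching that the reply wants to keep for performance.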
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 10/25/2011 11:57 PM, Jesse Brandeburg wrote:

> On Mon, 24 Oct 2011 23:29:34 -0700 Michael Wang wang...@linux.vnet.ibm.com wrote:
>
>> Maybe you can just search for the macro E1000_TXDCTL_DMA_BURST_ENABLE in drivers/net/e1000e/e1000.h and change it to:
>>
>> #define E1000_TXDCTL_DMA_BURST_ENABLE \
>>     (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
>>      E1000_TXDCTL_COUNT_DESC | \
>>      (0 << 16) | /* wthresh must be +1 more than desired */ \
>>      (1 << 8) | /* hthresh */ \
>>      0x1f) /* pthresh */
>>
>> This will do the write-back even when only one descriptor has been completed. If that solves the problem, we can think about a good solution.
>
> I can already tell you that this will fix the problem, but wthresh=1 is more like the hardware default after reset, I think. Doing this will prevent the bursting behavior that got us the performance improvement this patch was made for, which is bad.

Hi, Jesse

I was confused about the code

    ew32(TIDV, adapter->tx_int_delay);

I think this will cause an enforced write-back flush every 8 * 1.024 μs by default. If it works, I don't know why wthresh = 5 would cause this issue, because even when there are not enough descriptors (over 4), the write-back will still be done every 8 * 1.024 μs.

> That is why we are looking at a solution that likely involves two flush writes via the flush partial descriptors bits. Just set bit 31 in TIDV and RDTR twice in a row and then make sure it is write flushed. If you wish to implement that and give it a try, that would be useful information. We haven't had time yet to get a full repro going.

Despite my confusion, I will still try to do that work, but I really don't know whether this issue is caused by wthresh or not.

Thanks
Best regards
Michael Wang
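The flush sequence suggested in the thread (set bit 31 in TIDV and RDTR twice, then make sure the writes are flushed) can be sketched against a toy register file. Everything below is a stand-in: `struct fake_nic`, `reg_write()`, `reg_read()` and `flush_descriptors()` are hypothetical, the register indices are not real offsets, and treating bit 31 as self-clearing is an assumption. The ordering used here (TIDV, RDTR, TIDV, RDTR) is one reading of "twice in a row"; the thread does not say which ordering is intended.

```c
#include <stdint.h>

/* Toy register file standing in for MMIO; indices are illustrative only. */
enum { REG_STATUS, REG_TIDV, REG_RDTR, REG_COUNT };

#define FPD_BIT (1u << 31) /* "flush partial descriptors", bit 31 per the thread */

struct fake_nic {
    uint32_t regs[REG_COUNT];
    unsigned tx_flush_requests; /* TIDV writes that carried the flush bit */
    unsigned rx_flush_requests; /* RDTR writes that carried the flush bit */
    unsigned posted_writes;     /* writes not yet pushed out by a read */
};

static void reg_write(struct fake_nic *nic, int reg, uint32_t val)
{
    nic->regs[reg] = val & ~FPD_BIT; /* flush bit modelled as self-clearing (assumption) */
    nic->posted_writes++;
    if (val & FPD_BIT) {
        if (reg == REG_TIDV)
            nic->tx_flush_requests++;
        if (reg == REG_RDTR)
            nic->rx_flush_requests++;
    }
}

static uint32_t reg_read(struct fake_nic *nic, int reg)
{
    nic->posted_writes = 0; /* a read forces earlier posted writes to complete */
    return nic->regs[reg];
}

/* Request a descriptor write-back twice, then flush the posted writes. */
static void flush_descriptors(struct fake_nic *nic, uint32_t tx_int_delay,
                              uint32_t rx_int_delay)
{
    for (int i = 0; i < 2; i++) {
        reg_write(nic, REG_TIDV, tx_int_delay | FPD_BIT);
        reg_write(nic, REG_RDTR, rx_int_delay | FPD_BIT);
    }
    (void)reg_read(nic, REG_STATUS);
}
```

The delay values are OR-ed back in so that the flush request does not change the configured interrupt delays. In a real driver the same idea would go through the driver's own register accessors rather than this mock.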
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 10/21/2011 10:03 PM, Flavio Leitner wrote: On Fri, 21 Oct 2011 14:15:12 +0800 Michael Wangwang...@linux.vnet.ibm.com wrote: On 10/19/2011 08:16 PM, Flavio Leitner wrote: On Wed, 19 Oct 2011 12:49:48 +0800 wangyunwang...@linux.vnet.ibm.com wrote: Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.comwrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.comwrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. 
netperf UDP_RR test is likely a good way to test this.

Yeah, makes sense. I haven't heard about new events after having removed the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar.

Currently, if FLAG2_DMA_BURST is set, the device will prefetch TX descriptors only when:
1. the number of descriptors cached by the device is lower than 32, and
2. the host has prepared at least one descriptor.

I don't think this will cause the issue, but the flag also sets the device to write back processed descriptors only when their number reaches 5 (or 4). So maybe the device fetched and processed a descriptor, but because the count had not reached 5 it did not write it back, even though the packet had actually been transmitted.

But this will happen only when transmission suddenly stops for one second or more; I don't know whether this is the real traffic situation or not. Maybe I am wrong about this, but I think it may be the only thing that could cause this issue.

I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon.

Yeah, the sequence might not be exact, but it gives us a good idea of what could be happening.
There are two events right after another:

Oct 9 05:45:23 kernel:   TDH                  48
Oct 9 05:45:23 kernel:   TDT                  49
Oct 9 05:45:23 kernel:   next_to_use          49
Oct 9 05:45:23 kernel:   next_to_clean        48
Oct 9 05:45:23 kernel: buffer_info[next_to_clean]:
Oct 9 05:45:23 kernel:   time_stamp           102338ca6
Oct 9 05:45:23 kernel:   next_to_watch        48
Oct 9 05:45:23 kernel:   jiffies              102338dc1
Oct 9 05:45:23 kernel:   next_to_watch.status 0
Oct 9 05:45:23 kernel: MAC Status             80383
Oct 9 05:45:23 kernel: PHY Status             792d
Oct 9 05:45:23 kernel: PHY 1000BASE-T Status  3800
Oct 9 05:45:23 kernel: PHY Extended Status    3000
Oct 9 05:45:23 kernel: PCI Status             10

Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct 9 05:51:54 kernel:   TDH                  55
Oct 9 05:51:54 kernel:   TDT                  56
Oct 9 05:51:54 kernel:   next_to_use          56
Oct 9 05:51:54 kernel:   next_to_clean        55
Oct 9 05:51:54 kernel: buffer_info[next_to_clean]:
Oct 9 05:51:54 kernel:   time_stamp           102350986
Oct 9 05:51:54 kernel:   next_to_watch        55
Oct 9 05:51:54 kernel:   jiffies              102350b07
Oct 9 05:51:54 kernel:   next_to_watch.status 0
Oct 9 05:51:54 kernel: MAC Status             80383
Oct 9 05:51:54 kernel: PHY Status             792d
Oct 9 05:51:54 kernel: PHY 1000BASE-T Status  3800
Oct 9 05:51:54 kernel: PHY Extended Status    3000
Oct 9 05:51:54 kernel: PCI Status             10

I see the test for a hang is:

    time_after(jiffies, tx_ring->buffer_info[i].time_stamp + (adapter->tx_timeout_factor * HZ))

which means the hang happened when
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, 24 Oct 2011 16:26:28 +0800 Michael Wang wang...@linux.vnet.ibm.com wrote: On 10/21/2011 10:03 PM, Flavio Leitner wrote: On Fri, 21 Oct 2011 14:15:12 +0800 Michael Wangwang...@linux.vnet.ibm.com wrote: On 10/19/2011 08:16 PM, Flavio Leitner wrote: On Wed, 19 Oct 2011 12:49:48 +0800 wangyunwang...@linux.vnet.ibm.com wrote: Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.comwrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.comwrote: TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. Currently, if FLAG2_DMA_BURST setted, the device will pre-fetch the tx descriptor only when: 1. the descriptor device cached is lower then 32. 2. The descriptor host prepared is at least one. I don't think this will cause that issue, but another thing it done is to set the device to write-back the processed descriptor only when the amount reach 5(or 4). So may be when the device get a descriptor and processed, but the amount not reached 5, so it don't write-back it, but actually already transmitted. That could explain the issue and the fact that sometimes the hang info printed shows empty ring (write-back happened in the middle). But this will happen only when the transmit suddenly stopped for one second or more, I don't know whether this is the real traffic situation or not. At least for one customer the interface had almost no traffic. 
I will go over all the data again, checking whether this happens every time.

And maybe I am wrong about this, but I also think this may be the only thing that could cause this issue.

I am seeing this based on the debugging output:

This is the full output with the debugging patch applied:

Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel:   TDH                  25
Oct 11 02:03:52 kernel:   TDT                  26
Oct 11 02:03:52 kernel:   next_to_use          26
Oct 11 02:03:52 kernel:   next_to_clean        25
Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
Oct 11 02:03:52 kernel:   time_stamp           100b2aa22
Oct 11 02:03:52 kernel:   next_to_watch        25
Oct 11 02:03:52 kernel:   jiffies              100b2ab25
Oct 11 02:03:52 kernel:   next_to_watch.status 0
Oct 11 02:03:52 kernel: stored_i     = 25
Oct 11 02:03:52 kernel: stored_first = 25
Oct 11 02:03:52 kernel: stamp        = 100b2aa22
Oct 11 02:03:52 kernel: factor       = fa
Oct 11 02:03:52 kernel: last_clean   = 100b2aa1a
Oct 11 02:03:52 kernel: last_tx      = 100b2aa22
Oct 11 02:03:52 kernel: count        = 0/100

Notice above that the buffer_info time_stamp is the same as last_tx (the last time the xmit function was called), and that last_clean (the last time the clean function was called) is before that. Therefore, the system sent just one descriptor in about 1 second, confirming your idea.

So have you tried Red Hat 6? Does the problem still exist there?

Actually, I received a few other reports that look like the same issue, but with 6.2. As far as I can tell, hardware that was working just fine started to show it after the kernel upgrade (coincidentally, 5.7 and 6.2 introduce FLAG2_DMA_BURST). However, I haven't heard anything back since I provided the instrumented kernel to confirm it to you. I will follow up as soon as I hear something.

Assuming that your idea is true, the hang detection is broken, because it's possible to have a descriptor that is apparently stuck but is just missing the write-back. So, is it possible to set a timer to write back? If yes, it could expire and run before the hang detection period expires. Or perhaps force the write-back to happen before the hang detection runs.

The customer has a test system reproducing this with 5.7; we can test patches there if you like. Just let me know.

thank you!
fbl
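The hypothesis discussed in this message can be illustrated with a toy model: the device transmits a descriptor but delays the status write-back until a threshold is reached, so the host sees an entry that looks stuck. This is only a sketch of the idea, not a model of the real 82571EB; `struct toy_tx_ring` and all function names are hypothetical, and treating a threshold of 0 or 1 as "write back immediately" is an assumption.

```c
#include <stdbool.h>

#define RING_SIZE 256

struct toy_tx_ring {
    bool     dd[RING_SIZE]; /* "descriptor done" status as seen by the host */
    unsigned tail;          /* next slot the host will fill */
    unsigned sent;          /* next descriptor the device has yet to transmit */
    unsigned written_back;  /* first descriptor whose status is not written back */
    unsigned wthresh;       /* write-back threshold; 0 or 1 means immediate */
};

/* Device writes back status for everything it has transmitted so far. */
static void toy_flush(struct toy_tx_ring *r)
{
    while (r->written_back != r->sent) {
        r->dd[r->written_back] = true;
        r->written_back = (r->written_back + 1) % RING_SIZE;
    }
}

/* Device transmits everything queued, but writes back only in batches. */
static void toy_device_run(struct toy_tx_ring *r)
{
    unsigned pending;

    r->sent = r->tail;
    pending = (r->sent + RING_SIZE - r->written_back) % RING_SIZE;
    if (pending > 0 && pending >= r->wthresh)
        toy_flush(r);
}

/* Host queues one packet and advances the tail. */
static unsigned toy_xmit(struct toy_tx_ring *r)
{
    unsigned slot = r->tail;

    r->dd[slot] = false;
    r->tail = (r->tail + 1) % RING_SIZE;
    return slot;
}
```

With `wthresh = 5` and a single queued packet, `toy_device_run()` transmits the packet but leaves `dd` clear, which is what a hang check based only on the descriptor status would misread as a stuck unit. Calling `toy_flush()` before that check, as proposed above, makes the status visible.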
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 10/19/2011 08:16 PM, Flavio Leitner wrote: On Wed, 19 Oct 2011 12:49:48 +0800 wangyunwang...@linux.vnet.ibm.com wrote: Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.com wrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.com wrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. 
I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon. Yeah, the sequence might not be exact, but gives us a good idea of what could be happening. There are two events right after another: Oct 9 05:45:23 kernel: TDH48 Oct 9 05:45:23 kernel: TDT49 Oct 9 05:45:23 kernel: next_to_use49 Oct 9 05:45:23 kernel: next_to_clean48 Oct 9 05:45:23 kernel: buffer_info[next_to_clean]: Oct 9 05:45:23 kernel: time_stamp102338ca6 Oct 9 05:45:23 kernel: next_to_watch48 Oct 9 05:45:23 kernel: jiffies102338dc1 Oct 9 05:45:23 kernel: next_to_watch.status0 Oct 9 05:45:23 kernel: MAC Status80383 Oct 9 05:45:23 kernel: PHY Status792d Oct 9 05:45:23 kernel: PHY 1000BASE-T Status3800 Oct 9 05:45:23 kernel: PHY Extended Status3000 Oct 9 05:45:23 kernel: PCI Status10 Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 9 05:51:54 kernel: TDH55 Oct 9 05:51:54 kernel: TDT56 Oct 9 05:51:54 kernel: next_to_use56 Oct 9 05:51:54 kernel: next_to_clean55 Oct 9 05:51:54 kernel: buffer_info[next_to_clean]: Oct 9 05:51:54 kernel: time_stamp102350986 Oct 9 05:51:54 kernel: next_to_watch55 Oct 9 05:51:54 kernel: jiffies102350b07 Oct 9 05:51:54 kernel: next_to_watch.status0 Oct 9 05:51:54 kernel: MAC Status80383 Oct 9 05:51:54 kernel: PHY Status792d Oct 9 05:51:54 kernel: PHY 1000BASE-T Status3800 Oct 9 05:51:54 kernel: PHY Extended Status3000 Oct 9 05:51:54 kernel: PCI Status10 I see the judgement of hang is: time_after(jiffies, tx_ring-buffer_info[i].time_stamp + (adapter-tx_timeout_factor * HZ)) which means the hang happened when current jiffies minus buffer's time stamp is over (adapter-tx_timeout_factor * HZ). 
And I see the tx_timeout_factor will at least be 1, so on x86 the (jiffies-time_stamp) should over 1000, but here looks only around 300. Could you please check the HZ number of your platform? sure, adapter-tx_timeout_factor * HZ = 0xfa/250d That data came from a customer using kernel-xen, so HZ is 250. Here is the debugging patch used: http://people.redhat.com/~fleitner/linux-kernel-test.patch The idea was to capture all the relevant values at the time of the problem. (The print_hang_task is scheduled and sometimes it shows timestamp=0, TDH=TDT because the packet is already sent) This is the full output with debugging patch applied: Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 11 02:03:52 kernel: TDH25 Oct 11 02:03:52 kernel: TDT26 Oct 11
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Fri, 21 Oct 2011 14:15:12 +0800 Michael Wang wang...@linux.vnet.ibm.com wrote: On 10/19/2011 08:16 PM, Flavio Leitner wrote: On Wed, 19 Oct 2011 12:49:48 +0800 wangyunwang...@linux.vnet.ibm.com wrote: Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.com wrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.com wrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. 
netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon. Yeah, the sequence might not be exact, but gives us a good idea of what could be happening. There are two events right after another: Oct 9 05:45:23 kernel: TDH48 Oct 9 05:45:23 kernel: TDT49 Oct 9 05:45:23 kernel: next_to_use49 Oct 9 05:45:23 kernel: next_to_clean48 Oct 9 05:45:23 kernel: buffer_info[next_to_clean]: Oct 9 05:45:23 kernel: time_stamp102338ca6 Oct 9 05:45:23 kernel: next_to_watch48 Oct 9 05:45:23 kernel: jiffies102338dc1 Oct 9 05:45:23 kernel: next_to_watch.status0 Oct 9 05:45:23 kernel: MAC Status80383 Oct 9 05:45:23 kernel: PHY Status792d Oct 9 05:45:23 kernel: PHY 1000BASE-T Status3800 Oct 9 05:45:23 kernel: PHY Extended Status3000 Oct 9 05:45:23 kernel: PCI Status10 Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 9 05:51:54 kernel: TDH55 Oct 9 05:51:54 kernel: TDT56 Oct 9 05:51:54 kernel: next_to_use56 Oct 9 05:51:54 kernel: next_to_clean55 Oct 9 05:51:54 kernel: buffer_info[next_to_clean]: Oct 9 05:51:54 kernel: time_stamp102350986 Oct 9 05:51:54 kernel: next_to_watch55 Oct 9 05:51:54 kernel: jiffies102350b07 Oct 9 05:51:54 kernel: next_to_watch.status0 Oct 9 05:51:54 kernel: MAC Status80383 Oct 9 05:51:54 kernel: PHY Status792d Oct 9 05:51:54 kernel: PHY 1000BASE-T Status3800 Oct 9 05:51:54 kernel: PHY Extended Status3000 Oct 9 05:51:54 kernel: PCI Status10 I see the judgement of hang is: time_after(jiffies, tx_ring-buffer_info[i].time_stamp + (adapter-tx_timeout_factor * HZ)) which means the hang happened when current 
jiffies minus the buffer's time stamp is over (adapter->tx_timeout_factor * HZ). And I see that tx_timeout_factor will be at least 1, so on x86 (jiffies - time_stamp) should be over 1000, but here it looks like only around 300. Could you please check the HZ value of your platform?

Sure: adapter->tx_timeout_factor * HZ = 0xfa (250 decimal). That data came from a customer using kernel-xen, so HZ is 250.

Here is the debugging patch used: http://people.redhat.com/~fleitner/linux-kernel-test.patch

The idea was to capture all the relevant values at the time of the problem. (The print_hang_task is scheduled, and sometimes it shows timestamp=0 and TDH=TDT because the packet has already been sent.)
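The arithmetic behind the HZ question can be checked directly against the logged values. The sketch below reimplements a wraparound-safe comparison in the spirit of the kernel's time_after() for 64-bit jiffies; `jiffies_after()` and `tx_looks_hung()` are hypothetical names for illustration, not driver functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b", in the spirit of the kernel's time_after(). */
static bool jiffies_after(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0;
}

/* Hang test as quoted in the thread: has the oldest buffer waited
 * longer than factor * HZ jiffies? */
static bool tx_looks_hung(uint64_t now, uint64_t time_stamp,
                          unsigned factor, unsigned hz)
{
    return jiffies_after(now, time_stamp + (uint64_t)factor * hz);
}
```

For the first logged event, jiffies 0x102338dc1 minus time_stamp 0x102338ca6 is 0x11b, or 283 jiffies. That exceeds 250 (HZ = 250, factor 1), so the test fires, while the same gap would not exceed 1000 on a kernel built with HZ = 1000. The second event has a gap of 385 jiffies and the 11 Oct event a gap of 259.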
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Wed, 19 Oct 2011 12:49:48 +0800 wangyun wang...@linux.vnet.ibm.com wrote: Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.com wrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.com wrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. 
Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon. Yeah, the sequence might not be exact, but gives us a good idea of what could be happening. There are two events right after another: Oct 9 05:45:23 kernel: TDH48 Oct 9 05:45:23 kernel: TDT49 Oct 9 05:45:23 kernel: next_to_use49 Oct 9 05:45:23 kernel: next_to_clean48 Oct 9 05:45:23 kernel: buffer_info[next_to_clean]: Oct 9 05:45:23 kernel: time_stamp102338ca6 Oct 9 05:45:23 kernel: next_to_watch48 Oct 9 05:45:23 kernel: jiffies102338dc1 Oct 9 05:45:23 kernel: next_to_watch.status0 Oct 9 05:45:23 kernel: MAC Status80383 Oct 9 05:45:23 kernel: PHY Status792d Oct 9 05:45:23 kernel: PHY 1000BASE-T Status3800 Oct 9 05:45:23 kernel: PHY Extended Status3000 Oct 9 05:45:23 kernel: PCI Status10 Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 9 05:51:54 kernel: TDH55 Oct 9 05:51:54 kernel: TDT56 Oct 9 05:51:54 kernel: next_to_use56 Oct 9 05:51:54 kernel: next_to_clean55 Oct 9 05:51:54 kernel: buffer_info[next_to_clean]: Oct 9 05:51:54 kernel: time_stamp102350986 Oct 9 05:51:54 kernel: next_to_watch55 Oct 9 05:51:54 kernel: jiffies102350b07 Oct 9 05:51:54 kernel: next_to_watch.status0 Oct 9 05:51:54 kernel: MAC Status80383 Oct 9 05:51:54 kernel: PHY Status792d Oct 9 05:51:54 kernel: PHY 1000BASE-T Status3800 Oct 9 05:51:54 kernel: PHY Extended Status3000 Oct 9 05:51:54 kernel: PCI Status10 I see the judgement of hang is: time_after(jiffies, tx_ring-buffer_info[i].time_stamp + (adapter-tx_timeout_factor * HZ)) which means the hang happened when current jiffies minus buffer's time stamp is over (adapter-tx_timeout_factor * HZ). 
And I see the tx_timeout_factor will at least be 1, so on x86 the (jiffies-time_stamp) should over 1000, but here looks only around 300. Could you please check the HZ number of your platform? sure, adapter-tx_timeout_factor * HZ = 0xfa/250d That data came from a customer using kernel-xen, so HZ is 250. Here is the debugging patch used: http://people.redhat.com/~fleitner/linux-kernel-test.patch The idea was to capture all the relevant values at the time of the problem. (The print_hang_task is scheduled and sometimes it shows timestamp=0, TDH=TDT because the packet is already sent) This is the full output with debugging patch applied: Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 11 02:03:52 kernel: TDH
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburg jesse.brandeb...@intel.com wrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitner f...@redhat.com wrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. I don't think the sequence is quite what you said. 
We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon.

Yeah, the sequence might not be exact, but it gives us a good idea of what could be happening. There are two events right after another:

Oct 9 05:45:23 kernel:   TDH                  48
Oct 9 05:45:23 kernel:   TDT                  49
Oct 9 05:45:23 kernel:   next_to_use          49
Oct 9 05:45:23 kernel:   next_to_clean        48
Oct 9 05:45:23 kernel: buffer_info[next_to_clean]:
Oct 9 05:45:23 kernel:   time_stamp           102338ca6
Oct 9 05:45:23 kernel:   next_to_watch        48
Oct 9 05:45:23 kernel:   jiffies              102338dc1
Oct 9 05:45:23 kernel:   next_to_watch.status 0
Oct 9 05:45:23 kernel: MAC Status             80383
Oct 9 05:45:23 kernel: PHY Status             792d
Oct 9 05:45:23 kernel: PHY 1000BASE-T Status  3800
Oct 9 05:45:23 kernel: PHY Extended Status    3000
Oct 9 05:45:23 kernel: PCI Status             10

Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct 9 05:51:54 kernel:   TDH                  55
Oct 9 05:51:54 kernel:   TDT                  56
Oct 9 05:51:54 kernel:   next_to_use          56
Oct 9 05:51:54 kernel:   next_to_clean        55
Oct 9 05:51:54 kernel: buffer_info[next_to_clean]:
Oct 9 05:51:54 kernel:   time_stamp           102350986
Oct 9 05:51:54 kernel:   next_to_watch        55
Oct 9 05:51:54 kernel:   jiffies              102350b07
Oct 9 05:51:54 kernel:   next_to_watch.status 0
Oct 9 05:51:54 kernel: MAC Status             80383
Oct 9 05:51:54 kernel: PHY Status             792d
Oct 9 05:51:54 kernel: PHY 1000BASE-T Status  3800
Oct 9 05:51:54 kernel: PHY Extended Status    3000
Oct 9 05:51:54 kernel: PCI Status             10

This is the sar report; the interface was idling.

00:00:01  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
05:40:01  eth7      1.13     0.03   944.69     4.14     0.00     0.00      0.87
05:50:01  eth7      1.25     0.03   952.37     4.13     0.00     0.00      0.87
06:00:01  eth7      1.14     0.03   947.26     4.14     0.00     0.00      0.87

00:00:01  IFACE  rxerr/s  txerr/s  coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
05:40:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:50:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:00:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00

ethtool -i eth7:
driver: e1000e
version: 1.3.10-k2
firmware-version: 5.12-2
bus-info: :22:00.1

22:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
22:00.1 0200: 8086:10bc (rev 06)

(the rest of the lspci is on the first email,
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Hi, Flavio I am new to join the community, work on e1000e driver currently, And I found a thing strange in this issue, please check below. Thanks, Michael Wang On 10/18/2011 10:42 PM, Flavio Leitner wrote: On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburgjesse.brandeb...@intel.com wrote: On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitnerf...@redhat.com wrote: Hi, I got few reports so far that 82571EB models are having the Detected Hardware Unit Hang issue after upgrading the kernel. Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches with the last time e1000_xmit_frame() was called. Also that the time stamp of e1000_clean_tx_irq() last run is prior to the one in socket buffer. However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission. Order of events: (1) skb is pushed down (2) e1000_xmit_frame() is called (3) ring is filled with one entry (4) TDT is updated (5) nothing happens for little more than 1 second (6) interrupt is fired (7) e1000_clean_tx_irq() is called (8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue. (9) print_hang_task runs, dump the info but the old entry is now sent (10) apparently the TX queue is back. Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number. It was buried in the end of the first email: https://bugzilla.redhat.com/show_bug.cgi?id=746272 TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this. Yeah, makes sense. I haven't heard about new events after had removed the flag FLAG2_DMA_BURST. 
Unfortunately, I don't have access to the exact same hardware and I haven't reproduced the issue in-house yet with another 82571EB. See below about interface statistics from sar. I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon. Yeah, the sequence might not be exact, but gives us a good idea of what could be happening. There are two events right after another: Oct 9 05:45:23 kernel: TDH48 Oct 9 05:45:23 kernel: TDT49 Oct 9 05:45:23 kernel: next_to_use49 Oct 9 05:45:23 kernel: next_to_clean48 Oct 9 05:45:23 kernel: buffer_info[next_to_clean]: Oct 9 05:45:23 kernel: time_stamp102338ca6 Oct 9 05:45:23 kernel: next_to_watch48 Oct 9 05:45:23 kernel: jiffies102338dc1 Oct 9 05:45:23 kernel: next_to_watch.status0 Oct 9 05:45:23 kernel: MAC Status80383 Oct 9 05:45:23 kernel: PHY Status792d Oct 9 05:45:23 kernel: PHY 1000BASE-T Status3800 Oct 9 05:45:23 kernel: PHY Extended Status3000 Oct 9 05:45:23 kernel: PCI Status10 Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang: Oct 9 05:51:54 kernel: TDH55 Oct 9 05:51:54 kernel: TDT56 Oct 9 05:51:54 kernel: next_to_use56 Oct 9 05:51:54 kernel: next_to_clean55 Oct 9 05:51:54 kernel: buffer_info[next_to_clean]: Oct 9 05:51:54 kernel: time_stamp102350986 Oct 9 05:51:54 kernel: next_to_watch55 Oct 9 05:51:54 kernel: jiffies102350b07 Oct 9 05:51:54 kernel: next_to_watch.status0 Oct 9 05:51:54 kernel: MAC Status80383 Oct 9 05:51:54 kernel: PHY Status792d Oct 9 05:51:54 kernel: PHY 1000BASE-T Status3800 Oct 9 05:51:54 kernel: PHY Extended Status3000 Oct 9 05:51:54 kernel: PCI Status10 I see the judgement of hang is: time_after(jiffies, tx_ring-buffer_info[i].time_stamp + (adapter-tx_timeout_factor * HZ)) which means the hang happened when current jiffies minus buffer's time stamp is over (adapter-tx_timeout_factor * HZ). 
And I see that tx_timeout_factor will be at least 1, so on x86 (jiffies - time_stamp) should be over 1000, but here it looks to be only around 300. Could you please check the HZ value of your platform?

This is the sar report; the interface was idling.

00:00:01   IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
05:40:01    eth7      1.13      0.03    944.69      4.14      0.00      0.00      0.87
05:50:01    eth7      1.25      0.03    952.37      4.13      0.00      0.00      0.87
06:00:01    eth7      1.14      0.03    947.26      4.14      0.00      0.00      0.87

00:00:01   IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
05:40:01    eth7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:50:01    eth7      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitner f...@redhat.com wrote:

Hi,

I got a few reports so far that 82571EB models are hitting the Detected Hardware Unit Hang issue after upgrading the kernel.

Further debugging with an instrumented kernel revealed that the socket buffer time stamp matches the last time e1000_xmit_frame() was called, and that the time stamp of the last e1000_clean_tx_irq() run is prior to the one in the socket buffer.

However, ~1 second later, an interrupt is fired and the old entry is found. Sometimes, the scheduled print_hang_task dumps the information _after_ the old entry is sent (shows empty ring), indicating that the HW TX unit isn't really stuck and apparently just missed the signal to initiate the transmission.

Order of events:
(1) skb is pushed down
(2) e1000_xmit_frame() is called
(3) ring is filled with one entry
(4) TDT is updated
(5) nothing happens for a little more than 1 second
(6) interrupt is fired
(7) e1000_clean_tx_irq() is called
(8) finds the entry not ready with an old time stamp, schedules print_hang_task and stops the TX queue
(9) print_hang_task runs and dumps the info, but the old entry has now been sent
(10) apparently the TX queue is back

Flavio, thanks for the detailed info, please be sure to supply us the bugzilla number.

TDH is probably not moving due to the writeback threshold settings in TXDCTL. netperf UDP_RR test is likely a good way to test this.

I don't think the sequence is quite what you said. We are going to work with the hardware team to get a sequence that works right, and we should have a fix for you soon.
The following commit seems to be related to the symptoms seen above:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3a3b75860527a11ba5035c6aa576079245d09e2a

From: Jesse Brandeburg jesse.brandeb...@intel.com
Date: Wed, 29 Sep 2010 21:38:49 + (+)
Subject: e1000e: use hardware writeback batching
X-Git-Tag: v2.6.37-rc1~147^2~299
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3a3b75860527a11ba5035c6aa576079245d09e2a

e1000e: use hardware writeback batching

Most e1000e parts support batching writebacks. The problem with this is that when some of the TADV or TIDV timers are not set, Tx can sit forever. This is solved in this patch with write flushes using the Flush Partial Descriptors (FPD) bit in TIDV and RDTR.

This improves bus utilization and removes partial writes on e1000e, particularly from 82571 parts in S5500 chipset based machines.

Only ES2LAN and 82571/2 parts are included in this optimization, to reduce testing load.

We modified the instrumented kernel to include the following patch, which disables the writeback batching feature, to narrow down the problem:

--- debug/drivers/net/e1000e/82571.c.orig 2011-10-11 14:00:44.0 -0300
+++ debug/drivers/net/e1000e/82571.c 2011-10-11 15:02:51.0 -0300
@@ -2028,8 +2028,7 @@ struct e1000_info e1000_82571_info = {
 				  | FLAG_RESET_OVERWRITES_LAA /* errata */
 				  | FLAG_TARC_SPEED_MODE_BIT /* errata */
 				  | FLAG_APME_CHECK_PORT_B,
-	.flags2			= FLAG2_DISABLE_ASPM_L1 /* errata 13 */
-				  | FLAG2_DMA_BURST,
+	.flags2			= FLAG2_DISABLE_ASPM_L1, /* errata 13 */
 	.pba			= 38,
 	.max_hw_frame_size	= DEFAULT_JUMBO,

and the customer confirmed that the issue has disappeared since then.
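To make the batching argument in the commit message easier to follow, here is a deliberately simplified toy model in Python. It is not the hardware's actual behaviour or register layout: the class ToyTxRing, its wthresh field and its flush() method are invented for illustration, and flush() merely stands in for the role the commit message gives to the Flush Partial Descriptors bit.

```python
from dataclasses import dataclass


@dataclass
class ToyTxRing:
    """Toy model of descriptor write-back batching (hypothetical, not the NIC)."""
    wthresh: int           # completions needed before a batch is written back
    pending: int = 0       # sent by the toy "hardware", not yet written back
    written_back: int = 0  # completions the driver can actually see

    def transmit(self, count: int = 1) -> None:
        """Send descriptors; write back only once a full batch has accumulated."""
        self.pending += count
        if self.pending >= self.wthresh:
            self._write_back()

    def flush(self) -> None:
        """Force out a partial batch, like a timer expiry or explicit flush would."""
        if self.pending:
            self._write_back()

    def _write_back(self) -> None:
        self.written_back += self.pending
        self.pending = 0


ring = ToyTxRing(wthresh=4)
ring.transmit(1)           # a single frame on an otherwise idle interface
print(ring.written_back)   # still 0: the driver sees no completion yet
ring.flush()
print(ring.written_back)   # 1: the completion appears only after the flush
```

In this model a lone descriptor on an idle link stays invisible to the driver until something forces the partial batch out, which is the shape of the symptom described above: one entry in the ring, then nothing for about a second.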
Board info:

1e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
1e:00.0 0200: 8086:10bc (rev 06)
    Subsystem: 103c:704b
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin B routed to IRQ 224
    Region 0: Memory at fd4e (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fd40 (32-bit, non-prefetchable) [size=512K]
    Region 2: I/O ports at 7000 [size=32]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: fee0 Data: 4073
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-