Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-12-18 Thread Joe Jin
Hi all,

I backported the MPS commits and asked the customer to pass pci=pcie_bus_peer2peer
to the kernel to limit MPS to 128, and the issue disappeared. It sounds like this
is a BIOS bug.
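
A minimal sketch for checking which MPS policy a machine was booted with. It assumes the policy names accepted by the pci= option are pcie_bus_tune_off, pcie_bus_safe, pcie_bus_perf and pcie_bus_peer2peer; the function name mps_policy is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Print the PCIe MPS policy named in a kernel command line string,
# or "none" if no policy option is present.
# Usage on a live system: mps_policy "$(cat /proc/cmdline)"
mps_policy() {
    for arg in $1; do
        case "$arg" in
            pci=*)
                # pci= takes a comma-separated list of sub-options.
                old_ifs=$IFS
                IFS=,
                for opt in ${arg#pci=}; do
                    case "$opt" in
                        pcie_bus_tune_off|pcie_bus_safe|pcie_bus_perf|pcie_bus_peer2peer)
                            IFS=$old_ifs
                            echo "$opt"
                            return 0
                            ;;
                    esac
                done
                IFS=$old_ifs
                ;;
        esac
    done
    echo "none"
}
```

With pcie_bus_peer2peer on the command line, the function prints that name, which is the setting that limited MPS to 128 here.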

Thanks for all of your help.

Best Regards,
Joe

-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 

___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-12-18 Thread Joe Jin
Hi Yijing,

Thanks for the references. The patch looks good to me, but I have no chance
to test it in the customer's environment.

Best Regards,
Joe

-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-12-18 Thread Yijing Wang
On 2012/12/19 11:04, Joe Jin wrote:
> Hi all,
>
> I backported the MPS commits and asked the customer to pass
> pci=pcie_bus_peer2peer to the kernel to limit MPS to 128, and the issue
> disappeared. It sounds like this is a BIOS bug.

Hi Joe,
   I found a similar problem when doing PCI hotplug; the discussion is here:
http://marc.info/?l=linux-pci&m=134810569924220&w=2.
We are trying to improve the Linux kernel to make this problem easier to debug,
based on Bjorn's suggestion. Jon sent out the first version of the patch:
http://marc.info/?l=linux-pci&m=135002016005274&w=2.
I think we can go further, see
http://marc.info/?l=linux-pci&m=135115581307869&w=2. I hope this information
helps.

Thanks!
Yijing.

-- 
Thanks!
Yijing


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-29 Thread Fujinaka, Todd
Someone else pointed this out to me locally. If you have a non-client BIOS, you 
should be able to set the MaxPayloadSize using setpci. You have to make sure 
that you're being consistent throughout all the associated links.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-28 Thread Joe Jin
On 11/28/12 02:10, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>> Forgive me if I'm being too repetitious as I think some of this has
>> been mentioned in the past.
>>
>> We (and by we I mean the Ethernet part and driver) can only change the
>> advertised availability of a larger MaxPayloadSize. The size is
>> negotiated by both sides of the link when the link is established. The
>> driver should not change the size of the link as it would be poking at
>> registers outside of its scope and is controlled by the upstream
>> bridge (not us).
> [...]
>
> MaxPayloadSize (MPS) is not negotiated between devices but is programmed
> by the system firmware (at least for devices present at boot - the
> kernel may be responsible in case of hotplug).  You can use the kernel
> parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
> that overrides this, but no policy will allow setting MPS above the
> device's MaxPayloadSizeSupported (MPSS).

Ben,

Unfortunately I'm using a 3.0.x kernel, which does not include this parameter.
So I'm trying to use ethtool to modify it in the EEPROM to see whether that helps.


Todd, I'll review MaxPayload for all devices. However, if there is a mismatch,
the customer cannot change it from the BIOS because there is no entry for it
there. We still need a way to verify whether this is the root cause, so I still
need to find the offset in the EEPROM.

Thanks in advance,
Joe


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-28 Thread Fujinaka, Todd
The only EEPROM I know about or can speak to is the one attached to the 82571 
and it doesn't set the MaxPayloadSize. That's done by the BIOS.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-28 Thread Ethan Zhao
Joe,
Possibly your customer is running a kernel without source code on
a platform whose vendor is unwilling to fix the BIOS issue (is that an
HP/Dell server?).
Anyway, to see whether it is a payload issue or not, you could change the
payload size on those devices with the setpci tool and set the link
retrain bit to trigger link retraining, to debug the issue and
identify the root cause.  I think this is much easier than modifying the
BIOS or the NIC's EEPROM.

e.g.
   Set the Device Control register to 000fh (MPS field 000b = 128-byte payload):
   #   setpci -v -s 00:02.0 98.w=000f
   Set the Link Control register to 60h (sets the Retrain Link bit):
   #  setpci -v -s 00:02.0 a0.b=60
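
Writing 000f to Device Control also overwrites every other field in that register, so a read-modify-write may be preferable. The sketch below only does the arithmetic: it assumes the standard PCIe layout in which Max_Payload_Size occupies bits 7:5 of Device Control and encodes 128 << n bytes. The function names are hypothetical, and the offset 98h in the usage comments is simply the one from the example above; it depends on where the device places its PCI Express capability.

```shell
#!/bin/sh
# Decode the Max_Payload_Size field (bits 7:5) of a Device Control value.
# $1 = register value, e.g. 0x000f
devctl_mps_bytes() {
    echo $(( 128 << ( ($1 >> 5) & 0x7 ) ))
}

# Print a new Device Control value (4 hex digits) with only the MPS field
# changed. $1 = current value (e.g. 0x2830), $2 = payload size in bytes.
# Returns 1 if $2 is not a power of two between 128 and 4096.
devctl_with_mps() {
    code=0
    size=128
    while [ "$size" -lt "$2" ] && [ "$code" -lt 6 ]; do
        size=$((size * 2))
        code=$((code + 1))
    done
    [ "$size" -eq "$2" ] && [ "$code" -le 5 ] || return 1
    printf '%04x\n' $(( ($1 & ~0xe0 & 0xffff) | ($code << 5) ))
}

# Possible use on a live system (as root), not run here:
#   cur=$(setpci -s 00:02.0 98.w)
#   setpci -s 00:02.0 98.w="$(devctl_with_mps "0x$cur" 128)"
```

Only bits 7:5 change, so Max_Read_Request_Size and the error-reporting enables keep whatever the firmware programmed.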

  Hope it works,  Just my 2 cents.

ethan.z...@oracle.com


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-27 Thread Fujinaka, Todd
Forgive me if I'm being too repetitious as I think some of this has been 
mentioned in the past.

We (and by we I mean the Ethernet part and driver) can only change the 
advertised availability of a larger MaxPayloadSize. The size is negotiated by 
both sides of the link when the link is established. The driver should not 
change the size of the link as it would be poking at registers outside of its 
scope and is controlled by the upstream bridge (not us).

You also need to check all the PCIe links to get to the device. There can be 
several to get from the root complex, through bridges, to the endpoint Ethernet 
controller. The Ethernet part and driver have no control over any other links. 
You'll have to talk to the motherboard manufacturer about those links.

Your original problem appears to be hangs and Tushar asked you to check the entire 
path of PCIe connections from the root complex to the endpoint. Any mismatches 
in payload can cause hangs and I believe you have had the problem in the past. 
I'm sure you remember all the lspci commands to list the tree view and to dump 
all the details from each of the links and I would suggest you do that to check 
to see that the payload sizes match. What I do is lspci -tvvv to see what's 
connected, then lspci -s xx:xx.x -vvv to check the devices on the link.
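
The comparison described above can be sketched as a small filter over lspci output. It assumes the usual `lspci -vv` layout, where the DevCap line carries "MaxPayload N bytes" (the supported size) and the DevCtl block carries a second "MaxPayload N bytes" line (the configured size). The function name mps_summary is made up for illustration.

```shell
#!/bin/sh
# Summarise MaxPayload per device from `lspci -vv` text on stdin.
# Prints one line per device: <bus:dev.fn> supported=<N> configured=<N>
mps_summary() {
    awk '
        # A device header line starts in column 1 with the bus address.
        /^[0-9a-f]/ { dev = $1; cap = "" }
        # DevCap: remember the supported payload size.
        /DevCap:/ && match($0, /MaxPayload [0-9]+/) {
            cap = substr($0, RSTART + 11, RLENGTH - 11)
            next
        }
        # The next MaxPayload line belongs to DevCtl: the configured size.
        cap != "" && match($0, /MaxPayload [0-9]+/) {
            print dev, "supported=" cap, "configured=" substr($0, RSTART + 11, RLENGTH - 11)
            cap = ""
        }
    '
}

# Possible use on a live system (as root, so capabilities are readable):
#   lspci -vv | mps_summary
```

Devices on the same path whose configured values differ are the candidates for the mismatch discussed in this thread.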

Thanks.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565


-----Original Message-----
From: Mary Mcgrath [mailto:mary.mcgr...@oracle.com] 
Sent: Monday, November 26, 2012 6:07 PM
To: Joe Jin
Cc: net...@vger.kernel.org; e1000-de...@lists.sf.net; 
linux-ker...@vger.kernel.org
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe
Thank you for working this.
I would love to find out how they expect a customer to make the modification to 
word 0x1A: check whether the 8th bit is 0 or 1, and change it to 0.

I have in turn asked the ct for the lspci command on eth3, maybe the incorrect 
setting is upstream.

Again,  thank you.
Regards
Mary



-----Original Message-----
From: Joe Jin
Sent: Monday, November 26, 2012 8:00 PM
To: Fujinaka, Todd
Cc: Dave, Tushar N; net...@vger.kernel.org; e1000-de...@lists.sf.net; 
linux-ker...@vger.kernel.org; Mary Mcgrath
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On 11/27/12 00:23, Fujinaka, Todd wrote:
> If you look at the previous section, DevCap, you'll see that it's
> correctly advertising 256 bytes but the system is negotiating 128 for
> the link to the Ethernet controller. Things on the other side of the
> link are controlled outside of the e1000 driver.
>
> Tushar's first suggestion was to check the PCIe payload settings in
> the entire chain. Have you done that? Mismatches will cause hangs.

Hi Todd,

I still need to know how to modify the MaxPayload size. Since the BIOS has no 
entry to change it, I have to use ethtool, so I need the offset of the 
MaxPayload size in the EEPROM. I tried to find it in Intel's online 
documentation but failed. Any ideas?

Thanks in advance,
Joe


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-27 Thread Fujinaka, Todd
Thanks for the clarification. I was just going by the PCIe spec, which says the 
lowest value of both ends is used, and I figured SOMETHING had to be looking at 
that and doing some sort of negotiation. I'm no BIOS guy, so I'm not sure 
what's actually going on, whether something walks the PCIe tree or if the BIOS 
just sets all the values to the minimum.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565


-----Original Message-----
From: Ben Hutchings [mailto:bhutchi...@solarflare.com] 
Sent: Tuesday, November 27, 2012 10:11 AM
To: Fujinaka, Todd; Mary Mcgrath
Cc: Joe Jin; net...@vger.kernel.org; e1000-de...@lists.sf.net; 
linux-ker...@vger.kernel.org; linux-pci
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
> Forgive me if I'm being too repetitious as I think some of this has 
> been mentioned in the past.
>
> We (and by we I mean the Ethernet part and driver) can only change the 
> advertised availability of a larger MaxPayloadSize. The size is 
> negotiated by both sides of the link when the link is established. The 
> driver should not change the size of the link as it would be poking at 
> registers outside of its scope and is controlled by the upstream 
> bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed by the 
system firmware (at least for devices present at boot - the kernel may be 
responsible in case of hotplug).  You can use the kernel parameter 
'pci=pcie_bus_perf' (or one of several others) to set a policy that overrides 
this, but no policy will allow setting MPS above the device's 
MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in
Documentation/kernel-parameters.txt!  Someone ought to fix that.)

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's 
the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-27 Thread Ben Hutchings
On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
 Forgive me if I'm being too repetitious as I think some of this has
 been mentioned in the past.
 
 We (and by we I mean the Ethernet part and driver) can only change the
 advertised availability of a larger MaxPayloadSize. The size is
 negotiated by both sides of the link when the link is established. The
 driver should not change the size of the link as it would be poking at
 registers outside of its scope and is controlled by the upstream
 bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed
by the system firmware (at least for devices present at boot - the
kernel may be responsible in case of hotplug).  You can use the kernel
parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
that overrides this, but no policy will allow setting MPS above the
device's MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in
Documentation/kernel-parameters.txt!  Someone ought to fix that.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
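
For readers following the MPS/MPSS distinction above, here is a minimal sketch of how the two fields are encoded. It assumes the standard PCI Express register layout (MPSS in Device Capabilities bits 2:0, MPS in Device Control bits 7:5, MaxReadReq in Device Control bits 14:12, each encoded as 128 << n bytes). The function names are made up for illustration and are not part of any kernel or pciutils API.

```python
# Sketch: decode the payload-size fields of the PCIe Device Capabilities
# and Device Control registers. All names here are illustrative only.

def payload_bytes(encoding: int) -> int:
    """Translate a 3-bit size encoding (0..5) into bytes: 0 -> 128, 1 -> 256, ..."""
    if not 0 <= encoding <= 5:
        raise ValueError("reserved encoding: %d" % encoding)
    return 128 << encoding


def decode_devcap_mpss(devcap: int) -> int:
    """MaxPayloadSizeSupported (MPSS) sits in DevCap bits 2:0."""
    return payload_bytes(devcap & 0x7)


def decode_devctl(devctl: int) -> tuple:
    """Return (MPS bytes, MaxReadReq bytes) from DevCtl bits 7:5 and 14:12."""
    mps = payload_bytes((devctl >> 5) & 0x7)
    mrrs = payload_bytes((devctl >> 12) & 0x7)
    return mps, mrrs


def mps_allowed(devcap: int, requested_bytes: int) -> bool:
    """A requested MPS is only valid if it does not exceed the device's MPSS."""
    return requested_bytes <= decode_devcap_mpss(devcap)


print(decode_devctl(0x581F))     # (128, 4096)
print(mps_allowed(0x1, 256))     # True: MPSS encoding 1 means 256 bytes
print(mps_allowed(0x1, 512))     # False: above MPSS
```

The value 0x581F is an example chosen to be consistent with the DevCtl lines quoted in this thread (all four error-report bits, RlxdOrd+, NoSnoop+, MaxPayload 128, MaxReadReq 4096). It was not read from the hardware.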




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-26 Thread Fujinaka, Todd
On Tue, 20 Nov 2012, Joe Jin wrote:

 On 11/20/12 16:59, Dave, Tushar N wrote:
 Have you powered off the system completely after modifying the EEPROM? If
 not, please do so.

 Hi Tushar,

 This does not seem to work for me. Would you please help check what is
 wrong with my steps?

...

 # lspci -s 0000:52:00.1 -vvv
 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
 Controller (rev 06)
 --snip--
   Capabilities: [e0] Express (v1) Endpoint, MSI 00
     DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
       ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
     DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
       RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
       MaxPayload 128 bytes, MaxReadReq 4096 bytes
       ^

 --snip--

If you look at the previous section, DevCap, you'll see that it's
correctly advertising 256 bytes but the system is negotiating 128 for
the link to the Ethernet controller. Things on the other side of the
link are controlled outside of the e1000 driver.

Tushar's first suggestion was to check the PCIe payload settings in the
entire chain. Have you done that? Mismatches will cause hangs.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565
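
The DevCap versus DevCtl comparison described above can be scripted. This is a minimal sketch, assuming the lspci text layout quoted in this thread (other pciutils versions may format these lines differently). `max_payload_from_lspci` is a hypothetical helper, not an existing tool.

```python
import re


def max_payload_from_lspci(text: str) -> dict:
    """Parse `lspci -vvv` text and return the supported and programmed
    MaxPayload sizes in bytes."""
    # DevCap advertises what the device supports.
    cap = re.search(r"DevCap:\s*MaxPayload\s+(\d+)\s+bytes", text)
    # In the DevCtl section the programmed value is printed next to MaxReadReq.
    ctl = re.search(r"MaxPayload\s+(\d+)\s+bytes,\s*MaxReadReq", text)
    if cap is None or ctl is None:
        raise ValueError("no PCIe DevCap/DevCtl payload fields found")
    return {"supported": int(cap.group(1)), "programmed": int(ctl.group(1))}


SAMPLE = """\
Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
        MaxPayload 128 bytes, MaxReadReq 4096 bytes
"""

result = max_payload_from_lspci(SAMPLE)
print(result)  # {'supported': 256, 'programmed': 128}
```

Running the same function over the lspci output of each bridge on the path would give the values needed to look for a mismatch.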




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-26 Thread Joe Jin
On 11/27/12 00:23, Fujinaka, Todd wrote:
 If you look at the previous section, DevCap, you'll see that it's
 correctly advertising 256 bytes but the system is negotiating 128 for
 the link to the Ethernet controller. Things on the other side of the
 link are controlled outside of the e1000 driver.
 
 Tushar's first suggestion was to check the PCIe payload settings in the
 entire chain. Have you done that? Mismatches will cause hangs.

Hi Todd,

I still need to find out how to modify the max payload size. The BIOS has no
option to change it, so I have to use ethtool, and for that I need the offset
of the MaxPayload size in the EEPROM. I tried to find it in Intel's online
documentation but failed. Any ideas?

Thanks in advance,
Joe



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-26 Thread Mary Mcgrath
Joe,
Thank you for working on this.
I would love to find out how they expect a customer to modify
word 0x1A: check whether bit 8 is 0 or 1, and change it to 0.

I have in turn asked the ct for the lspci output for eth3; maybe the incorrect
setting is upstream.

Again,  thank you.
Regards
Mary



-Original Message-
From: Joe Jin 
Sent: Monday, November 26, 2012 8:00 PM
To: Fujinaka, Todd
Cc: Dave, Tushar N; net...@vger.kernel.org; e1000-de...@lists.sf.net; 
linux-ker...@vger.kernel.org; Mary Mcgrath
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On 11/27/12 00:23, Fujinaka, Todd wrote:
 If you look at the previous section, DevCap, you'll see that it's 
 correctly advertising 256 bytes but the system is negotiating 128 for 
 the link to the Ethernet controller. Things on the other side of the 
 link are controlled outside of the e1000 driver.
 
 Tushar's first suggestion was to check the PCIe payload settings in 
 the entire chain. Have you done that? Mismatches will cause hangs.

Hi Todd,

I still need to find out how to modify the max payload size. The BIOS has no
option to change it, so I have to use ethtool, and for that I need the offset
of the MaxPayload size in the EEPROM. I tried to find it in Intel's online
documentation but failed. Any ideas?

Thanks in advance,
Joe



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-20 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Sunday, November 18, 2012 9:38 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org; Mary Mcgrath
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/16/12 04:26, Dave, Tushar N wrote:
 Would you please help to find the offset of max payload size in eeprom?
 I'd like to try modifying it with ethtool.

 It is defined using bit 8 of word 0x1A.
 Bit value 0 = 128B, bit value 1 = 256B

Hi Tushar,

I checked one of my servers whose Max Payload Size is 128:

# lspci -vvv -s 52:00.1
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
TAbort- MAbort- SERR- PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 266
Region 0: Memory at dfea (32-bit, non-prefetchable)
[size=128K]
Region 1: Memory at dfe8 (32-bit, non-prefetchable)
[size=128K]
Region 2: I/O ports at 6020 [size=32]
[virtual] Expansion ROM at d812 [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee0  Data: 409a
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
512ns, L1 64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr-
TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s,
Latency L0 4us, L1 64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+
RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+
NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap-
ChkEn-
Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-
86
Kernel driver in use: e1000e
Kernel modules: e1000e

And eeprom dump as below:

Offset  Values
--  --
0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0060  00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff


If I understand correctly, the value of word 0x1A is 0x07a6, so bit 8 is 1,
but my NIC's MPS is 128 bytes. Am I doing something wrong?

Have you powered off the system completely after modifying the EEPROM? If not,
please do so.
-Tushar 



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-20 Thread Joe Jin
On 11/20/12 16:59, Dave, Tushar N wrote:
 Have you powered off the system completely after modifying the EEPROM? If not,
 please do so.

Hi Tushar,

This does not seem to work for me. Would you please help check what is wrong
with my steps?

Original eeprom dump:

# ethtool -e eth3 | head -8
Offset  Values
--  --
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
                    ^
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s 0000:52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
--snip--
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
      ^
--snip--

# ethtool eth3
Settings for eth3:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x0007 (7)
Link detected: yes

# ethtool -E eth3 magic 0x10a48086 offset 0x34 value 0xa7
# ethtool -e eth3 | head -8
Offset  Values
--  --
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
                    ^ == a6 -> a7
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset  Values
--  --
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1 
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

# lspci -s 0000:52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
--snip--
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
      ^
    DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
      ClockPM- Surprise- LLActRep- BwNot-
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
      ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
--snip--

# ethtool -E eth3 magic 0x10a48086 offset 0x35 value 0x17

# ethtool -e eth3 | head -8
Offset  Values
--  --
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 17 03 84 83 07 00 00 03 c3 02 06
                       ^ == 07 -> 17
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset  Values
--  --
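
As a cross-check on the two writes above, the arithmetic can be sketched as follows. This assumes, as the 0x07a6 reading elsewhere in the thread does, that the `ethtool -e`/`-E` offsets are byte offsets and that each 16-bit EEPROM word is stored low byte first. The helper names are illustrative only. Under that assumption the write at offset 0x34 toggled bit 0 of word 0x1A and the write at offset 0x35 toggled bit 12, so neither touched bit 8. Whether changing bit 8 would alter the programmed MPS is a separate question that this sketch does not answer.

```python
WORD_INDEX = 0x1A  # EEPROM word named by Tushar
MPS_BIT = 8        # bit 8: 0 = 128B, 1 = 256B


def byte_write_for_bit(word_index: int, bit: int, word_value: int, set_bit: bool):
    """Return (byte_offset, new_byte_value) that changes one bit of a 16-bit word.

    Assumes little-endian storage: byte 2*i holds bits 7:0 of word i and
    byte 2*i + 1 holds bits 15:8.
    """
    byte_offset = 2 * word_index + bit // 8
    old_byte = (word_value >> (8 * (bit // 8))) & 0xFF
    mask = 1 << (bit % 8)
    new_byte = (old_byte | mask) if set_bit else (old_byte & ~mask & 0xFF)
    return byte_offset, new_byte


def changed_bits(old_word: int, new_word: int) -> list:
    """List the bit positions that differ between two 16-bit words."""
    diff = old_word ^ new_word
    return [b for b in range(16) if diff & (1 << b)]


offset, value = byte_write_for_bit(WORD_INDEX, MPS_BIT, 0x07A6, set_bit=False)
print(hex(offset), hex(value))        # 0x35 0x6
print(changed_bits(0x07A6, 0x07A7))   # [0]   (write at offset 0x34: a6 -> a7)
print(changed_bits(0x07A6, 0x17A6))   # [12]  (write at offset 0x35: 07 -> 17)
```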

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-18 Thread Joe Jin
On 11/16/12 04:26, Dave, Tushar N wrote:
 Would you please help to find the offset of max payload size in eeprom?
 I'd like to try modifying it with ethtool.

 It is defined using bit 8 of word 0x1A.
 Bit value 0 = 128B, bit value 1 = 256B

Hi Tushar,

I checked one of my servers whose Max Payload Size is 128:

# lspci -vvv -s 52:00.1
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
  Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
  Latency: 0, Cache Line Size: 64 bytes
  Interrupt: pin B routed to IRQ 266
  Region 0: Memory at dfea (32-bit, non-prefetchable) [size=128K]
  Region 1: Memory at dfe8 (32-bit, non-prefetchable) [size=128K]
  Region 2: I/O ports at 6020 [size=32]
  [virtual] Expansion ROM at d812 [disabled] [size=128K]
  Capabilities: [c8] Power Management version 2
    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
  Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Address: fee0  Data: 409a
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
      ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
      RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      MaxPayload 128 bytes, MaxReadReq 4096 bytes
    DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
      ClockPM- Surprise- LLActRep- BwNot-
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
      ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
  Capabilities: [100 v1] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
    CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
    AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
  Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-86
  Kernel driver in use: e1000e
  Kernel modules: e1000e

And eeprom dump as below:

Offset  Values
--  --
0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1 
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0060  00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff 


If I understand correctly, the value of word 0x1A is 0x07a6, so bit 8 is 1,
but my NIC's MPS is 128 bytes. Am I doing something wrong?

Thanks,
Joe
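
The reading above can be reproduced mechanically. This is a minimal sketch, assuming the dump offsets are byte offsets and that words are little-endian (word 0x1A = bytes 0x34 and 0x35). `parse_eeprom_dump` and `read_word` are hypothetical helpers, and the first row's offset is written out as 0x0000.

```python
def parse_eeprom_dump(text: str) -> bytes:
    """Collect the data bytes from `ethtool -e` style rows ("0x0030  f6 6c ...")."""
    data = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and separator rows; keep only rows starting with an offset.
        if not fields or not fields[0].lower().startswith("0x"):
            continue
        base = int(fields[0], 16)
        for i, tok in enumerate(fields[1:]):
            data[base + i] = int(tok, 16)
    # Assumes the rows are contiguous from offset 0.
    return bytes(data[i] for i in range(max(data) + 1))


def read_word(eeprom: bytes, word_index: int) -> int:
    """Read 16-bit word i from bytes 2i and 2i+1, low byte first."""
    return int.from_bytes(eeprom[2 * word_index:2 * word_index + 2], "little")


DUMP = """\
Offset  Values
0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
"""

word = read_word(parse_eeprom_dump(DUMP), 0x1A)
bit8 = (word >> 8) & 1
print(hex(word), bit8)  # 0x7a6 1
```

With this dump bit 8 comes out as 1, which matches the reading in the message and corresponds to 256B in the encoding given by Tushar.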




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Joe Jin
On 11/09/12 04:35, Dave, Tushar N wrote:
 All devices in path from root complex to 82571, should have *same* max 
 payload size otherwise it can cause hang. 
 Can you double check this?

Hi Tushar,

I checked with the hardware vendor, and they said there is no way to modify the
max payload size from the BIOS. Can I modify it from the driver side?

Thanks,
Joe



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Li Yu
On 2012-11-09 04:35, Dave, Tushar N wrote:
 -Original Message-
 From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
 On Behalf Of Joe Jin
 Sent: Wednesday, November 07, 2012 10:25 PM
 To: e1000-de...@lists.sf.net
 Cc: net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
 Subject: 82571EB: Detected Hardware Unit Hang

 Hi list,

 IHAC reported 82571EB Detected Hardware Unit Hang on HP ProLiant DL360
 G6, and have to reboot the server to recover:

 e1000e 0000:06:00.1: eth3: Detected Hardware Unit Hang:
   TDH  1a
   TDT  1a
   next_to_use  1a
   next_to_clean18
 buffer_info[next_to_clean]:
   time_stamp   10047a74e
   next_to_watch18
   jiffies  10047a88c
   next_to_watch.status 1
 MAC Status 80383
 PHY Status 792d
 PHY 1000BASE-T Status  3800
 PHY Extended Status3000
 PCI Status 10

 With newer kernel 2.0.0.1 the issue still reproducible.

 Device info:
 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
 Controller (Copper) (rev 06)
 06:00.1 0200: 8086:10bc (rev 06)

 I compared lspci output before and after the issue, different as below:
 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
 Controller (Copper) (rev 06)
  Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
 Gigabit Server Adapter
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
 Stepping- SERR- FastB2B- DisINTx-
 -Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
 TAbort- MAbort- SERR- PERR- INTx-
 +Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
 +TAbort- MAbort- SERR- PERR- INTx+

 Are you sure this is not similar issue as before that you reported.
 i.e.
 On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
 I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
 doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
 just copy a big file (500M) from another server will hit it at once.

 All devices in path from root complex to 82571, should have *same* max 
 payload size otherwise it can cause hang.
 Can you double check this?


We also see this kind of hang on the 82599EB (ixgbe driver) with the RHEL 6.3
kernel. We tried upgrading to the latest version (3.8.21 or 3.10.17),
but it still happens.

Could it also be caused by a wrong max payload size set in the BIOS?

Thanks

Yu

 -Tushar





Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, November 13, 2012 6:48 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org; Mary Mcgrath
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/09/12 04:35, Dave, Tushar N wrote:
 All devices in path from root complex to 82571, should have *same* max
payload size otherwise it can cause hang.
 Can you double check this?

Hi Tushar,

I checked with the hardware vendor, and they said there is no way to modify the
max payload size from the BIOS. Can I modify it from the driver side?

If you want to change the value for the 82571 device, you can do it from the
EEPROM, but for other upstream devices I am not sure. I will check with my team.

-Tushar




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-08 Thread Dave, Tushar N
-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Joe Jin
Sent: Wednesday, November 07, 2012 10:25 PM
To: e1000-de...@lists.sf.net
Cc: net...@vger.kernel.org; linux-ker...@vger.kernel.org; Mary Mcgrath
Subject: 82571EB: Detected Hardware Unit Hang

Hi list,

IHAC reported 82571EB Detected Hardware Unit Hang on HP ProLiant DL360
G6, and have to reboot the server to recover:

e1000e 0000:06:00.1: eth3: Detected Hardware Unit Hang:
  TDH  1a
  TDT  1a
  next_to_use  1a
  next_to_clean18
buffer_info[next_to_clean]:
  time_stamp   10047a74e
  next_to_watch18
  jiffies  10047a88c
  next_to_watch.status 1
MAC Status 80383
PHY Status 792d
PHY 1000BASE-T Status  3800
PHY Extended Status3000
PCI Status 10

With the newer kernel 2.0.0.1 the issue is still reproducible.

Device info:
06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
06:00.1 0200: 8086:10bc (rev 06)

I compared lspci output before and after the issue; the differences are below:
 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
   Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
Gigabit Server Adapter
   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
Stepping- SERR- FastB2B- DisINTx-
-  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
TAbort- MAbort- SERR- PERR- INTx-
+  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
+TAbort- MAbort- SERR- PERR- INTx+

Are you sure this is not similar to the issue you reported before?
i.e.
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
 I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
 doing an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2;
 just copying a big file (500M) from another server will hit it at once.

All devices in the path from the root complex to the 82571 should have the
*same* max payload size; otherwise it can cause a hang.
Can you double check this?

-Tushar
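
One way to express the check being asked for is to collect the programmed MaxPayload of every device on the path and flag any that differ. This is a minimal sketch. The bus addresses of the upstream devices and all payload sizes in the example are hypothetical, and the function names are illustrative only.

```python
def common_mps(path):
    """Smallest programmed MPS on the path, in bytes."""
    return min(mps for _, mps in path)


def find_mps_mismatches(path):
    """Return the devices whose programmed MPS differs from the path minimum.

    `path` is a list of (device, mps_bytes) pairs ordered from the root
    port down to the endpoint.
    """
    floor = common_mps(path)
    return [dev for dev, mps in path if mps != floor]


# Hypothetical path: root port, intermediate bridge, then the NIC.
path = [("00:07.0", 256), ("04:00.0", 256), ("06:00.1", 128)]
print(common_mps(path))            # 128
print(find_mps_mismatches(path))   # ['00:07.0', '04:00.0']
```

An empty result means every device on the path is programmed with the same value.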



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-11-08 Thread Joe Jin
On 11/09/12 04:35, Dave, Tushar N wrote:
 Are you sure this is not similar issue as before that you reported.
 i.e. 

Tushar,

Thanks for your quick response. I'll check with the customer whether they can
modify the max payload size from the BIOS. This time the issue hit an HP server.

Thanks again,
Joe

 On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
  I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when 
  doing an scp test. This issue is easy to reproduce on SUN FIRE X2270 M2: 
  just copying a big file (500M) from another server hits it at once.
 All devices in the path from the root complex to the 82571 should have the *same* max 
 payload size; otherwise it can cause a hang. 
 Can you double-check this?
 


-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 



Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-29 Thread Andrew Peng
This is the output:

~$ sudo ethtool -S eth1 | grep tx_timeout_count
 tx_timeout_count: 0
~$


I will try the new driver, but this is a production server. I don't have any
actual problems with the NIC, but I do keep seeing the hardware hang
message pop up in the logs. When I can take the server down for routine
maintenance I will get the new driver in and report back.

Thank you all for the help.

--Andrew

On Fri, Aug 24, 2012 at 2:39 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:

  You are right that the driver only dumps the HW ring if the adapter resets.
 However, in the case of a *true* tx hang, the driver should hit tx_timeout,
 which will reset the adapter, and if msglvl is set correctly it will dump
 the HW ring.

 If you're not seeing tx_timeout I believe it's a false tx hang. Check
 with 'ethtool -S ethx | grep tx_timeout_count'.

 -Tushar

 PS: I would suggest try latest e1000e driver

 *From:* Andrew Peng [mailto:peng...@gmail.com]
 *Sent:* Friday, August 24, 2012 10:29 AM

 *To:* Dave, Tushar N
 *Cc:* e1000-devel@lists.sourceforge.net
 *Subject:* Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang


 Hi, in regards to the ring dump, this is the response I received from the
 Debian kernel team:

 **
 The ring dump is only shown in case the driver resets the chip, and it
 doesn't do that in the case of Hardware Unit Hang.  So I think whichever
 developer told you this was confused.
 **

 I haven't gotten to using the new driver, but when I do I'll report back.

 --Andrew

 On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N tushar.n.d...@intel.com
 wrote:

 In that case, you can use our e1000e outbox driver from Sourceforge (which
 should have patches mentioned by Flavio).

 -Tushar


 -Original Message-
 From: Flavio Leitner [mailto:f...@redhat.com]
 Sent: Thursday, July 19, 2012 6:39 PM
 To: Andrew Peng
 Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
 Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
 
 On Thu, 19 Jul 2012 20:17:14 -0500
 Andrew Peng peng...@gmail.com wrote:
 
  Flavio;
 
  I am using the stock kernel driver with the stock Debian Squeeze kernel.
 
 
 Well, I don't have the debian kernel sources handy to check, but based on
 the version 2.6.32-5-amd64, It sounds like you don't have.
 
 I pointed that patch because your card supports the write-back feature and
 TDT and TDH are close to each other, less than 4, which is a signature of
 the bug fixed by the first patch.
 
 fbl
 
  Tushar;
 
  I've double checked that the message level is set correctly:
  Current message level: 0x2c01 (11265)
  Link detected: yes
 
  However, I just checked all of the logs on the server and I do not see
  a HW ring dump.
 
  Thanks all again for help
 
 
  --Andrew
 
  On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N
 tushar.n.d...@intel.com wrote:
   Andrew,
  
   I don't think current message level set correctly.
   Have you ran 'ethtool -s ethx msglvl 0x2c01'
   I don't see HW ring dump in the log.
   Please confirm that msglvl is set correctly by running 'ethtool ethx'
  
   -Tushar
  
  
  
  -Original Message-
  From: Andrew Peng [mailto:peng...@gmail.com]
  Sent: Thursday, July 19, 2012 4:42 PM
  To: Dave, Tushar N
  Cc: e1000-devel@lists.sourceforge.net
  Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
  
  Attached is the dmesg output. Please let me know if this looks right.
  There are two instances of the error here:
  
  [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit
 Hang:
  [361106.726604]   TDH  c5
  [361106.726606]   TDT  c7
  [361106.726607]   next_to_use  c7
  [361106.726608]   next_to_cleanc5
  [361106.726609] buffer_info[next_to_clean]:
  [361106.726610]   time_stamp   105605cd5
  [361106.726611]   next_to_watchc5
  [361106.726612]   jiffies  105605e51
  [361106.726614]   next_to_watch.status 0
  [361106.726615] MAC Status 80383
  [361106.726616] PHY Status 792d
  [361106.726617] PHY 1000BASE-T Status  3800
  [361106.726618] PHY Extended Status3000
  [361106.726619] PCI Status 10
  
  [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit
 Hang:
  [411932.038651]   TDH  3d
  [411932.038652]   TDT  3f
  [411932.038653]   next_to_use  3f
  [411932.038654]   next_to_clean3d
  [411932.038655] buffer_info[next_to_clean]:
  [411932.038657]   time_stamp   106223f55
  [411932.038658]   next_to_watch3d
  [411932.038659]   jiffies  106224069
  [411932.038660]   next_to_watch.status 0
  [411932.038661] MAC Status 80383
  [411932.038662] PHY Status 792d
  [411932.038663] PHY 1000BASE-T Status  3800
  [411932.038664] PHY Extended Status3000
  [411932.038665] PCI Status 10
  [422584.120473

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-29 Thread Dave, Tushar N
Andrew,

There is no tx_timeout, so as I mentioned in my previous email this is a false 
hang. If the issue persists with the latest driver, let me know and I will look into it.
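
The same check can be scripted. This is only a sketch: it assumes the text of 'ethtool -S ethx' has already been captured into a string, and the helper names and the wording of the returned labels are made up for illustration.

```python
import re


def tx_timeout_count(ethtool_stats):
    """Extract tx_timeout_count from 'ethtool -S' output, or None if absent."""
    match = re.search(r"^\s*tx_timeout_count:\s*(\d+)\s*$",
                      ethtool_stats, re.MULTILINE)
    return int(match.group(1)) if match else None


def classify_hang(ethtool_stats):
    """Apply the rule of thumb from this thread to the captured statistics."""
    count = tx_timeout_count(ethtool_stats)
    if count is None:
        return "unknown: tx_timeout_count not found"
    if count == 0:
        return "no tx_timeout recorded: likely a false hang"
    return f"{count} tx_timeout event(s): adapter was reset"


# Sample taken from the output posted in this thread.
sample = " tx_timeout_count: 0\n"
print(classify_hang(sample))
```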

-Tushar


From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, August 29, 2012 11:41 AM
To: Dave, Tushar N
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

This is the output:

~$ sudo ethtool -S eth1 | grep tx_timeout_count
 tx_timeout_count: 0
~$


I will try the new driver, but this is a production server. I don't have any actual 
problems with the NIC, but I do keep seeing the hardware hang message pop up in 
the logs. When I can take the server down for routine maintenance I will get 
the new driver in and report back.

Thank you all for the help.

--Andrew

On Fri, Aug 24, 2012 at 2:39 PM, Dave, Tushar N 
tushar.n.d...@intel.com wrote:
You are right that the driver only dumps the HW ring if the adapter resets. However, 
in the case of a *true* tx hang, the driver should hit tx_timeout, which will reset 
the adapter, and if msglvl is set correctly it will dump the HW ring.
If you're not seeing tx_timeout I believe it's a false tx hang. Check with 
'ethtool -S ethx | grep tx_timeout_count'.

-Tushar
PS: I would suggest trying the latest e1000e driver.
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Friday, August 24, 2012 10:29 AM

To: Dave, Tushar N
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Hi, in regards to the ring dump, this is the response I received from the 
Debian kernel team:

**
The ring dump is only shown in case the driver resets the chip, and it
doesn't do that in the case of Hardware Unit Hang.  So I think whichever
developer told you this was confused.
**

I haven't gotten to using the new driver, but when I do I'll report back.

--Andrew
On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N 
tushar.n.d...@intel.com wrote:
In that case, you can use our e1000e outbox driver from Sourceforge (which 
should have patches mentioned by Flavio).

-Tushar

-Original Message-
From: Flavio Leitner [mailto:f...@redhat.com]
Sent: Thursday, July 19, 2012 6:39 PM
To: Andrew Peng
Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

On Thu, 19 Jul 2012 20:17:14 -0500
Andrew Peng peng...@gmail.com wrote:

 Flavio;

 I am using the stock kernel driver with the stock Debian Squeeze kernel.


Well, I don't have the Debian kernel sources handy to check, but based on
the version 2.6.32-5-amd64, it sounds like you don't have them.

I pointed to that patch because your card supports the write-back feature and
TDT and TDH are close to each other (less than 4 apart), which is a signature of
the bug fixed by the first patch.
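
To make the "less than 4" observation concrete, here is a small Python sketch that computes how many descriptors sit between TDH and TDT in each hang report posted in this thread. The ring size of 256 is an assumption (the thread does not state the configured Tx ring size for this adapter), and pending_descriptors is a hypothetical helper, not driver code.

```python
def pending_descriptors(tdh_hex, tdt_hex, ring_size=256):
    """Distance from head (TDH) to tail (TDT) on a circular Tx ring.

    ring_size=256 is an assumption; pass the real Tx ring size if known.
    """
    tdh = int(tdh_hex, 16)
    tdt = int(tdt_hex, 16)
    return (tdt - tdh) % ring_size


# (TDH, TDT) pairs from the three hang reports in the posted dmesg output.
reports = [("c5", "c7"), ("3d", "3f"), ("15", "16")]

for tdh, tdt in reports:
    gap = pending_descriptors(tdh, tdt)
    note = "matches the signature" if gap < 4 else "does not match"
    print(f"TDH={tdh} TDT={tdt} gap={gap} ({note})")
```

The modulo keeps the result correct when the tail has wrapped past the end of the ring while the head has not.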

fbl

 Tushar;

 I've double checked that the message level is set correctly:
 Current message level: 0x2c01 (11265)
 Link detected: yes

 However, I just checked all of the logs on the server and I do not see
 a HW ring dump.

 Thanks all again for help


 --Andrew

 On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N
tushar.n.d...@intel.com wrote:
  Andrew,
 
  I don't think the current message level is set correctly.
  Have you run 'ethtool -s ethx msglvl 0x2c01'?
  I don't see the HW ring dump in the log.
  Please confirm that msglvl is set correctly by running 'ethtool ethx'.
 
  -Tushar
 
 
 
 -Original Message-
 From: Andrew Peng [mailto:peng...@gmail.com]
 Sent: Thursday, July 19, 2012 4:42 PM
 To: Dave, Tushar N
 Cc: e1000-devel@lists.sourceforge.net
 Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
 
 Attached is the dmesg output. Please let me know if this looks right.
 There are two instances of the error here:
 
 [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit
Hang:
 [361106.726604]   TDH  c5
 [361106.726606]   TDT  c7
 [361106.726607]   next_to_use  c7
 [361106.726608]   next_to_cleanc5
 [361106.726609] buffer_info[next_to_clean]:
 [361106.726610]   time_stamp   105605cd5
 [361106.726611]   next_to_watchc5
 [361106.726612]   jiffies  105605e51
 [361106.726614]   next_to_watch.status 0
 [361106.726615] MAC Status 80383
 [361106.726616] PHY Status 792d
 [361106.726617] PHY 1000BASE-T Status  3800
 [361106.726618] PHY Extended Status3000
 [361106.726619] PCI Status 10
 
 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit
Hang:
 [411932.038651]   TDH  3d
 [411932.038652]   TDT  3f
 [411932.038653]   next_to_use  3f

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-28 Thread Nikolay Popov
29.08.2012 6:29, Dave, Tushar N wrote:
 Thanks for the info.
 For both 82571 and 80003ES2LAN, I see UnsuppReq+ and UncorrErr+ in lspci
 (DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend+)

 Have you tried disabling tso (ethtool -K ethx tso off)?
Yes, this doesn't help.

 Was this working okay before with the old driver or old kernel?

At least on 3.3.6 I don't see these warning messages in syslog.

Regards, Nikolay



--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-28 Thread Dave, Tushar N
-Original Message-
From: Nikolay Popov [mailto:niko...@popoff.net.ua]
Sent: Tuesday, August 28, 2012 9:00 PM
To: Dave, Tushar N
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

29.08.2012 6:29, Dave, Tushar N wrote:
 Have you tried disabling tso (ethtool -K ethx tso off)?
I also tried recompiling the driver with DISABLE_PM, disabling gro and other
offload types, booting the kernel with acpi_aspm=off, increasing ring buffers to
4096, and playing around with flow control - nothing helped.

Okay, thanks for the info.
I will check the changes that went into the e1000e driver since 3.3.6 then.
Also, would you please run 'ethtool -s ethx msglvl 0x2c01' so that the next time 
a tx hang occurs it will log the hw desc ring info? Send me the full dmesg log 
once the issue occurs. 
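
For reference, the 0x2c01 value is a bitmask of the kernel's NETIF_MSG_* message classes. The Python sketch below decodes it using the bit values from include/linux/netdevice.h and the lowercase names ethtool uses; decode_msglvl is just an illustrative helper, and which of these bits the driver consults before dumping the ring is not stated in this thread.

```python
# NETIF_MSG_* bit values as defined in include/linux/netdevice.h,
# keyed to the lowercase names that ethtool uses for msglvl flags.
NETIF_MSG_BITS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(level):
    """Return the names of all message classes enabled in a msglvl bitmask."""
    return [name for bit, name in sorted(NETIF_MSG_BITS.items())
            if level & bit]


print(decode_msglvl(0x2C01))  # ['drv', 'tx_done', 'rx_status', 'hw']
print(0x2C01)                 # 11265, as 'ethtool ethx' reports it in decimal
```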

-Tushar


Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-28 Thread Nikolay Popov
Hi, Dave!

OK, I have set msglevel as you requested; let's wait for some logs.
Also, about versions: we are using 1.11.3-NAPI on both 3.3.6 and 3.5.2 hosts.
We were forced to do that because with the default kernel driver (at least 2.0.0 
on 3.5.2) we see some mysterious drops and delays (~1-2%, and delays up to 
2000ms) that appear once every few minutes. Downgrading the driver to 1.11.3-NAPI 
solves this issue (which we'll discuss in a separate topic, I suppose), but with 
this driver version we're running into the TX hang trouble we're trying to track down now.
I can't test whether this problem appears in 2.x.x driver versions because the hosts are 
in production and such delays/losses aren't acceptable at all.

Regards, Nikolay




Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-27 Thread Dave, Tushar N
-Original Message-
From: Nikolay Popov [mailto:niko...@popoff.net.ua]
Sent: Saturday, August 25, 2012 1:29 AM
To: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Hi, All

It seems that I'm getting the same problem with the 3.5.2 kernel - the 80003ES2LAN
onboard NIC resets from time to time under load.

Aug 25 10:27:53 bras2 kernel: [134612.808590] e1000e :05:00.0: eth2:
Detected Hardware Unit Hang:
Aug 25 10:27:53 bras2 kernel: [134612.808590]   TDH cd
Aug 25 10:27:53 bras2 kernel: [134612.808590]   TDT b9
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_use b9
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_clean cc
Aug 25 10:27:53 bras2 kernel: [134612.808590] buffer_info[next_to_clean]:
Aug 25 10:27:53 bras2 kernel: [134612.808590]   time_stamp 1020057ff
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_watch cf
Aug 25 10:27:53 bras2 kernel: [134612.808590]   jiffies 102005cda
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_watch.status 0
Aug 25 10:27:53 bras2 kernel: [134612.808590] MAC Status 2080783
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Status 792d
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY 1000BASE-T Status 7800
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Extended Status 3000
Aug 25 10:27:53 bras2 kernel: [134612.808590] PCI Status 10
Aug 25 10:27:55 bras2 kernel: [134614.816086] e1000e :05:00.0: eth2:
Reset adapter
Aug 25 10:27:58 bras2 kernel: [134617.654599] e1000e: eth2 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: Rx

Please send the full dmesg log and the 'ethtool -S ethx' output after the issue occurs.

-Tushar



root@bras2:~# ethtool -i eth2
driver: e1000e
version: 1.11.3-NAPI
firmware-version: 1.0-0
bus-info: :05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

root@bras2:~# lspci | grep 05:00.0
05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)

Mainboard: Intel S5000PAL

I had to fall back to the 1.11.3-NAPI driver version because with the in-kernel
2.0.0 driver (and also with 2.0.0.1 from sf.net) there were a lot of random
packet drops and latency spikes, so 1.11.3 is more acceptable for
production.
During a reset, traffic stops flowing and iowait increases up to 100%; then the
link flaps and everything returns to normal until the next reset, which could
happen in 1 hour or in 1 day. I also noticed that the resets don't correlate
with traffic load. They can happen even when the NIC is almost idle, transferring
~30-40 Mbps.

Is there anything we can do to fix this issue?

Regards, Nikolay




Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-25 Thread Nikolay Popov
Hi, All

It seems that I'm getting the same problem with the 3.5.2 kernel - the 80003ES2LAN 
onboard NIC resets from time to time under load.

Aug 25 10:27:53 bras2 kernel: [134612.808590] e1000e :05:00.0: eth2: 
Detected Hardware Unit Hang:
Aug 25 10:27:53 bras2 kernel: [134612.808590]   TDH cd
Aug 25 10:27:53 bras2 kernel: [134612.808590]   TDT b9
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_use b9
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_clean cc
Aug 25 10:27:53 bras2 kernel: [134612.808590] buffer_info[next_to_clean]:
Aug 25 10:27:53 bras2 kernel: [134612.808590]   time_stamp 1020057ff
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_watch cf
Aug 25 10:27:53 bras2 kernel: [134612.808590]   jiffies 102005cda
Aug 25 10:27:53 bras2 kernel: [134612.808590]   next_to_watch.status 0
Aug 25 10:27:53 bras2 kernel: [134612.808590] MAC Status 2080783
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Status 792d
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY 1000BASE-T Status 7800
Aug 25 10:27:53 bras2 kernel: [134612.808590] PHY Extended Status 3000
Aug 25 10:27:53 bras2 kernel: [134612.808590] PCI Status 10
Aug 25 10:27:55 bras2 kernel: [134614.816086] e1000e :05:00.0: eth2: 
Reset adapter
Aug 25 10:27:58 bras2 kernel: [134617.654599] e1000e: eth2 NIC Link is 
Up 1000 Mbps Full Duplex, Flow Control: Rx


root@bras2:~# ethtool -i eth2
driver: e1000e
version: 1.11.3-NAPI
firmware-version: 1.0-0
bus-info: :05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

root@bras2:~# lspci | grep 05:00.0
05:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit 
Ethernet Controller (Copper) (rev 01)

Mainboard: Intel S5000PAL

I had to fall back to the 1.11.3-NAPI driver version because with the in-kernel 
2.0.0 driver (and also with 2.0.0.1 from sf.net) there were a lot of random 
packet drops and latency spikes, so 1.11.3 is more acceptable for 
production.
During a reset, traffic stops flowing and iowait increases up to 100%; then the 
link flaps and everything returns to normal until the next reset, which could 
happen in 1 hour or in 1 day. I also noticed that the resets don't correlate 
with traffic load. They can happen even when the NIC is almost idle, transferring 
~30-40 Mbps.

Is there anything we can do to fix this issue?

Regards, Nikolay





Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-08-24 Thread Andrew Peng
Hi, in regards to the ring dump, this is the response I received from the
Debian kernel team:

**
The ring dump is only shown in case the driver resets the chip, and it
doesn't do that in the case of Hardware Unit Hang.  So I think whichever
developer told you this was confused.
**

I haven't gotten to using the new driver, but when I do I'll report back.

--Andrew

On Thu, Jul 19, 2012 at 9:20 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:

 In that case, you can use our e1000e outbox driver from Sourceforge (which
 should have patches mentioned by Flavio).

 -Tushar

 -Original Message-
 From: Flavio Leitner [mailto:f...@redhat.com]
 Sent: Thursday, July 19, 2012 6:39 PM
 To: Andrew Peng
 Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
 Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
 
 On Thu, 19 Jul 2012 20:17:14 -0500
 Andrew Peng peng...@gmail.com wrote:
 
  Flavio;
 
  I am using the stock kernel driver with the stock Debian Squeeze kernel.
 
 
 Well, I don't have the debian kernel sources handy to check, but based on
 the version 2.6.32-5-amd64, It sounds like you don't have.
 
 I pointed that patch because your card supports the write-back feature and
 TDT and TDH are close to each other, less than 4, which is a signature of
 the bug fixed by the first patch.
 
 fbl
 
  Tushar;
 
  I've double checked that the message level is set correctly:
  Current message level: 0x2c01 (11265)
  Link detected: yes
 
  However, I just checked all of the logs on the server and I do not see
  a HW ring dump.
 
  Thanks all again for help
 
 
  --Andrew
 
  On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N
 tushar.n.d...@intel.com wrote:
   Andrew,
  
   I don't think current message level set correctly.
   Have you ran 'ethtool -s ethx msglvl 0x2c01'
   I don't see HW ring dump in the log.
   Please confirm that msglvl is set correctly by running 'ethtool ethx'
  
   -Tushar
  
  
  
  -Original Message-
  From: Andrew Peng [mailto:peng...@gmail.com]
  Sent: Thursday, July 19, 2012 4:42 PM
  To: Dave, Tushar N
  Cc: e1000-devel@lists.sourceforge.net
  Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
  
  Attached is the dmesg output. Please let me know if this looks right.
  There are two instances of the error here:
  
  [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit
 Hang:
  [361106.726604]   TDH  c5
  [361106.726606]   TDT  c7
  [361106.726607]   next_to_use  c7
  [361106.726608]   next_to_cleanc5
  [361106.726609] buffer_info[next_to_clean]:
  [361106.726610]   time_stamp   105605cd5
  [361106.726611]   next_to_watchc5
  [361106.726612]   jiffies  105605e51
  [361106.726614]   next_to_watch.status 0
  [361106.726615] MAC Status 80383
  [361106.726616] PHY Status 792d
  [361106.726617] PHY 1000BASE-T Status  3800
  [361106.726618] PHY Extended Status3000
  [361106.726619] PCI Status 10
  
  [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit
 Hang:
  [411932.038651]   TDH  3d
  [411932.038652]   TDT  3f
  [411932.038653]   next_to_use  3f
  [411932.038654]   next_to_clean3d
  [411932.038655] buffer_info[next_to_clean]:
  [411932.038657]   time_stamp   106223f55
  [411932.038658]   next_to_watch3d
  [411932.038659]   jiffies  106224069
  [411932.038660]   next_to_watch.status 0
  [411932.038661] MAC Status 80383
  [411932.038662] PHY Status 792d
  [411932.038663] PHY 1000BASE-T Status  3800
  [411932.038664] PHY Extended Status3000
  [411932.038665] PCI Status 10
  [422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit
 Hang:
  [422584.120475]   TDH  15
  [422584.120477]   TDT  16
  [422584.120478]   next_to_use  16
  [422584.120479]   next_to_clean15
  [422584.120480] buffer_info[next_to_clean]:
  [422584.120481]   time_stamp   1064ae19c
  [422584.120483]   next_to_watch15
  [422584.120484]   jiffies  1064ae2d6
  [422584.120485]   next_to_watch.status 0
  [422584.120486] MAC Status 80383
  [422584.120487] PHY Status 792d
  [422584.120488] PHY 1000BASE-T Status  3800
  [422584.120489] PHY Extended Status3000
  [422584.120491] PCI Status 10
  
  Thank you again for all the help
  
  
  --Andrew
  
  
  
  On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N
  tushar.n.d...@intel.com
  wrote:
   We can find the reason now.
   Please enable TSO back.
   Then run ethtool -s ethx msglvl 0x2c01. This will enable debug
   code
  that logs HW ring data (into dmesg log) when Tx hang occurs. When
  issue occur next time please send me the full dmesg log.
  
   -Tushar
  
  -Original Message-
  From: Andrew Peng

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Andrew Peng
Attached is the dmesg output. Please let me know if this looks right.
There are three instances of the error here:

[361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[361106.726604]   TDH                  c5
[361106.726606]   TDT                  c7
[361106.726607]   next_to_use          c7
[361106.726608]   next_to_clean        c5
[361106.726609] buffer_info[next_to_clean]:
[361106.726610]   time_stamp           105605cd5
[361106.726611]   next_to_watch        c5
[361106.726612]   jiffies              105605e51
[361106.726614]   next_to_watch.status 0
[361106.726615] MAC Status             80383
[361106.726616] PHY Status             792d
[361106.726617] PHY 1000BASE-T Status  3800
[361106.726618] PHY Extended Status    3000
[361106.726619] PCI Status             10

[411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[411932.038651]   TDH                  3d
[411932.038652]   TDT                  3f
[411932.038653]   next_to_use          3f
[411932.038654]   next_to_clean        3d
[411932.038655] buffer_info[next_to_clean]:
[411932.038657]   time_stamp           106223f55
[411932.038658]   next_to_watch        3d
[411932.038659]   jiffies              106224069
[411932.038660]   next_to_watch.status 0
[411932.038661] MAC Status             80383
[411932.038662] PHY Status             792d
[411932.038663] PHY 1000BASE-T Status  3800
[411932.038664] PHY Extended Status    3000
[411932.038665] PCI Status             10

[422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[422584.120475]   TDH                  15
[422584.120477]   TDT                  16
[422584.120478]   next_to_use          16
[422584.120479]   next_to_clean        15
[422584.120480] buffer_info[next_to_clean]:
[422584.120481]   time_stamp           1064ae19c
[422584.120483]   next_to_watch        15
[422584.120484]   jiffies              1064ae2d6
[422584.120485]   next_to_watch.status 0
[422584.120486] MAC Status             80383
[422584.120487] PHY Status             792d
[422584.120488] PHY 1000BASE-T Status  3800
[422584.120489] PHY Extended Status    3000
[422584.120491] PCI Status             10
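
For readers who want to pull numbers out of a dump like this, here is a rough Python sketch. It assumes each field name and its hex value are separated by whitespace, and it computes how many jiffies passed between the buffer's time_stamp and the moment the hang was reported. The HZ value used to convert to seconds is an assumption (250 is a common x86 configuration; the thread does not state it), and both helper names are hypothetical.

```python
import re


def parse_hang_dump(lines):
    """Turn 'name  hexvalue' lines of a Unit Hang report into a dict of ints."""
    fields = {}
    for line in lines:
        # Drop the leading '[seconds.micros]' dmesg timestamp, if present.
        text = re.sub(r"^\s*\[\s*\d+\.\d+\]\s*", "", line).strip()
        match = re.match(r"(.+?)\s+([0-9a-fA-F]+)$", text)
        if match:
            fields[match.group(1).strip()] = int(match.group(2), 16)
    return fields


def stalled_jiffies(fields):
    """Jiffies elapsed between the descriptor's time_stamp and the report."""
    return fields["jiffies"] - fields["time_stamp"]


# First hang report from this message.
SAMPLE = """\
[361106.726604]   TDH                  c5
[361106.726606]   TDT                  c7
[361106.726609] buffer_info[next_to_clean]:
[361106.726610]   time_stamp           105605cd5
[361106.726612]   jiffies              105605e51
[361106.726615] MAC Status             80383
[361106.726617] PHY 1000BASE-T Status  3800
""".splitlines()

fields = parse_hang_dump(SAMPLE)
HZ = 250  # assumption: kernel tick rate, not stated anywhere in the thread
delta = stalled_jiffies(fields)
print(f"stalled for {delta} jiffies (~{delta / HZ:.2f} s at HZ={HZ})")
```

Lines without a trailing hex value, such as the buffer_info[next_to_clean]: header, are skipped rather than treated as errors.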

Thank you again for all the help


--Andrew



On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N
tushar.n.d...@intel.com wrote:
 We can find the reason now.
 Please re-enable TSO.
 Then run 'ethtool -s ethx msglvl 0x2c01'. This will enable debug code that 
 logs HW ring data (into the dmesg log) when a Tx hang occurs. When the issue 
 occurs next time, please send me the full dmesg log.

 -Tushar

-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 18, 2012 6:24 AM
To: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Thus far disabling TSO via ethtool has seemed to work - can anyone explain
the technical reason why this appears to have fixed the issue?

--Andrew

On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote:
 Sorry folks, but I just realized that I hadn't been replying to the
 list properly and instead I was mistakenly  emailing Dave directly.

 I'm consolidating and re-sending the information to the list.

 BIOS on the HP N40L does not specify any options for AER or PCIe error
 management, or packet size (referenced in another thread)

 I have also tried to disable PCIe power management, with no success.

 I did see one option in the BIOS relating to ACPI functionality, and
 since a document that Dave sent me says the AER kernel driver
 may not be loaded if certain ACPI modules are loaded, I will disable
 this and check for errors. I don't have convenient physical access to
 the server so this will take a few days.

 I am attaching the dmesg and lspci -vvv (as root) output to this
message.

 Thanks for all the help folks.

 --Andrew

 On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N
tushar.n.d...@intel.com wrote:
-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 11, 2012 8:50 AM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Folks, I've been getting some strange error messages in my home
server / router that I've been having trouble debugging. I'm decently
proficient in Linux, but I fear I'm in over my head with this one.

The hardware is a HP N40L Microserver - here are the hardware details
- http://n40l.wikia.com/wiki/Base_Hardware

I am running Debian Squeeze 6.0:
pengc99@gaia:/$ sudo uname -a
Linux gaia 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64
GNU/Linux

I also subscribe to Ksplice's Uptrack system but since I have the
newest kernel installed (as released by Debian) there have been no
hot-patches yet.

This is the message I've been getting in /var/log/kern.log:
Jul 11 08:55:38 gaia kernel: [402056.009687] e1000e :02:00.0:
eth1: Detected Hardware Unit Hang:
Jul 11 08:55:38 gaia kernel: [402056.009690]   TDH
fc
Jul 11 08:55:38 gaia kernel: [402056.009692

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Flavio Leitner

Those messages remind me of this bug:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=09357b00255c233705b1cf6d76a8d147340545b8

and then:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf03085f85112eac2d19036ea3003071220285bb

Can you check if you have those patches applied?

fbl


On Thu, 19 Jul 2012 18:42:16 -0500
Andrew Peng peng...@gmail.com wrote:

 Attached is the dmesg output. Please let me know if this looks right.
 There are two instances of the error here:
 
 [361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
 [361106.726604]   TDH                  c5
 [361106.726606]   TDT                  c7
 [361106.726607]   next_to_use          c7
 [361106.726608]   next_to_clean        c5
 [361106.726609] buffer_info[next_to_clean]:
 [361106.726610]   time_stamp           105605cd5
 [361106.726611]   next_to_watch        c5
 [361106.726612]   jiffies              105605e51
 [361106.726614]   next_to_watch.status 0
 [361106.726615] MAC Status             80383
 [361106.726616] PHY Status             792d
 [361106.726617] PHY 1000BASE-T Status  3800
 [361106.726618] PHY Extended Status    3000
 [361106.726619] PCI Status             10
 
 [411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
 [411932.038651]   TDH                  3d
 [411932.038652]   TDT                  3f
 [411932.038653]   next_to_use          3f
 [411932.038654]   next_to_clean        3d
 [411932.038655] buffer_info[next_to_clean]:
 [411932.038657]   time_stamp           106223f55
 [411932.038658]   next_to_watch        3d
 [411932.038659]   jiffies              106224069
 [411932.038660]   next_to_watch.status 0
 [411932.038661] MAC Status             80383
 [411932.038662] PHY Status             792d
 [411932.038663] PHY 1000BASE-T Status  3800
 [411932.038664] PHY Extended Status    3000
 [411932.038665] PCI Status             10
 
 [422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
 [422584.120475]   TDH                  15
 [422584.120477]   TDT                  16
 [422584.120478]   next_to_use          16
 [422584.120479]   next_to_clean        15
 [422584.120480] buffer_info[next_to_clean]:
 [422584.120481]   time_stamp           1064ae19c
 [422584.120483]   next_to_watch        15
 [422584.120484]   jiffies              1064ae2d6
 [422584.120485]   next_to_watch.status 0
 [422584.120486] MAC Status             80383
 [422584.120487] PHY Status             792d
 [422584.120488] PHY 1000BASE-T Status  3800
 [422584.120489] PHY Extended Status    3000
 [422584.120491] PCI Status             10
 
 Thank you again for all the help
 
 
 --Andrew
 
 
 
 On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N
 tushar.n.d...@intel.com wrote:
  We can find the reason now.
  Please re-enable TSO.
  Then run ethtool -s ethx msglvl 0x2c01. This enables debug code that
  logs HW ring data to the dmesg log when a Tx hang occurs. The next time
  the issue occurs, please send me the full dmesg log.
 
  -Tushar
 
 -Original Message-
 From: Andrew Peng [mailto:peng...@gmail.com]
 Sent: Wednesday, July 18, 2012 6:24 AM
 To: e1000-devel@lists.sourceforge.net
 Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang
 
 Thus far disabling TSO via ethtool has seemed to work - can anyone explain
 the technical reason why this appears to have fixed the issue?
 
 --Andrew
 
 On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote:
  Sorry folks, but I just realized that I hadn't been replying to the
  list properly and instead I was mistakenly  emailing Dave directly.
 
  I'm consolidating and re-sending the information to the list.
 
  BIOS on the HP N40L does not specify any options for AER or PCIe error
  management, or packet size (referenced in another thread)
 
  I have also tried disabling PCIe power management, with no success.
 
  I did see one option in the BIOS relating to ACPI functionality. A
  document that Dave sent me says the AER kernel driver may not be
  loaded if certain ACPI modules are loaded, so I will disable this
  option and check for errors. I don't have convenient physical access
  to the server, so this will take a few days.
 
  I am attaching the dmesg and lspci -vvv (as root) output to this
 message.
 
  Thanks for all the help folks.
 
  --Andrew
 
  On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N
 tushar.n.d...@intel.com wrote:
 -Original Message-
 From: Andrew Peng [mailto:peng...@gmail.com]
 Sent: Wednesday, July 11, 2012 8:50 AM
 To: e1000-devel@lists.sourceforge.net
 Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang
 
 Folks, I've been getting some strange error messages in my home
 server / router that I've been having trouble debugging. I'm decently
 proficient in Linux, but I fear I'm in over my head with this one.
 
 The hardware is a HP N40L Microserver - here are the hardware details
 - http://n40l.wikia.com/wiki/Base_Hardware
 
 I am running

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Dave, Tushar N
Andrew,

I don't think the current message level is set correctly.
Have you run 'ethtool -s ethx msglvl 0x2c01'?
I don't see the HW ring dump in the log.
Please confirm that msglvl is set correctly by running 'ethtool ethx'.

-Tushar



-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Thursday, July 19, 2012 4:42 PM
To: Dave, Tushar N
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Attached is the dmesg output. Please let me know if this looks right.
There are two instances of the error here:

[361106.726601] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[361106.726604]   TDH                  c5
[361106.726606]   TDT                  c7
[361106.726607]   next_to_use          c7
[361106.726608]   next_to_clean        c5
[361106.726609] buffer_info[next_to_clean]:
[361106.726610]   time_stamp           105605cd5
[361106.726611]   next_to_watch        c5
[361106.726612]   jiffies              105605e51
[361106.726614]   next_to_watch.status 0
[361106.726615] MAC Status             80383
[361106.726616] PHY Status             792d
[361106.726617] PHY 1000BASE-T Status  3800
[361106.726618] PHY Extended Status    3000
[361106.726619] PCI Status             10

[411932.038648] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[411932.038651]   TDH                  3d
[411932.038652]   TDT                  3f
[411932.038653]   next_to_use          3f
[411932.038654]   next_to_clean        3d
[411932.038655] buffer_info[next_to_clean]:
[411932.038657]   time_stamp           106223f55
[411932.038658]   next_to_watch        3d
[411932.038659]   jiffies              106224069
[411932.038660]   next_to_watch.status 0
[411932.038661] MAC Status             80383
[411932.038662] PHY Status             792d
[411932.038663] PHY 1000BASE-T Status  3800
[411932.038664] PHY Extended Status    3000
[411932.038665] PCI Status             10

[422584.120473] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
[422584.120475]   TDH                  15
[422584.120477]   TDT                  16
[422584.120478]   next_to_use          16
[422584.120479]   next_to_clean        15
[422584.120480] buffer_info[next_to_clean]:
[422584.120481]   time_stamp           1064ae19c
[422584.120483]   next_to_watch        15
[422584.120484]   jiffies              1064ae2d6
[422584.120485]   next_to_watch.status 0
[422584.120486] MAC Status             80383
[422584.120487] PHY Status             792d
[422584.120488] PHY 1000BASE-T Status  3800
[422584.120489] PHY Extended Status    3000
[422584.120491] PCI Status             10

Thank you again for all the help


--Andrew



On Wed, Jul 18, 2012 at 11:53 AM, Dave, Tushar N tushar.n.d...@intel.com
wrote:
 We can find the reason now.
 Please re-enable TSO.
 Then run ethtool -s ethx msglvl 0x2c01. This enables debug code
 that logs HW ring data to the dmesg log when a Tx hang occurs. The next
 time the issue occurs, please send me the full dmesg log.

 -Tushar

-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 18, 2012 6:24 AM
To: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Thus far disabling TSO via ethtool has seemed to work - can anyone
explain the technical reason why this appears to have fixed the issue?

--Andrew

On Mon, Jul 16, 2012 at 3:47 PM, Andrew Peng peng...@gmail.com wrote:
 Sorry folks, but I just realized that I hadn't been replying to the
 list properly and instead I was mistakenly  emailing Dave directly.

 I'm consolidating and re-sending the information to the list.

 BIOS on the HP N40L does not specify any options for AER or PCIe
 error management, or packet size (referenced in another thread)

 I have also tried disabling PCIe power management, with no success.

 I did see one option in the BIOS relating to ACPI functionality. A
 document that Dave sent me says the AER kernel driver may not be
 loaded if certain ACPI modules are loaded, so I will disable this
 option and check for errors. I don't have convenient physical access
 to the server, so this will take a few days.

 I am attaching the dmesg and lspci -vvv (as root) output to this
message.

 Thanks for all the help folks.

 --Andrew

 On Wed, Jul 11, 2012 at 8:37 PM, Dave, Tushar N
tushar.n.d...@intel.com wrote:
-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 11, 2012 8:50 AM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Folks, I've been getting some strange error messages in my home
server / router that I've been having trouble debugging. I'm
decently proficient in Linux, but I fear I'm in over my head with
this one.

The hardware is a HP N40L Microserver - here are the hardware
details
- http://n40l.wikia.com/wiki/Base_Hardware

I am running Debian Squeeze 6.0:
pengc99@gaia:/$ sudo uname -a
Linux gaia 2.6.32-5-amd64 #1 SMP

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Andrew Peng
Flavio;

I am using the stock kernel driver with the stock Debian Squeeze kernel.

Tushar;

I've double checked that the message level is set correctly:
Current message level: 0x2c01 (11265)
Link detected: yes

However, I just checked all of the logs on the server and I do not see
a HW ring dump.
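
For reference, the message level reported above can be broken down into its individual flags. The sketch below assumes the NETIF_MSG_* bit values from the Linux kernel's netdevice.h; the helper name decode_msglvl is made up for illustration and is not part of ethtool or the driver.

```python
# NETIF_MSG_* bit values, assumed to match the Linux kernel's netdevice.h.
NETIF_MSG_BITS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(level):
    """Return the names of the message-level flags set in `level` (hypothetical helper)."""
    return [name for bit, name in sorted(NETIF_MSG_BITS.items()) if level & bit]


# 0x2c01 is the value reported by 'ethtool ethx' in this thread.
flags = decode_msglvl(0x2C01)
print(flags)
```

Under that assumption, 0x2c01 (11265 decimal) selects the drv, tx_done, rx_status and hw flags, so the value itself looks as requested even though no ring dump showed up in the logs.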

Thanks all again for help


--Andrew

On Thu, Jul 19, 2012 at 7:46 PM, Dave, Tushar N tushar.n.d...@intel.com wrote:
 Andrew,

 I don't think the current message level is set correctly.
 Have you run 'ethtool -s ethx msglvl 0x2c01'?
 I don't see the HW ring dump in the log.
 Please confirm that msglvl is set correctly by running 'ethtool ethx'.

 -Tushar




Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Flavio Leitner
On Thu, 19 Jul 2012 20:17:14 -0500
Andrew Peng peng...@gmail.com wrote:

 Flavio;
 
 I am using the stock kernel driver with the stock Debian Squeeze kernel.


Well, I don't have the Debian kernel sources handy to check, but
based on the version 2.6.32-5-amd64, it sounds like you don't have
those patches.

I pointed to that patch because your card supports the write-back
feature and TDT and TDH are close to each other (less than 4 apart),
which is a signature of the bug fixed by the first patch.

fbl
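
The "less than 4" check can be reproduced from the hang report itself. This is only a rough sketch: the sample values are copied from the first hang in this thread, the ring size of 256 descriptors is an assumption (it only matters if the tail has wrapped around), and the function names are made up for illustration.

```python
import re

# Values copied from the first hang report in this thread (all hexadecimal).
SAMPLE = """\
TDH                  c5
TDT                  c7
next_to_use          c7
next_to_clean        c5
time_stamp           105605cd5
jiffies              105605e51
"""


def parse_hang_dump(text):
    """Return {field name: int} for every 'name hexvalue' line in the dump."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\S.*?)\s+([0-9a-fA-F]+)\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def pending_descriptors(fields, ring_size=256):
    """Descriptors handed to the hardware (TDT) but not yet processed (TDH).

    ring_size is an assumed value, not taken from the thread.
    """
    return (fields["TDT"] - fields["TDH"]) % ring_size


fields = parse_hang_dump(SAMPLE)
gap = pending_descriptors(fields)
stalled_jiffies = fields["jiffies"] - fields["time_stamp"]
print(f"TDT - TDH = {gap}, stalled for {stalled_jiffies} jiffies")
```

For the three hangs reported in this thread the gap works out to 2, 2 and 1 descriptors, all below 4.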

 Tushar;
 
 I've double checked that the message level is set correctly:
 Current message level: 0x2c01 (11265)
 Link detected: yes
 
 However, I just checked all of the logs on the server and I do not see
 a HW ring dump.
 
 Thanks all again for help
 
 
 --Andrew
 

Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-19 Thread Dave, Tushar N
In that case, you can use our out-of-tree e1000e driver from SourceForge (which
should have the patches mentioned by Flavio).

-Tushar

-Original Message-
From: Flavio Leitner [mailto:f...@redhat.com]
Sent: Thursday, July 19, 2012 6:39 PM
To: Andrew Peng
Cc: Dave, Tushar N; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

On Thu, 19 Jul 2012 20:17:14 -0500
Andrew Peng peng...@gmail.com wrote:

 Flavio;

 I am using the stock kernel driver with the stock Debian Squeeze kernel.


Well, I don't have the Debian kernel sources handy to check, but based on
the version 2.6.32-5-amd64, it sounds like you don't have those patches.

I pointed to that patch because your card supports the write-back feature and
TDT and TDH are close to each other (less than 4 apart), which is a signature
of the bug fixed by the first patch.

fbl

 Tushar;

 I've double checked that the message level is set correctly:
 Current message level: 0x2c01 (11265)
 Link detected: yes

 However, I just checked all of the logs on the server and I do not see
 a HW ring dump.

 Thanks all again for help


 --Andrew


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Ben Hutchings
On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
 On Sun, 15 Jul 2012, Dave, Tushar N wrote:
  Somehow setting max payload to 256 from BIOS does not set this value for 
  all devices. I believe this is a BIOS bug.
 
 And preferably, Linux should complain about it.  Since we know it is going
 to cause problems, and since we know it does happen, we should be raising a
 ruckus about it in the kernel log (and probably fixing it to min(path) while
 at it)...
 
 Is this something that should be raised as a feature request with the
 PCI/PCIe subsystem?

The feature is there, but we ended up with:

commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
Author: Jon Mason ma...@myri.com
Date:   Mon Oct 3 09:50:20 2011 -0500

PCI: Disable MPS configuration by default

But you are welcome to share use of the fixup_mpss_256() quirk.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Henrique de Moraes Holschuh
On Mon, 16 Jul 2012, Ben Hutchings wrote:
 On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
  On Sun, 15 Jul 2012, Dave, Tushar N wrote:
   Somehow setting max payload to 256 from BIOS does not set this value for 
   all devices. I believe this is a BIOS bug.
  
  And preferably, Linux should complain about it.  Since we know it is going
  to cause problems, and since we know it does happen, we should be raising a
  ruckus about it in the kernel log (and probably fixing it to min(path) while
  at it)...
  
  Is this something that should be raised as a feature request with the
  PCI/PCIe subsystem?
 
 The feature is there, but we ended up with:
 
 commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
 Author: Jon Mason ma...@myri.com
 Date:   Mon Oct 3 09:50:20 2011 -0500
 
 PCI: Disable MPS configuration by default
 
 But you are welcome to share use of the fixup_mpss_256() quirk.

Meh.  I'd be happy with a warning if MPSS decreases when walking up to
the tree root... i.e. a warning if any child has a MPSS larger than the
parent.
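
To make the proposed warning concrete, here is a rough sketch of such a check over a device tree. It is not kernel code: the tree layout, device names and payload sizes are hypothetical, and the function name is made up for illustration.

```python
def find_mps_warnings(node, parent=None):
    """Yield a warning for each device whose payload size exceeds its parent's.

    node is a dict with 'name', 'mps' (bytes) and an optional 'children' list.
    """
    if parent is not None and node["mps"] > parent["mps"]:
        yield (f"{node['name']}: MPS {node['mps']} > "
               f"parent {parent['name']} MPS {parent['mps']}")
    for child in node.get("children", []):
        yield from find_mps_warnings(child, node)


# Hypothetical topology: names and sizes are made up for illustration.
tree = {
    "name": "root port", "mps": 128, "children": [
        {"name": "bridge", "mps": 256, "children": [
            {"name": "82571EB", "mps": 256},
        ]},
        {"name": "other NIC", "mps": 128},
    ],
}

found = list(find_mps_warnings(tree))
for message in found:
    print("WARNING:", message)
```

In this made-up tree only the bridge is flagged, because the NIC below it matches the bridge and the mismatch sits on the link between the bridge and the root port.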

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Jon Mason
On Mon, Jul 16, 2012 at 9:08 AM, Henrique de Moraes Holschuh
h...@hmh.eng.br wrote:
 On Mon, 16 Jul 2012, Ben Hutchings wrote:
 On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
  On Sun, 15 Jul 2012, Dave, Tushar N wrote:
   Somehow setting max payload to 256 from BIOS does not set this value for 
   all devices. I believe this is a BIOS bug.
 
  And preferably, Linux should complain about it.  Since we know it is going
  to cause problems, and since we know it does happen, we should be raising a
  ruckus about it in the kernel log (and probably fixing it to min(path) while
  at it)...
 
  Is this something that should be raised as a feature request with the
  PCI/PCIe subsystem?

 The feature is there, but we ended up with:

 commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
 Author: Jon Mason ma...@myri.com
 Date:   Mon Oct 3 09:50:20 2011 -0500

 PCI: Disable MPS configuration by default

 But you are welcome to share use of the fixup_mpss_256() quirk.

 Meh.  I'd be happy with a warning if MPSS decreases when walking up to
 the tree root... i.e. a warning if any child has a MPSS larger than the
 parent.

You can add pci=pcie_bus_safe to the kernel params and it should
resolve your issue.
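
A quick way to confirm which pci= options a machine actually booted with is to look at its kernel command line. The sketch below parses a command-line string; the sample string is hypothetical, and on a live system the text would come from /proc/cmdline instead.

```python
def pci_options(cmdline):
    """Return the comma-separated options of every pci= parameter in cmdline."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(token[len("pci="):].split(","))
    return options


# Hypothetical command line, made up for illustration.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet pci=pcie_bus_safe"
opts = pci_options(sample)
print("pcie_bus_safe" in opts)
```

This only shows what was requested at boot; whether the payload sizes were really adjusted still has to be checked in the lspci -vvv output.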

 --
   One disk to rule them all, One disk to find them. One disk to bring
   them all and in the darkness grind them. In the Land of Redmond
   where the shadows lie. -- The Silicon Valley Tarot
   Henrique Holschuh



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-15 Thread Henrique de Moraes Holschuh
On Sun, 15 Jul 2012, Dave, Tushar N wrote:
 Somehow setting max payload to 256 from BIOS does not set this value for all 
 devices. I believe this is a BIOS bug.

And preferably, Linux should complain about it.  Since we know it is going
to cause problems, and since we know it does happen, we should be raising a
ruckus about it in the kernel log (and probably fixing it to min(path) while
at it)...

Is this something that should be raised as a feature request with the
PCI/PCIe subsystem?

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-14 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Thursday, July 12, 2012 9:34 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/13/12 12:10, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Thursday, July 12, 2012 4:46 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 Thanks for sending the full dmesg log. I am still investigating. I think
 this issue can occur if two PCIe link partners (i.e. a PCIe bridge and a
 PCIe device) do not have the same max payload size.
 I need two more pieces of information:
 1) PBA number of the card.

This is a remote server and I could not get this.

 2) full lspci -vvv output of entire system 'after you have changed max
payload size to 128'.

Somehow, setting max payload to 256 from the BIOS does not set this value for
all devices. I believe this is a BIOS bug.
All devices in the path from the root complex to the 82571 should have the same
max payload size; otherwise it can cause a hang. When you set max payload to
128 from the BIOS, all devices in the path from the root complex to the 82571
were assigned the same max payload size. This resolves the issue.
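
The rule above can be sketched as a small check over the devices on the path. The payload sizes would normally be read from the MaxPayload value that lspci -vvv prints in each device's DevCtl line; the device list below is hypothetical and the function name is made up for illustration.

```python
def check_mps_path(path):
    """Check the payload sizes along one root-complex-to-device path.

    path is a list of (device, mps_bytes) tuples ordered from the root
    complex down to the NIC. Returns (consistent, smallest_mps), where
    smallest_mps is the largest size that every device on the path accepts.
    """
    sizes = [mps for _, mps in path]
    return len(set(sizes)) == 1, min(sizes)


# Hypothetical path: bus addresses and sizes are made up for illustration.
path = [
    ("00:02.0 root port", 128),
    ("01:00.0 bridge", 256),
    ("02:00.0 82571EB", 256),
]

consistent, smallest = check_mps_path(path)
print(f"consistent={consistent}, smallest MPS on path={smallest}")
```

A mixed path like this one is reported as inconsistent, and the smallest value on it (128 here) is the setting that would make every link agree.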

I hope this helps.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-14 Thread Joe Jin
On 07/15/12 11:42, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Thursday, July 12, 2012 9:34 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/13/12 12:10, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Thursday, July 12, 2012 4:46 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 Thanks for sending the full dmesg log. I am still investigating. I think
 this issue can occur if two PCIe link partners (i.e. a PCIe bridge and a
 PCIe device) do not have the same max payload size.
 I need two more pieces of information:
 1) PBA number of the card.

 This is a remote server and I could not get this.

 2) full lspci -vvv output of entire system 'after you have changed max
 payload size to 128'.
 
 Somehow setting max payload to 256 from BIOS does not set this value for all 
 devices. I believe this is a BIOS bug.
 All devices in the path from the root complex to the 82571 should have the same
 max payload size; otherwise it can cause a hang. When you set max payload to
 128 from the BIOS, all devices in the path from the root complex to the 82571
 were assigned the same max payload size. This resolves the issue.
 
 I hope this helps.

Tushar,

Thanks a lot for your help; I will send this to the hardware engineer.

Regards,
Joe
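
The rule in Tushar's explanation, that every device between the root complex and the 82571 needs the same programmed max payload size, can be checked from lspci -vvv output. Below is a minimal Python sketch, assuming the usual lspci layout. The sample text, the 00:03.0 bridge and its 256-byte value are invented for illustration and are not taken from this system.

```python
import re

# Hypothetical excerpt in the style of `lspci -vvv`; the bridge entry and its
# 256-byte value are made up to show a mismatch.
SAMPLE = """\
00:03.0 PCI bridge: Example root port
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 256 bytes, MaxReadReq 128 bytes
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 128 bytes, MaxReadReq 512 bytes
"""

# A device header starts at column 0 with bus:device.function.
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# The programmed (DevCtl) value sits on its own line; the DevCap line starts
# with "DevCap:" and is therefore not matched.
MPS_RE = re.compile(r"^\s*MaxPayload (\d+) bytes, MaxReadReq")


def programmed_mps(lspci_text):
    """Map each device address to the MaxPayload programmed in DevCtl."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        mps = MPS_RE.match(line)
        if mps and current is not None:
            result[current] = int(mps.group(1))
    return result


def mps_consistent(path, mps_by_device):
    """True when every device on the path has the same programmed MPS."""
    return len({mps_by_device[dev] for dev in path}) == 1


mps = programmed_mps(SAMPLE)
print(mps, mps_consistent(["00:03.0", "05:00.0"], mps))
```

On a real machine the text would come from lspci -vvv, and the path would be the bridges above the NIC (lspci -t shows the tree).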




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N
On 07/12/12 13:57, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 8:13 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/12/12 11:07, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 7:58 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error message in the BIOS log?

 Error message from BIOS event log:
 07/12/12 05:54:00
PCI Express Non-Fatal Error

 Thanks,
 Joe
 Hi Tushar,

 Please find the EEPROM dump in the attachment.

 Do you have an lspci -vvv dump of the entire system before and after the
 issue occurs? If you do, can you send it to me?


Sorry, but I meant the full lspci -vvv output of the *entire system* before and
after the issue occurs, not of the 82571 only.




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Thursday, July 12, 2012 12:11 AM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 14:41, Dave, Tushar N wrote:
 On 07/12/12 13:57, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 8:13 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/12/12 11:07, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 7:58 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error message in the BIOS log?

 Error message from BIOS event log:
 07/12/12 05:54:00
PCI Express Non-Fatal Error

 Thanks,
 Joe
 Hi Tushar,

 Please find the EEPROM dump in the attachment.

 Do you have an lspci -vvv dump of the entire system before and after the
 issue occurs? If you do, can you send it to me?


 Sorry, but I meant the full lspci -vvv output of the *entire system* before
 and after the issue occurs, not of the 82571 only.


Before:
===
00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
   Subsystem: Oracle Corporation Device 5352

Joe, thanks for all the data.
You said that you changed the max payload size and the issue stopped occurring.
How did you change it? Did you make the change in the BIOS, in the EEPROM, or
in PCIe config space?
Also, please send me the full dmesg of the entire system after you change the
max payload size.

Thanks.



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Thursday, July 12, 2012 4:46 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

Thanks for sending the full dmesg log. I am still investigating. I think this
issue can occur if two PCIe link partners (i.e. the PCIe bridge and the PCIe
device) do not have the same max payload size.
I need two more pieces of information:
1) PBA number of the card.
2) Full lspci -vvv output of the entire system 'after you have changed max
payload size to 128'.

Thanks.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 10:03 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, did you
install the RHEL5/6 kernels and reproduce it? If so, I think I should install
RHEL6 and try to reproduce it locally!

Yes, I reproduced this on both RHEL5 and RHEL6.

So far, copying a big file (~1GB) with scp triggers it at once.

Thanks,
Joe

Joe,
Can you please send the lspci -vvv output for the failing port before the issue occurs?
Thanks.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin
On 07/11/12 15:11, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Tuesday, July 10, 2012 10:03 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, did you
 install the RHEL5/6 kernels and reproduce it? If so, I think I should install
 RHEL6 and try to reproduce it locally!

 Yes, I reproduced this on both RHEL5 and RHEL6.

 So far, copying a big file (~1GB) with scp triggers it at once.

 Thanks,
 Joe
 
 Joe,
 Can you please send the lspci -vvv output for the failing port before the issue occurs?
 Thanks.
 
# lspci -s 05:00.0 -vvv
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
    Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP Low Profile Adapter
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Interrupt: pin B routed to IRQ 80
    Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at dc00 [size=32]
    Expansion ROM at fbda [disabled] [size=128K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: fee21000  Data: 40cb
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 4us, L1 64us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
    Kernel driver in use: e1000e
    Kernel modules: e1000e


Thanks,
Joe
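
In the output above, a trailing + marks a set bit and a trailing - a clear one. A small Python sketch, assuming only that flag format, pulls the asserted error bits out of the DevSta line (re-joined here after mail wrapping). The function names are illustrative.

```python
import re

# DevSta line from the lspci output above, re-joined onto one line.
DEVSTA = "DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-"


def parse_flags(line):
    """Split an lspci flag line into {name: bool}; '+' is set, '-' is clear."""
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", rest)}


def asserted(line, names):
    """Return the subset of names whose flag is set on the given line."""
    flags = parse_flags(line)
    return [name for name in names if flags.get(name)]


print(asserted(DEVSTA, ["CorrErr", "UncorrErr", "FatalErr", "UnsuppReq"]))
```

For this line the set error bits are UncorrErr and UnsuppReq; AuxPwr is a power indication, not an error.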



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin
On 07/11/12 15:37, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 12:18 AM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 15:11, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Tuesday, July 10, 2012 10:03 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, did you
 install the RHEL5/6 kernels and reproduce it? If so, I think I should
 install RHEL6 and try to reproduce it locally!

 Yes, I reproduced this on both RHEL5 and RHEL6.

 So far, copying a big file (~1GB) with scp triggers it at once.

 Thanks,
 Joe

 Joe,
 Can you please send the lspci -vvv output for the failing port before the
 issue occurs?
 Thanks.

 # lspci -s 05:00.0 -vvv
 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
 Controller (Copper) (rev 06)
  Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
 UTP Low Profile Adapter
  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B- DisINTx+
  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
 TAbort- MAbort- SERR- PERR- INTx-
  Latency: 0, Cache Line Size: 256 bytes
  Interrupt: pin B routed to IRQ 80
  Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
  Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
  Region 2: I/O ports at dc00 [size=32]
  Expansion ROM at fbda [disabled] [size=128K]
  Capabilities: [c8] Power Management version 2
  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
 ,D3hot+,D3cold+)
  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
  Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
  Address: fee21000  Data: 40cb
  Capabilities: [e0] Express (v1) Endpoint, MSI 00
  DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
 512ns, L1 64us
  ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
 Unsupported-
  RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
  MaxPayload 128 bytes, MaxReadReq 512 bytes
  DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
 TransPend-
  LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
 Latency L0 4us, L1 64us
  ClockPM- Surprise- LLActRep- BwNot-
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
 CommClk-
  ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
 DLActive- BWMgmt- ABWMgmt-
  Capabilities: [100 v1] Advanced Error Reporting
  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
 RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
 RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
  UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
 UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
  AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
 ChkEn-
  Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
  Kernel driver in use: e1000e
  Kernel modules: e1000e


 Thanks,
 Joe
 
 Was this lspci output taken on a freshly booted system?
 

Yes. Did you find any issues?

Thanks,
Joe





Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 12:39 AM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 15:37, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 12:18 AM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 15:11, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Tuesday, July 10, 2012 10:03 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, did you
 install the RHEL5/6 kernels and reproduce it? If so, I think I should
 install RHEL6 and try to reproduce it locally!

 Yes, I reproduced this on both RHEL5 and RHEL6.

 So far, copying a big file (~1GB) with scp triggers it at once.

 Thanks,
 Joe

 Joe,
 Can you please send the lspci -vvv output for the failing port before the
 issue occurs?
 Thanks.

 # lspci -s 05:00.0 -vvv
 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
 Ethernet Controller (Copper) (rev 06)
 Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
 UTP Low Profile Adapter
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
 Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
 TAbort- MAbort- SERR- PERR- INTx-
 Latency: 0, Cache Line Size: 256 bytes
 Interrupt: pin B routed to IRQ 80
 Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
 Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
 Region 2: I/O ports at dc00 [size=32]
 Expansion ROM at fbda [disabled] [size=128K]
 Capabilities: [c8] Power Management version 2
 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
 ,D3hot+,D3cold+)
 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
 Address: fee21000  Data: 40cb
 Capabilities: [e0] Express (v1) Endpoint, MSI 00
 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
 512ns, L1 64us
 ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
 DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
 Unsupported-
 RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 MaxPayload 128 bytes, MaxReadReq 512 bytes
 DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
 TransPend-
 LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
 Latency L0 4us, L1 64us
 ClockPM- Surprise- LLActRep- BwNot-
 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
 CommClk-
 ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
 LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
 DLActive- BWMgmt- ABWMgmt-
 Capabilities: [100 v1] Advanced Error Reporting
 UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
 RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
 RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
 UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
 AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
 ChkEn-
 Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
 Kernel driver in use: e1000e
 Kernel modules: e1000e


 Thanks,
 Joe

 Was this lspci output taken on a freshly booted system?


Yes. Did you find any issues?

Thanks,
Joe


The device status and AER sections show some errors that look a little
suspicious to me, but I'm not too sure. I will get back to you tomorrow.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin
On 07/11/12 15:50, Dave, Tushar N wrote:
 The device status and AER sections show some errors that look a little
 suspicious to me, but I'm not too sure. I will get back to you tomorrow.
 

Thanks a lot, Tushar!

Joe


-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 





Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 10:03 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, did you
install the RHEL5/6 kernels and reproduce it? If so, I think I should install
RHEL6 and try to reproduce it locally!

Yes, I reproduced this on both RHEL5 and RHEL6.

So far, copying a big file (~1GB) with scp triggers it at once.

Thanks,
Joe

Joe,

I see a couple of errors in the lspci output.
The Device Status register shows an uncorrectable PCIe error. This means
something certainly went wrong. The only way to recover from uncorrectable
errors is a reset.

DevSta: CorrErr- *UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-

Also, the AER section of the lspci output shows a PCIe completion timeout.

Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-

I suggest you load the AER driver and check for any error messages in the log.
Also, please check for any error messages reported by the system in the BIOS
log. Are there any machine check errors?

When did you first notice this issue? Has the 82571 ever worked on this
server?

One more thing: a cache line size of 256 is a little unusual (I have never
seen this value before; mostly it's 64). Have the BIOS settings been changed?
Are you using the default BIOS settings?

Thanks.

-Tushar
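
For readers decoding the UESta line by hand, here is a short Python sketch of the AER Uncorrectable Error Status bit layout, keyed by the abbreviations lspci prints. The register value is reconstructed from the flags quoted above and is not a raw dump. Reading the "First Error Pointer: 12" field from the earlier lspci output as hexadecimal is an assumption about lspci's formatting.

```python
# Bit positions in the AER Uncorrectable Error Status register, named with the
# abbreviations that lspci prints.
UNCOR_BITS = {
    4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
    15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
    19: "ECRC", 20: "UnsupReq", 21: "ACSViol",
}


def decode_uesta(value):
    """List the error names whose bits are set in a UESta register value."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if value & (1 << bit)]


def first_error(pointer_text):
    """Name the bit a First Error Pointer refers to (assumes hex text)."""
    return UNCOR_BITS.get(int(pointer_text, 16), "unknown")


# Reconstructed from the quoted flags: CmpltTO+ (bit 14), MalfTLP+ (bit 18),
# UnsupReq+ (bit 20).
UESTA = (1 << 14) | (1 << 18) | (1 << 20)

print(decode_uesta(UESTA), first_error("12"))
```

Under that hex assumption the first recorded error would be a malformed TLP, which is the error class a receiver reports for a payload larger than its programmed max payload size.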

  






Re: [E1000-devel] 82571EB - Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Andrew Peng [mailto:peng...@gmail.com]
Sent: Wednesday, July 11, 2012 8:50 AM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] 82571EB - Detected Hardware Unit Hang

Folks, I've been getting some strange error messages in my home server /
router that I've been having trouble debugging. I'm decently proficient in
Linux, but I fear I'm in over my head with this one.

The hardware is a HP N40L Microserver - here are the hardware details
- http://n40l.wikia.com/wiki/Base_Hardware

I am running Debian Squeeze 6.0:
pengc99@gaia:/$ sudo uname -a
Linux gaia 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64
GNU/Linux

I also subscribe to Ksplice's Uptrack system but since I have the newest
kernel installed (as released by Debian) there have been no hot-patches
yet.

This is the message I've been getting in /var/log/kern.log:
Jul 11 08:55:38 gaia kernel: [402056.009687] e1000e :02:00.0: eth1: Detected Hardware Unit Hang:
Jul 11 08:55:38 gaia kernel: [402056.009690]   TDH                  fc
Jul 11 08:55:38 gaia kernel: [402056.009692]   TDT                  fd
Jul 11 08:55:38 gaia kernel: [402056.009693]   next_to_use          fd
Jul 11 08:55:38 gaia kernel: [402056.009694]   next_to_clean        fc
Jul 11 08:55:38 gaia kernel: [402056.009695] buffer_info[next_to_clean]:
Jul 11 08:55:38 gaia kernel: [402056.009697]   time_stamp           105fc92b2
Jul 11 08:55:38 gaia kernel: [402056.009698]   next_to_watch        fc
Jul 11 08:55:38 gaia kernel: [402056.009699]   jiffies              105fc93da
Jul 11 08:55:38 gaia kernel: [402056.009700]   next_to_watch.status 0
Jul 11 08:55:38 gaia kernel: [402056.009701] MAC Status             80383
Jul 11 08:55:38 gaia kernel: [402056.009702] PHY Status             792d
Jul 11 08:55:38 gaia kernel: [402056.009703] PHY 1000BASE-T Status  3800
Jul 11 08:55:38 gaia kernel: [402056.009705] PHY Extended Status    3000
Jul 11 08:55:38 gaia kernel: [402056.009706] PCI Status             10
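
The hex fields in this log can be turned into something easier to read. The Python sketch below computes how many descriptors sit between the hardware head (TDH) and the software tail (TDT), and how many jiffies passed between the buffer's time_stamp and the hang report. The ring size of 256 and HZ of 250 are assumptions and are not stated anywhere in the report.

```python
# Values copied from the "Detected Hardware Unit Hang" log above (hex strings).
LOG = {
    "TDH": "fc", "TDT": "fd",
    "next_to_use": "fd", "next_to_clean": "fc",
    "time_stamp": "105fc92b2", "jiffies": "105fc93da",
}

RING_SIZE = 256  # assumption: TX ring size; not given in the report
HZ = 250         # assumption: kernel tick rate; not given in the report


def pending_descriptors(tdh, tdt, ring_size):
    """Descriptors queued to hardware and not yet consumed (wrap-safe)."""
    return (tdt - tdh) % ring_size


def stalled_jiffies(log):
    """Jiffies elapsed between queuing the buffer and the hang report."""
    return int(log["jiffies"], 16) - int(log["time_stamp"], 16)


pending = pending_descriptors(int(LOG["TDH"], 16), int(LOG["TDT"], 16), RING_SIZE)
elapsed = stalled_jiffies(LOG)
print(pending, elapsed, elapsed / HZ)
```

With these values one descriptor is outstanding and 296 jiffies have elapsed, which would be a little over one second if HZ is 250.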

Complete output of lspci:
pengc99@gaia:/$ lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] RS880 Host Bridge
00:01.0 PCI bridge: Hewlett-Packard Company Device 9602
00:02.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge
(ext gfx port 0)
00:06.0 PCI bridge: Advanced Micro Devices [AMD] RS780 PCI to PCI bridge
(PCIE port 2)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller
[AHCI mode] (rev 40)
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0
Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI
Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0
Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI
Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 42)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller
(rev 40)
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
00:16.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0
Controller
00:16.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI
Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor
Link Control
01:05.0 VGA compatible controller: ATI Technologies Inc M880G [Mobility
Radeon HD 4200]
02:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
02:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5723
Gigabit Ethernet PCIe (rev 10)

Output of lspci -vvv (as root, network adapter section):
02:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
Subsystem: Hewlett-Packard Company NC360T PCI Express Dual Port
Gigabit Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
TAbort- MAbort- SERR- PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 26
Region 0: Memory at fe8e (32-bit, non-prefetchable)
[size=128K]
Region 1: Memory at fe8c (32-bit, non-prefetchable)
[size=128K]
Region 2: I/O ports at e800 [size=32]
Expansion ROM at fe8a [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin
On 07/12/12 02:51, Dave, Tushar N wrote:
 
 Joe,
 
 I see a couple of errors in the lspci output.
 The Device Status register shows an uncorrectable PCIe error. This means
 something certainly went wrong. The only way to recover from uncorrectable
 errors is a reset.

   DevSta: CorrErr- *UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
 
 Also, the AER section of the lspci output shows a PCIe completion timeout.
   
   Capabilities: [100 v1] Advanced Error Reporting
   UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
 RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
 
 I suggest you load the AER driver and check for any error messages in the log.
 Also, please check for any error messages reported by the system in the BIOS
 log. Are there any machine check errors?
 
 When did you first notice this issue? Has the 82571 ever worked on this
 server?
 
 One more thing: a cache line size of 256 is a little unusual (I have never
 seen this value before; mostly it's 64). Have the BIOS settings been changed?
 Are you using the default BIOS settings?
 

I checked the BIOS log and found the fault reported for the device. I changed
the PCI-E Payload Size from 256 (default) to 128, and now the device works.

Comparing the lspci output, I found that the Address and Data of the MSI
capability have changed:

Old:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee21000  Data: 40cb

New:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee24000  Data: 405c

Most likely it's a BIOS bug? Please comment.

Thanks,
Joe
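
For reference, the 128 and 256 settings above correspond to a 3-bit field in the PCIe Device Control register: bits 7:5 hold the max payload size and bits 14:12 the max read request size, each encoded as 128 << n bytes. Below is a hedged Python sketch of that encoding. The 0x2810 value is a made-up example consistent with "MaxPayload 128 bytes, MaxReadReq 512 bytes, RlxdOrd+, NoSnoop+" and is not a register dump from this machine.

```python
# Encodings of the 3-bit size fields: field value n means 128 << n bytes.
SIZE_CODES = {128: 0, 256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}


def decode_devctl(devctl):
    """Return (max_payload, max_read_request) in bytes from a 16-bit DevCtl."""
    mps = 128 << ((devctl >> 5) & 0x7)
    mrrs = 128 << ((devctl >> 12) & 0x7)
    return mps, mrrs


def with_mps(devctl, mps_bytes):
    """Return devctl with only the max payload field (bits 7:5) replaced."""
    code = SIZE_CODES[mps_bytes]
    return (devctl & ~(0x7 << 5) & 0xFFFF) | (code << 5)


EXAMPLE = 0x2810  # hypothetical: MRRS=512, NoSnoop+, RlxdOrd+, MPS=128

print(decode_devctl(EXAMPLE), hex(with_mps(EXAMPLE, 256)))
```

Changing the payload size only touches bits 7:5, so a read-modify-write is needed to keep the other DevCtl bits intact.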




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin
On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error message in the BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
PCI Express Non-Fatal Error

Thanks,
Joe



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 7:58 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error message in the BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
PCI Express Non-Fatal Error

Thanks,
Joe

Thanks. I will check with the team tomorrow whether this (the max payload size
change) can be treated as the solution to this issue.
We could learn exactly which non-fatal error occurred if we captured a bus
trace.
We should also check the EEPROM on this device to make sure it is up to date.
Send me the full EEPROM dump in a file and I will confirm with the team that
it is up to date.
Thanks for your work.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 8:13 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 11:07, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Wednesday, July 11, 2012 7:58 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error message in the BIOS log?

 Error message from BIOS event log:
 07/12/12 05:54:00
PCI Express Non-Fatal Error

 Thanks,
 Joe
Hi Tushar,

Please find the EEPROM dump in the attachment.

Do you have an lspci -vvv dump of the entire system from before and after the
issue occurs? If so, can you send it to me?




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin
While debugging the driver, I found that before the "Detected Hardware Unit Hang"
message appears, the driver is unable to clean and reclaim the resources:

        /* upper.data is always 0x300 here */
        while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&
               (count < tx_ring->count)) {
                --- snip ---
        }


I checked all of the driver code and did not find anywhere that sets
E1000_TXD_STAT_DD in upper.data, so I guess upper.data is set by hardware?
If the OS is a 32-bit system, what would happen?

Thanks in advance,
Joe 

On 07/09/12 16:51, Joe Jin wrote:
 Hi list,
 
 I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when running
 an scp test. This issue is easy to reproduce on a SUN FIRE X2270 M2: just
 copying a big file (500M) from another server hits it at once.
 
 Would you please help with this?
 
 device info:
 # lspci -s 05:00.0 
 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
 Controller (Copper) (rev 06)
 
 # lspci -s 05:00.0 -n
 05:00.0 0200: 8086:10bc (rev 06)
 
 # ethtool -i eth0
 driver: e1000e
 version: 2.0.0-NAPI
 firmware-version: 5.10-2
 bus-info: :05:00.0
 
 # ethtool -k eth0
 Offload parameters for eth0:
 rx-checksumming: on
 tx-checksumming: on
 scatter-gather: on
 tcp segmentation offload: on
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: on
 
 kernel log:
 ---
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH                  6c
   TDT                  81
   next_to_use          81
   next_to_clean        6b
 buffer_info[next_to_clean]:
   time_stamp           fffc7a23
   next_to_watch        71
   jiffies              fffc8c0c
   next_to_watch.status 0
 MAC Status             80387
 PHY Status             792d
 PHY 1000BASE-T Status  3c00
 PHY Extended Status    3000
 PCI Status             10
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH                  6c
   TDT                  81
   next_to_use          81
   next_to_clean        6b
 buffer_info[next_to_clean]:
   time_stamp           fffc7a23
   next_to_watch        71
   jiffies              fffc9bac
   next_to_watch.status 0
 MAC Status             80387
 PHY Status             792d
 PHY 1000BASE-T Status  3c00
 PHY Extended Status    3000
 PCI Status             10
 [ cut here ]
 WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
 Hardware name: SUN FIRE X2270 M2
 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
 Modules linked in: autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc 
 cpufreq_ondemand acpi_cpufreq mperf be2iscsi iscsi_boot_sysfs ib_iser rdma_cm 
 ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i 
 libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs 
 sbshc acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U) 
 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device igb 
 snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer snd 
 soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr i7core_edac 
 iTCO_vendor_support ioatdma ghes dca edac_core hed dm_snapshot dm_zero 
 dm_mirror dm_region_hash dm_log dm_mod usb_storage sd_mod crc_t10dif sg ahci 
 libahci ext3 jbd mbcache [last unloaded: microcode]
 Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
 Call Trace:
  [c07d9ac5] ? dev_watchdog+0x225/0x230
  [c045ba61] warn_slowpath_common+0x81/0xa0
  [c07d9ac5] ? dev_watchdog+0x225/0x230
  [c045bb23] warn_slowpath_fmt+0x33/0x40
  [c07d9ac5] dev_watchdog+0x225/0x230
  [c07d98a0] ? dev_activate+0xb0/0xb0
  [c0468e82] call_timer_fn+0x32/0xf0
  [c04bceb0] ? rcu_check_callbacks+0x80/0x80
  [c046a76d] run_timer_softirq+0xed/0x1b0
  [c07d98a0] ? dev_activate+0xb0/0xb0
  [c0461a81] __do_softirq+0x91/0x1a0
  [c04619f0] ? local_bh_enable+0x80/0x80
  IRQ  [c0462295] ? irq_exit+0x95/0xa0
  [c087f8b8] ? smp_apic_timer_interrupt+0x38/0x42
  [c08784f5] ? apic_timer_interrupt+0x31/0x38
  [c046007b] ? do_exit+0x11b/0x370
  [c065eae4] ? intel_idle+0xa4/0x100
  [c078d9b9] ? cpuidle_idle_call+0xb9/0x1e0
  [c0411d77] ? cpu_idle+0x97/0xd0
  [c085cbbd] ? rest_init+0x5d/0x70
  [c0b07a7a] ? start_kernel+0x28a/0x340
  [c0b074b0] ? obsolete_checksetup+0xb0/0xb0
  [c0b070a4] ? i386_start_kernel+0x64/0xb0
 ---[ end trace 5502b55cd4d4e5cb ]---
 e1000e :05:00.0: eth0: Reset adapter
 e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
 
 Thanks,
 Joe
 


-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N
-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Joe Jin
Sent: Tuesday, July 10, 2012 12:40 AM
To: Joe Jin
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

While debugging the driver, I found that before the "Detected Hardware Unit Hang"
message appears, the driver is unable to clean and reclaim the resources:

        /* upper.data is always 0x300 here */
        while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&
               (count < tx_ring->count)) {
                --- snip ---
        }


I checked all of the driver code and did not find anywhere that sets
E1000_TXD_STAT_DD in upper.data, so I guess upper.data is set by hardware?

Yes, upper.data (part of which is the STATUS byte) is set by HW. Basically, the
driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set,
HW has processed that descriptor and the driver can now clean it.
With the value 0x300, the DD bit is not set, which means HW has not processed
that descriptor.
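
For anyone following along, the bit test described above can be tried outside the driver. This is only a sketch in Python, not driver code: it assumes E1000_TXD_STAT_DD is bit 0 (0x00000001), as in the e1000e headers, and it treats upper.data as a plain host-order integer. The helper name descriptor_done is made up for illustration.

```python
# Illustrative only: mirrors the bit test in the driver's Tx cleanup loop.
E1000_TXD_STAT_DD = 0x00000001  # Descriptor Done, bit 0 of the status byte


def descriptor_done(upper_data):
    """Return True if the Descriptor Done bit is set in upper.data."""
    return bool(upper_data & E1000_TXD_STAT_DD)


if __name__ == "__main__":
    # 0x300 is the value reported in this thread; 0x301 is a hypothetical
    # value with the DD bit set, shown only for contrast.
    for value in (0x300, 0x301):
        print(hex(value), "DD set" if descriptor_done(value) else "DD clear")
```

With 0x300 the low bit is clear, so the cleanup loop never enters its body and the descriptors are never reclaimed, which matches what the watchdog later reports as a hang.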

How fast does the tx hang reproduce? I suggest you enable the debug code in the
driver so that when a tx hang occurs it dumps the HW descriptor ring info into
the kernel log.
You can run ethtool -s ethx msglvl 0x2c00 to enable debug.
Once a tx hang occurs, please send me the full dmesg log.

Does the tx hang occur with the in-kernel e1000e driver too?

Thanks.

-Tushar



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N

One change: please use ethtool -s ethx msglvl 0x2c01 instead of 0x2c00, so that
the default 'drv' msglvl stays enabled.
Confirm that the message level is set correctly by running 'ethtool ethx'.
The last few lines will be:

Current message level: 0x2c01 (11265)
   drv tx_done rx_status hw
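
The message level is a bitmask, so the value above can be checked by hand. A minimal sketch, assuming the bit assignments of the NETIF_MSG_* flags in the kernel's include/linux/netdevice.h; decode_msglvl is a hypothetical helper, not part of ethtool:

```python
# Bit-to-name table, assumed to follow the kernel's NETIF_MSG_* flags and the
# names ethtool prints for them.
NETIF_MSG_FLAGS = {
    0x0001: "drv",
    0x0002: "probe",
    0x0004: "link",
    0x0008: "timer",
    0x0010: "ifdown",
    0x0020: "ifup",
    0x0040: "rx_err",
    0x0080: "tx_err",
    0x0100: "tx_queued",
    0x0200: "intr",
    0x0400: "tx_done",
    0x0800: "rx_status",
    0x1000: "pktdata",
    0x2000: "hw",
    0x4000: "wol",
}


def decode_msglvl(level):
    """Return the flag names set in a message-level bitmask, lowest bit first."""
    return [name for bit, name in sorted(NETIF_MSG_FLAGS.items()) if level & bit]


if __name__ == "__main__":
    for level in (0x2C00, 0x2C01):
        print(hex(level), level, " ".join(decode_msglvl(level)))
```

The only difference between 0x2c00 and 0x2c01 is bit 0, the 'drv' flag, which is why the second value keeps the driver's default messages.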





Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin
On 07/11/12 03:02, Dave, Tushar N wrote:
 -Original Message-
 From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
 On Behalf Of Joe Jin
 Sent: Tuesday, July 10, 2012 12:40 AM
 To: Joe Jin
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 While debugging the driver, I found that before the "Detected Hardware Unit Hang"
 message appears, the driver is unable to clean and reclaim the resources:

         /* upper.data is always 0x300 here */
         while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&
                (count < tx_ring->count)) {
                 --- snip ---
         }


 I checked all of the driver code and did not find anywhere that sets
 E1000_TXD_STAT_DD in upper.data, so I guess upper.data is set by hardware?
 
 Yes, upper.data (part of which is the STATUS byte) is set by HW. Basically, the
 driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set,
 HW has processed that descriptor and the driver can now clean it.
 With the value 0x300, the DD bit is not set, which means HW has not processed
 that descriptor.

Thanks for the clarification. Might this be a firmware issue?
 
 How fast does the tx hang reproduce? I suggest you enable the debug code in the
 driver so that when a tx hang occurs it dumps the HW descriptor ring info into
 the kernel log.

As soon as I copy a file from another server, the issue reproduces.
I'll enable the debug output to get more information.

 You can run ethtool -s ethx msglvl 0x2c00 to enable debug.
 Once a tx hang occurs, please send me the full dmesg log.
 
 Does the tx hang occur with the in-kernel e1000e driver too?

I tried several drivers, including the latest RHEL5, the latest Intel, and the
latest RHEL6 drivers; the issue is seen with all of them.

Thanks,
Joe 
 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N


Also, after the issue occurs, please capture the output of lspci -vvv (run as root).



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin
On 07/11/12 11:22, Dave, Tushar N wrote:
 Thanks for the info. I see that the hang occurs right when HW is processing the
 first TX descriptor with TSO.
 Would you be able to reproduce the issue with TSO off? Disable TSO with
 'ethtool -K ethx tso off'.
 Leave all debug enabled as it is; that will help us debug further if the issue
 occurs with TSO off.

Hi Tushar,

Thanks for your quick reply, but disabling TSO does not help with this issue:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on
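
When repeating this kind of test, it can help to check the offload state mechanically rather than by eye. The following is a rough sketch that parses text in the format shown above; parse_offloads is a hypothetical helper, and the sample is copied from this message rather than produced by running ethtool.

```python
def parse_offloads(text):
    """Map each 'name: on|off' line of `ethtool -k` style output to a bool."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the header line and anything that is not a plain on/off setting.
        if sep and value in ("on", "off"):
            settings[name.strip()] = value == "on"
    return settings


SAMPLE = """Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on
"""

if __name__ == "__main__":
    offloads = parse_offloads(SAMPLE)
    print("TSO enabled:", offloads["tcp segmentation offload"])
    print("GSO enabled:", offloads["generic segmentation offload"])
```

In the output above, TSO is off while generic segmentation offload is still on.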

kernel log after disable tso:

e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  1
  TDT                  4
  next_to_use          4
  next_to_clean        1
buffer_info[next_to_clean]:
  time_stamp           103ae0aba
  next_to_watch        1
  jiffies              103ae16a0
  next_to_watch.status 0
MAC Status             80387
PHY Status             792d
PHY 1000BASE-T Status  3c00
PHY Extended Status    3000
PCI Status             10
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  1
  TDT                  4
  next_to_use          4
  next_to_clean        1
buffer_info[next_to_clean]:
  time_stamp           103ae0aba
  next_to_watch        1
  jiffies              103ae2640
  next_to_watch.status 0
MAC Status             80387
PHY Status             792d
PHY 1000BASE-T Status  3c00
PHY Extended Status    3000
PCI Status             10
e1000e :05:00.0: Net device Info
e1000e: Device Name     state    trans_start      last_rx
e1000e: eth0            0003     000103AE128A
e1000e :05:00.0: Register Dump
e1000e:  Register Name   Value
e1000e: CTRL            180c0241
e1000e: STATUS          00080387
e1000e: CTRL_EXT        181400c0
e1000e: ICR             0040
e1000e: RCTL            04048002
e1000e: RDLEN           1000
e1000e: RDH             0090
e1000e: RDT             0080
e1000e: RDTR            0020
e1000e: RXDCTL[0-1]     01040420 01040420
e1000e: ERT
e1000e: RDBAL           23852000
e1000e: RDBAH           000c
e1000e: RDFH            075a
e1000e: RDFT            0752
e1000e: RDFHS           0758
e1000e: RDFTS           0752
e1000e: RDFPC           01b4
e1000e: TCTL            3003f00a
e1000e: TDBAL           1210c000
e1000e: TDBAH           000c
e1000e: TDLEN           1000
e1000e: TDH             0001
e1000e: TDT             0004
e1000e: TIDV            0008
e1000e: TXDCTL[0-1]     0145011f 0145011f
e1000e: TADV            0020
e1000e: TARC[0-1]       07a00403 07400403
e1000e: TDFH            1308
e1000e: TDFT            1308
e1000e: TDFHS           1308
e1000e: TDFTS           1308
e1000e: TDFPC
e1000e :05:00.0: Tx Ring Summary
e1000e: Queue [NTU] [NTC] [bi(ntc)->dma  ] leng ntw timestamp
e1000e:  0 4 1 000620800C02 002A   1 000103AE0ABA
e1000e :05:00.0: Tx Ring Dump
e1000e: Tl[desc] [address 63:0  ] [SpeCssSCmCsLen] [bi->dma   ] leng  ntw timestamp bi->skb <-- Legacy format
e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi->dma   ] leng  ntw timestamp bi->skb <-- Ext Context format
e1000e: Td[desc] [address 63:0  ] [VlaPoRSCm1Dlen] [bi->dma   ] leng  ntw timestamp bi->skb <-- Ext Data format
e1000e: Tl[0x000] 000C1AA0F002 8B2A  002A 0  (null)
e1000e: Tl[0x001] 000620800C02 8B2A 000620800C02 002A 1 000103AE0ABA 88061c6b6980 NTC
e1000e: Tl[0x002] 00061E6DBC02 8B2A 00061E6DBC02 002A 2 000103AE0EA2 88061c6b6880
e1000e: Tl[0x003] 000620A6C402 8B2A 000620A6C402 002A 3 000103AE128A 8806230b4080
e1000e: Tl[0x004] 0  (null) NTU
e1000e: Tl[0x005] 0  (null)
e1000e: Tl[0x006] 0  (null)
e1000e: Tl[0x007] 0  (null)
e1000e: Tl[0x008] 0  (null)
e1000e: Tl[0x009] 0  (null)
e1000e: Tl[0x00A] 0  (null)
e1000e: Tl[0x00B] 0  (null)
e1000e: Tl[0x00C] 0  (null)
e1000e: Tl[0x00D]
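
The head/tail and timestamp values in the dump above can be turned into more readable numbers. This is a minimal sketch, not driver code: it assumes 16-byte Tx descriptors, that TDLEN is the ring length in bytes, and that the register values are hexadecimal with leading zeros dropped by the archive. The function names are invented for illustration, and jiffies wrap-around is ignored.

```python
TX_DESC_SIZE = 16  # bytes per Tx descriptor (assumed for legacy and extended formats)


def ring_entries(tdlen):
    """Number of descriptors in a ring whose TDLEN register holds `tdlen` bytes."""
    return tdlen // TX_DESC_SIZE


def pending_descriptors(tdh, tdt, tdlen):
    """Descriptors between head and tail, i.e. still owned by hardware."""
    return (tdt - tdh) % ring_entries(tdlen)


def jiffies_since(time_stamp, jiffies):
    """Jiffies elapsed since the buffer was time-stamped (no wrap handling)."""
    return jiffies - time_stamp


if __name__ == "__main__":
    print(ring_entries(0x1000))                      # 256
    print(pending_descriptors(0x1, 0x4, 0x1000))     # 3
    print(jiffies_since(0x103AE0ABA, 0x103AE16A0))   # 3046
    print(jiffies_since(0x103AE0ABA, 0x103AE2640))   # 7046
```

So three descriptors (0x001 to 0x003) were queued and none completed. If HZ is 1000 (an assumption; the kernel configuration is not shown in this thread), the two hang reports came roughly 3 and 7 seconds after descriptor 0x001 was time-stamped.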

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N
-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 8:29 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 11:22, Dave, Tushar N wrote:
 Thanks for the info. I see that the hang occurs right when HW is processing the
 first TX descriptor with TSO.
 Would you be able to reproduce the issue with TSO off? Disable TSO with
 'ethtool -K ethx tso off'.
 Leave all debug enabled as it is; that will help us debug further if the issue
 occurs with TSO off.

Hi Tushar,

Thanks for your quick reply, but disabling TSO does not help with this issue:

Thanks for running a quick test. I don't find anything obviously wrong in the
descriptor dump.

When you said you had this issue with the RHEL5 and RHEL6 drivers, did you
install a RHEL5/6 kernel and reproduce it there? If so, I think I should install
RHEL6 and try to reproduce it locally!

-Tushar




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2012-07-09 Thread Eric Dumazet
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
 Hi list,
 
 I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when running
 an scp test. The issue is easy to reproduce on a SUN FIRE X2270 M2: copying
 a big file (500M) from another server hits it at once.
 
 Would you please help on this?
 

It's a known problem.

But apparently the Intel guys are not very responsive, as they have a
different patch from the following one:

http://permalink.gmane.org/gmane.linux.network/232669


We only have to wait for them to push their alternative patch, eventually.

In the meantime, you can use Hiroaki SHIMODA's patch; it works.




--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-11-03 Thread Flavio Leitner
(moving the discussion back to the list)

Hi,

I am sorry, I didn't receive your patch as we discussed in private
and ended up writing one patch myself which essentially does the
same thing.

The patch is available at:
https://bugzilla.redhat.com/show_bug.cgi?id=746272#c13

It schedules a workqueue to flush the descriptors 500ms after the
first packet is sent. This ensures that there will be a write-back
with enough time left before the watchdog treats the entry as old.

Time:   0 ms       x ms    y ms    ...  500 ms
Pkts:   pkt#1      pkt#2   pkt#3   ...  pkt#n   pkt(n+1)
Event:  schedule   -       -       ...  flush   schedule
        workqueue                               workqueue
   
The customer reported that it works, so IMHO the root cause is confirmed:
there are not enough packets to trigger the write-back, and writing to FPD
fixes it.

That patch will also flush every 500ms under high traffic, which
isn't good for performance, though it would be a flush of up to
4 descriptors as far as I understand.

I like Michael's approach of letting the watchdog detect the hang first
and then trying to flush.  Michael told me that we could flush and use the
interrupt raised when the write-back ends to clean up.  I think that if
there is a real TX hang (i.e. no interrupt event), it will take another
watchdog cycle to detect it, which seems to me too much time without
taking any action.

Maybe something like this would work:
1) watchdog detects the hang
2) check for FLAG2_DMA_BURST flag
3) if yes, force flush, set a bit flag in the TX ring and schedule
   watchdog with a short period
4) if the TXDW interrupt happens, clean up and reset the bit flag.
5) if not, the watchdog will expire with that bit flag still set,
   and it will then take action assuming a real hang has occurred.
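
The five steps above can be sketched as a small user-space model. This is
only an illustration of the proposed control flow, not driver code: the
struct, the enum and both function names are hypothetical, and the real
flush and watchdog rescheduling are reduced to a counter and a return value.

```c
#include <stdbool.h>

/* Hypothetical model of the TX ring state; not the driver's struct. */
struct tx_ring_model {
    bool dma_burst;      /* stands in for FLAG2_DMA_BURST */
    bool flush_pending;  /* the proposed "bit flag" in the TX ring */
    int  flushes;        /* how many forced flushes were issued */
};

enum wd_action { WD_FLUSH_AND_RECHECK, WD_REAL_HANG };

/* Steps 1-3 and 5: called when the watchdog sees a stale descriptor. */
enum wd_action watchdog_stale_entry(struct tx_ring_model *r)
{
    if (r->dma_burst && !r->flush_pending) {
        r->flushes++;                /* force the descriptor write-back */
        r->flush_pending = true;     /* remember that a flush was tried */
        return WD_FLUSH_AND_RECHECK; /* caller re-arms with a short period */
    }
    /* Either burst mode is off or a flush was already tried. */
    return WD_REAL_HANG;
}

/* Step 4: the TXDW interrupt arrived, so the write-back happened. */
void txdw_interrupt(struct tx_ring_model *r)
{
    r->flush_pending = false;
}
```

With this shape, a descriptor that was only missing its write-back costs one
extra short watchdog period, while a real hang is still reported on the
second expiry.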

thanks,
fbl

On Wed, 26 Oct 2011 17:27:04 +0800
Michael Wang wang...@linux.vnet.ibm.com wrote:

 Hi, Flavio, Jesse
 
 I have sent out the patch, which I hope will be of some help.
 
 Since this is my first time sending a patch, I apologize if
 I have done something silly.
 
 Please tell me if there are any problems with it.
 
 Thanks & Best regards,
 Michael Wang
 




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-26 Thread Michael Wang
Hi, Flavio, Jesse

I have sent out the patch, which I hope will be of some help.

Since this is my first time sending a patch, I apologize if
I have done something silly.

Please tell me if there are any problems with it.

Thanks & Best regards,
Michael Wang



Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Michael Wang
On 10/25/2011 12:26 AM, Flavio Leitner wrote:
 On Mon, 24 Oct 2011 16:26:28 +0800
 Michael Wangwang...@linux.vnet.ibm.com  wrote:

 On 10/21/2011 10:03 PM, Flavio Leitner wrote:
 On Fri, 21 Oct 2011 14:15:12 +0800
 Michael Wangwang...@linux.vnet.ibm.com   wrote:

 On 10/19/2011 08:16 PM, Flavio Leitner wrote:
 On Wed, 19 Oct 2011 12:49:48 +0800
 wangyunwang...@linux.vnet.ibm.comwrote:

 Hi, Flavio

 I am new to join the community, work on e1000e driver currently,
 And I found a thing strange in this issue, please check below.

 Thanks,
 Michael Wang

 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburgjesse.brandeb...@intel.com wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitnerf...@redhat.com wrote:

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard about new events after had removed
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the 
 exact
 same hardware and I haven't reproduced the issue in-house yet with 
 another
 82571EB. See below about interface statistics from sar.
 Currently, if FLAG2_DMA_BURST setted, the device will pre-fetch the
 tx descriptor only when:

 1. the descriptor device cached is lower then 32.
 2. The descriptor host prepared is at least one.

 I don't think this will cause that issue, but another thing it done is to
 set the device to write-back the processed descriptor only when the
 amount reach 5(or 4).

 So may be when the device get a descriptor and processed, but the
 amount not reached 5, so it don't write-back it, but actually already
 transmitted.

 That could explain the issue and the fact that sometimes the hang
 info printed shows empty ring (write-back happened in the middle).

 But this will happen only when the transmit suddenly stopped for one
 second or more, I don't know whether this is the real traffic situation
 or not.

 At least for one customer the interface had almost no traffic.
 I will go over all the data again checking if this happens every time.


 And may be I am wrong about this, but also I think this may be the only
 reason cause this issue.

 I am seeing this based on the debugging output:

 This is the full output with debugging patch applied:
 Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit 
 Hang:
 Oct 11 02:03:52 kernel:   TDH25
 Oct 11 02:03:52 kernel:   TDT26
 Oct 11 02:03:52 kernel:   next_to_use26
 Oct 11 02:03:52 kernel:   next_to_clean25
 Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
 Oct 11 02:03:52 kernel:   time_stamp100b2aa22
 Oct 11 02:03:52 kernel:   next_to_watch25
 Oct 11 02:03:52 kernel:   jiffies100b2ab25
 Oct 11 02:03:52 kernel:   next_to_watch.status0
 Oct 11 02:03:52 kernel:   stored_i =25
 Oct 11 02:03:52 kernel:   stored_first =25
 Oct 11 02:03:52 kernel:   stamp =100b2aa22
 Oct 11 02:03:52 kernel:   factor =fa
 Oct 11 02:03:52 kernel:   last_clean =100b2aa1a
 Oct 11 02:03:52 kernel:   last_tx =100b2aa22
 Oct 11 02:03:52 kernel:   count =0/100
 Notice above that buffer_info time_stamp is the same as in
 last_tx (last time the xmit function was called), also that
 last_clean (last time the clean function was called) is before
 that.  Therefore, the system sent just one descriptor in about
 1 second confirming your idea.


 So have you try to use the Red Hat 6, is this problem still
 exist?

 Actually, I received few other reports that looks like to be same
 issue but with 6.2.  As far as I can tell, hardware that was working
 just fine started to show it after the kernel upgrade (coincidentally
 5.7 and 6.2 introduces FLAG2_DMA_BURST).  However, I haven't heard
 anything back since I had provided the instrumented kernel to confirm
 to you.  I will follow up as soon as I hear something.

 Assuming that your idea is true, the hang detection is broken because
 it's possible to have a descriptor apparently stuck that is just missing
 the write-back. So, is it possible to set a timer to write-back? If yes,
 it could expire and run before the hang detection period expires. Or
 perhaps force the write-back to happen before hang detection execution.


According to the code ew32(TIDV, adapter->tx_int_delay);, I think
such a timer is already set, but I don't know whether
tx_int_delay is at its default value, which is 8 (in units of 1.024 μs).

TIDV means that when the timer expires, the write-back is flushed,
forcibly.

The default value is far less than 1 second, so it cannot cause this
issue.
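
The arithmetic behind that comparison can be written out as follows. This is
only a sketch of the unit conversion, taking the 1.024 μs unit and the default
of 8 from the text above; the function names are made up for illustration, and
it says nothing about whether the TIDV timer actually triggers a write-back
when DMA burst mode is enabled.

```c
#include <stdint.h>

/* One TIDV unit is 1.024 us, i.e. 1024 ns (as stated above). */
#define TIDV_UNIT_NS 1024ULL

/* Convert a TIDV count to nanoseconds. */
uint64_t tidv_to_ns(uint32_t tidv_units)
{
    return (uint64_t)tidv_units * TIDV_UNIT_NS;
}

/* How many whole TIDV delays fit into a window; tidv_units must be non-zero. */
uint64_t tidv_periods_in(uint64_t window_ns, uint32_t tidv_units)
{
    return window_ns / tidv_to_ns(tidv_units);
}
```

For the default of 8 units the delay is 8192 ns, so more than a hundred
thousand such delays fit into a one-second hang-detection window.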

 Customer has a test system reproducing this with 5.7, we can test
 patches there if you like. Just let me know.

 thank you!
 fbl

Maybe you can just search for the macro
E1000_TXDCTL_DMA_BURST_ENABLE
in drivers/net/e1000e/e1000.h and change it to:

#define E1000_TXDCTL_DMA_BURST_ENABLE \
(E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
E1000_TXDCTL_COUNT_DESC | \
(0 << 16) | /* wthresh must be +1 more than desired */\
(1 << 8) | 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Jesse Brandeburg
On Mon, 24 Oct 2011 23:29:34 -0700
Michael Wang wang...@linux.vnet.ibm.com wrote:
 Maybe you can just search for the macro
 E1000_TXDCTL_DMA_BURST_ENABLE
 in drivers/net/e1000e/e1000.h and change it to:
 
 #define E1000_TXDCTL_DMA_BURST_ENABLE \
 (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
 E1000_TXDCTL_COUNT_DESC | \
 (0 << 16) | /* wthresh must be +1 more than desired */\
 (1 << 8) | /* hthresh */ \
 0x1f) /* pthresh */
 
 This will do the write-back even when only one descriptor has been
 processed. If that solves the problem, we can think about a good solution.

I can already tell you that this will fix the problem, but wthresh=1 is
more like the hardware default after reset I think.  Doing this will
prevent the bursting behavior that got us the performance improvement
this patch was made for, which is bad.
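
To make the field layout concrete, here is a small sketch of how the TXDCTL
value is composed. The two flag values and the 6-bit width of the wthresh
field are assumptions taken from the e1000e headers and datasheet of that
era and should be checked against your tree; the helper names are
hypothetical. With wthresh=5 the result matches the TXDCTL value 0145011f
shown in the register dump earlier in this archive.

```c
#include <stdint.h>

/* Assumed values; verify against the e1000e headers you are building. */
#define TXDCTL_GRAN        0x01000000u /* descriptor granularity */
#define TXDCTL_COUNT_DESC  0x00400000u /* count descriptors still to process */

/* Build a TXDCTL value from the three thresholds. */
uint32_t txdctl_burst(uint32_t wthresh, uint32_t hthresh, uint32_t pthresh)
{
    return TXDCTL_GRAN | TXDCTL_COUNT_DESC |
           (wthresh << 16) | (hthresh << 8) | pthresh;
}

/* Extract the write-back threshold (assumed 6 bits starting at bit 16). */
uint32_t txdctl_wthresh(uint32_t txdctl)
{
    return (txdctl >> 16) & 0x3fu;
}
```

Calling txdctl_burst(0, 1, 0x1f) gives the value that the modified macro
quoted above would program.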

That is why we are looking at a solution that likely involves two
flush writes via the flush partial descriptors bits.  Just do the bit
31 set in TIDV and RDTR twice in a row and then make sure it is write
flushed.
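
One possible reading of that sequence, modelled in user space. The register
file, the accessor functions and the choice of STATUS as the read-back
register are all hypothetical stand-ins; only "bit 31 in TIDV and RDTR, twice
in a row, then a write flush" comes from the description above. On real
hardware the flush bit is not expected to stay set the way this model stores
it.

```c
#include <stdint.h>

#define FPD_BIT (1u << 31)  /* "flush partial descriptors", bit 31 */

/* Hypothetical register file standing in for the mapped device. */
enum reg { REG_TIDV, REG_RDTR, REG_STATUS, REG_COUNT };

struct nic_model {
    uint32_t regs[REG_COUNT];
    unsigned writes;   /* number of posted register writes */
    unsigned reads;    /* reads, used here to flush posted writes */
};

void reg_write(struct nic_model *n, enum reg r, uint32_t v)
{
    n->regs[r] = v;
    n->writes++;
}

uint32_t reg_read(struct nic_model *n, enum reg r)
{
    n->reads++;
    return n->regs[r];
}

/* Set FPD in TIDV and RDTR twice in a row, then read a register back
 * so that the posted writes are pushed out to the device. */
void flush_partial_descriptors(struct nic_model *n, uint32_t tidv, uint32_t rdtr)
{
    for (int i = 0; i < 2; i++) {
        reg_write(n, REG_TIDV, tidv | FPD_BIT);
        reg_write(n, REG_RDTR, rdtr | FPD_BIT);
    }
    (void)reg_read(n, REG_STATUS);
}
```

The existing delay values are passed in and OR-ed with the flush bit so that
the flush does not change the configured interrupt delays.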

If you wish to implement that and give it a try that would be useful
information.  We haven't had time yet to get a full repro going.




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-25 Thread Michael Wang
On 10/25/2011 11:57 PM, Jesse Brandeburg wrote:
 On Mon, 24 Oct 2011 23:29:34 -0700
 Michael Wangwang...@linux.vnet.ibm.com  wrote:
 Maybe you can just search for the macro
 E1000_TXDCTL_DMA_BURST_ENABLE
 in drivers/net/e1000e/e1000.h and change it to:

 #define E1000_TXDCTL_DMA_BURST_ENABLE \
 (E1000_TXDCTL_GRAN | /* set descriptor granularity */ \
 E1000_TXDCTL_COUNT_DESC | \
 (0 << 16) | /* wthresh must be +1 more than desired */\
 (1 << 8) | /* hthresh */ \
 0x1f) /* pthresh */

 This will do the write-back even when only one descriptor has been
 processed. If that solves the problem, we can think about a good solution.
 I can already tell you that this will fix the problem, but wthresh=1 is
 more like the hardware default after reset I think.  Doing this will
 prevent the bursting behavior that got us the performance improvement
 this patch was made for, which is bad.

Hi, Jesse

I was confused by the code ew32(TIDV, adapter->tx_int_delay);
I think this will force a write-back flush every 8*1.024 μs by
default.

If that works, I don't know why wthresh = 5 would cause this issue, because
even when there are not enough descriptors (more than 4), the write-back will
still be done every 8*1.024 μs.

 That is why we are looking at a solution that likely involves two
 flush writes via the flush partial descriptors bits.  Just do the bit
 31 set in TIDV and RDTR twice in a row and then make sure it is write
 flushed.

 If you wish to implement that and give it a try that would be useful
 information.  We haven't had time yet to get a full repro going.

Apart from my confusion, I will still try to do this work, but I
really don't know whether this issue is caused by wthresh or not.

Thanks & Best regards
Michael Wang




Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-24 Thread Michael Wang
On 10/21/2011 10:03 PM, Flavio Leitner wrote:
 On Fri, 21 Oct 2011 14:15:12 +0800
 Michael Wangwang...@linux.vnet.ibm.com  wrote:

 On 10/19/2011 08:16 PM, Flavio Leitner wrote:
 On Wed, 19 Oct 2011 12:49:48 +0800
 wangyunwang...@linux.vnet.ibm.com   wrote:

 Hi, Flavio

 I am new to join the community, work on e1000e driver currently,
 And I found a thing strange in this issue, please check below.

 Thanks,
 Michael Wang

 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburgjesse.brandeb...@intel.comwrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitnerf...@redhat.comwrote:

 Hi,

 I got few reports so far that 82571EB models are having the
 Detected Hardware Unit Hang issue after upgrading the kernel.

 Further debugging with an instrumented kernel revealed that the
 socket buffer time stamp matches with the last time e1000_xmit_frame()
 was called. Also that the time stamp of e1000_clean_tx_irq() last run
 is prior to the one in socket buffer.

 However, ~1 second later, an interrupt is fired and the old entry
 is found. Sometimes, the scheduled print_hang_task dumps the
 information _after_ the old entry is sent (shows empty ring),
 indicating that the HW TX unit isn't really stuck and apparently
 just missed the signal to initiate the transmission.

 Order of events:
 (1) skb is pushed down
 (2) e1000_xmit_frame() is called
 (3) ring is filled with one entry
 (4) TDT is updated
 (5) nothing happens for little more than 1 second
 (6) interrupt is fired
 (7) e1000_clean_tx_irq() is called
 (8) finds the entry not ready with an old time stamp,
 schedules print_hang_task and stops the TX queue.
 (9) print_hang_task runs, dump the info but the old entry is now 
 sent
 (10) apparently the TX queue is back.
 Flavio, thanks for the detailed info, please be sure to supply us the
 bugzilla number.

 It was buried in the end of the first email:
 https://bugzilla.redhat.com/show_bug.cgi?id=746272

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard about new events after had removed
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
 same hardware and I haven't reproduced the issue in-house yet with another
 82571EB. See below about interface statistics from sar.

Currently, if FLAG2_DMA_BURST is set, the device will pre-fetch
tx descriptors only when:

1. the number of descriptors cached by the device is lower than 32.
2. the host has prepared at least one descriptor.

I don't think this causes the issue, but the flag also
sets the device to write back processed descriptors only when their
number reaches 5 (or 4).

So maybe the device fetched and processed a descriptor, but because the
count had not reached 5 it did not write it back, even though the packet was
actually already transmitted.
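
A toy model of that hypothesis, assuming the device simply withholds
write-back until the number of completed descriptors reaches the threshold.
The struct and function names are hypothetical, and the threshold is a
parameter because the exact value (5 or 4) is uncertain above.

```c
/* Hypothetical model: descriptors the device has finished but whose
 * status it has not yet written back to host memory. */
struct wb_model {
    unsigned wthresh;      /* write-back threshold */
    unsigned done_pending; /* transmitted, status not yet visible to host */
    unsigned written_back; /* status visible to the host */
};

/* Device finishes one descriptor; write back only at the threshold. */
void device_complete_one(struct wb_model *m)
{
    m->done_pending++;
    if (m->done_pending >= m->wthresh) {
        m->written_back += m->done_pending;
        m->done_pending = 0;
    }
}

/* Forced flush: write back whatever is pending. */
void device_flush(struct wb_model *m)
{
    m->written_back += m->done_pending;
    m->done_pending = 0;
}
```

With a threshold of 5 and a single packet sent, the host sees no completed
descriptor until something forces a flush, which is what the driver would
then interpret as a stuck entry.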

But this can happen only when transmission suddenly stops for one
second or more; I don't know whether that is the real traffic situation
or not.

I may be wrong about this, but I think it may be the only
cause of this issue.


 I don't think the sequence is quite what you said.  We are going to
 work with the hardware team to get a sequence that works right, and we
 should have a fix for you soon.
 Yeah, the sequence might not be exact, but gives us a good idea of
 what could be happening.

 There are two events right after another:

 Oct  9 05:45:23  kernel:   TDH48
 Oct  9 05:45:23  kernel:   TDT49
 Oct  9 05:45:23  kernel:   next_to_use49
 Oct  9 05:45:23  kernel:   next_to_clean48
 Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
 Oct  9 05:45:23  kernel:   time_stamp102338ca6
 Oct  9 05:45:23  kernel:   next_to_watch48
 Oct  9 05:45:23  kernel:   jiffies102338dc1
 Oct  9 05:45:23  kernel:   next_to_watch.status0
 Oct  9 05:45:23  kernel: MAC Status80383
 Oct  9 05:45:23  kernel: PHY Status792d
 Oct  9 05:45:23  kernel: PHY 1000BASE-T Status3800
 Oct  9 05:45:23  kernel: PHY Extended Status3000
 Oct  9 05:45:23  kernel: PCI Status10
 Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware 
 Unit Hang:
 Oct  9 05:51:54  kernel:   TDH55
 Oct  9 05:51:54  kernel:   TDT56
 Oct  9 05:51:54  kernel:   next_to_use56
 Oct  9 05:51:54  kernel:   next_to_clean55
 Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
 Oct  9 05:51:54  kernel:   time_stamp102350986
 Oct  9 05:51:54  kernel:   next_to_watch55
 Oct  9 05:51:54  kernel:   jiffies102350b07
 Oct  9 05:51:54  kernel:   next_to_watch.status0
 Oct  9 05:51:54  kernel: MAC Status80383
 Oct  9 05:51:54  kernel: PHY Status792d
 Oct  9 05:51:54  kernel: PHY 1000BASE-T Status3800
 Oct  9 05:51:54  kernel: PHY Extended Status3000
 Oct  9 05:51:54  kernel: PCI Status10

 I see the judgement of hang is:

 time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
 (adapter->tx_timeout_factor * HZ))

 which means the hang happened when 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-24 Thread Flavio Leitner
On Mon, 24 Oct 2011 16:26:28 +0800
Michael Wang wang...@linux.vnet.ibm.com wrote:

 On 10/21/2011 10:03 PM, Flavio Leitner wrote:
  On Fri, 21 Oct 2011 14:15:12 +0800
  Michael Wangwang...@linux.vnet.ibm.com  wrote:
 
  On 10/19/2011 08:16 PM, Flavio Leitner wrote:
  On Wed, 19 Oct 2011 12:49:48 +0800
  wangyunwang...@linux.vnet.ibm.com   wrote:
 
  Hi, Flavio
 
  I am new to join the community, work on e1000e driver currently,
  And I found a thing strange in this issue, please check below.
 
  Thanks,
  Michael Wang
 
  On 10/18/2011 10:42 PM, Flavio Leitner wrote:
  On Mon, 17 Oct 2011 11:48:22 -0700
  Jesse Brandeburgjesse.brandeb...@intel.comwrote:
 
  On Fri, 14 Oct 2011 10:04:26 -0700
  Flavio Leitnerf...@redhat.comwrote:
 
  TDH is probably not moving due to the writeback threshold settings in
  TXDCTL.  netperf UDP_RR test is likely a good way to test this.
 
  Yeah, makes sense. I haven't heard about new events after had removed
  the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the 
  exact
  same hardware and I haven't reproduced the issue in-house yet with 
  another
  82571EB. See below about interface statistics from sar.
 
 Currently, if FLAG2_DMA_BURST setted, the device will pre-fetch the
 tx descriptor only when:
 
 1. the descriptor device cached is lower then 32.
 2. The descriptor host prepared is at least one.
 
 I don't think this will cause that issue, but another thing it done is to
 set the device to write-back the processed descriptor only when the
 amount reach 5(or 4).
 
 So may be when the device get a descriptor and processed, but the
 amount not reached 5, so it don't write-back it, but actually already
 transmitted.


That could explain the issue and the fact that sometimes the hang
info printed shows empty ring (write-back happened in the middle).

 
 But this will happen only when the transmit suddenly stopped for one
 second or more, I don't know whether this is the real traffic situation
 or not.
 

At least for one customer the interface had almost no traffic.
I will go over all the data again checking if this happens every time.


 And may be I am wrong about this, but also I think this may be the only
 reason cause this issue.
 

I am seeing this based on the debugging output:

  This is the full output with debugging patch applied:
  Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit 
  Hang:
  Oct 11 02:03:52 kernel:   TDH25
  Oct 11 02:03:52 kernel:   TDT26
  Oct 11 02:03:52 kernel:   next_to_use26
  Oct 11 02:03:52 kernel:   next_to_clean25
  Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
  Oct 11 02:03:52 kernel:   time_stamp100b2aa22
  Oct 11 02:03:52 kernel:   next_to_watch25
  Oct 11 02:03:52 kernel:   jiffies100b2ab25
  Oct 11 02:03:52 kernel:   next_to_watch.status0
  Oct 11 02:03:52 kernel:   stored_i =25
  Oct 11 02:03:52 kernel:   stored_first =25
  Oct 11 02:03:52 kernel:   stamp =100b2aa22
  Oct 11 02:03:52 kernel:   factor =fa
  Oct 11 02:03:52 kernel:   last_clean =100b2aa1a
  Oct 11 02:03:52 kernel:   last_tx =100b2aa22
  Oct 11 02:03:52 kernel:   count =0/100

Notice above that buffer_info time_stamp is the same as in
last_tx (last time the xmit function was called), also that
last_clean (last time the clean function was called) is before
that.  Therefore, the system sent just one descriptor in about
1 second confirming your idea.


 So have you try to use the Red Hat 6, is this problem still
 exist?
 

Actually, I received a few other reports that look like the same
issue but with 6.2.  As far as I can tell, hardware that was working
just fine started to show it after the kernel upgrade (coincidentally,
5.7 and 6.2 introduce FLAG2_DMA_BURST).  However, I haven't heard
anything back since I provided the instrumented kernel, so I cannot
confirm this to you yet.  I will follow up as soon as I hear something.

Assuming that your idea is true, the hang detection is broken because
it's possible to have a descriptor apparently stuck that is just missing
the write-back. So, is it possible to set a timer to write-back? If yes,
it could expire and run before the hang detection period expires. Or
perhaps force the write-back to happen before hang detection execution.

The customer has a test system reproducing this with 5.7; we can test
patches there if you like. Just let me know.

thank you!
fbl


Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-21 Thread Michael Wang
On 10/19/2011 08:16 PM, Flavio Leitner wrote:
 On Wed, 19 Oct 2011 12:49:48 +0800
 wangyunwang...@linux.vnet.ibm.com  wrote:

 Hi, Flavio

 I am new to join the community, work on e1000e driver currently,
 And I found a thing strange in this issue, please check below.

 Thanks,
 Michael Wang

 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburgjesse.brandeb...@intel.com   wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitnerf...@redhat.com   wrote:

 Hi,

 I got few reports so far that 82571EB models are having the
 Detected Hardware Unit Hang issue after upgrading the kernel.

 Further debugging with an instrumented kernel revealed that the
 socket buffer time stamp matches with the last time e1000_xmit_frame()
 was called. Also that the time stamp of e1000_clean_tx_irq() last run
 is prior to the one in socket buffer.

 However, ~1 second later, an interrupt is fired and the old entry
 is found. Sometimes, the scheduled print_hang_task dumps the
 information _after_ the old entry is sent (shows empty ring),
 indicating that the HW TX unit isn't really stuck and apparently
 just missed the signal to initiate the transmission.

 Order of events:
(1) skb is pushed down
(2) e1000_xmit_frame() is called
(3) ring is filled with one entry
(4) TDT is updated
 (5) nothing happens for little more than 1 second
(6) interrupt is fired
(7) e1000_clean_tx_irq() is called
(8) finds the entry not ready with an old time stamp,
schedules print_hang_task and stops the TX queue.
(9) print_hang_task runs, dump the info but the old entry is now sent
 (10) apparently the TX queue is back.
 Flavio, thanks for the detailed info, please be sure to supply us the
 bugzilla number.

 It was buried in the end of the first email:
 https://bugzilla.redhat.com/show_bug.cgi?id=746272

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard about new events after had removed
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
 same hardware and I haven't reproduced the issue in-house yet with another
 82571EB. See below about interface statistics from sar.


 I don't think the sequence is quite what you said.  We are going to
 work with the hardware team to get a sequence that works right, and we
 should have a fix for you soon.
 Yeah, the sequence might not be exact, but gives us a good idea of
 what could be happening.

 There are two events right after another:

 Oct  9 05:45:23  kernel:   TDH48
 Oct  9 05:45:23  kernel:   TDT49
 Oct  9 05:45:23  kernel:   next_to_use49
 Oct  9 05:45:23  kernel:   next_to_clean48
 Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
 Oct  9 05:45:23  kernel:   time_stamp102338ca6
 Oct  9 05:45:23  kernel:   next_to_watch48
 Oct  9 05:45:23  kernel:   jiffies102338dc1
 Oct  9 05:45:23  kernel:   next_to_watch.status0
 Oct  9 05:45:23  kernel: MAC Status80383
 Oct  9 05:45:23  kernel: PHY Status792d
 Oct  9 05:45:23  kernel: PHY 1000BASE-T Status3800
 Oct  9 05:45:23  kernel: PHY Extended Status3000
 Oct  9 05:45:23  kernel: PCI Status10
 Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit 
 Hang:
 Oct  9 05:51:54  kernel:   TDH55
 Oct  9 05:51:54  kernel:   TDT56
 Oct  9 05:51:54  kernel:   next_to_use56
 Oct  9 05:51:54  kernel:   next_to_clean55
 Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
 Oct  9 05:51:54  kernel:   time_stamp102350986
 Oct  9 05:51:54  kernel:   next_to_watch55
 Oct  9 05:51:54  kernel:   jiffies102350b07
 Oct  9 05:51:54  kernel:   next_to_watch.status0
 Oct  9 05:51:54  kernel: MAC Status80383
 Oct  9 05:51:54  kernel: PHY Status792d
 Oct  9 05:51:54  kernel: PHY 1000BASE-T Status3800
 Oct  9 05:51:54  kernel: PHY Extended Status3000
 Oct  9 05:51:54  kernel: PCI Status10

 I see that the hang condition is:

 time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
 (adapter->tx_timeout_factor * HZ))

 which means a hang is reported when the current jiffies minus the buffer's
 time stamp exceeds
 (adapter->tx_timeout_factor * HZ).

 And I see that tx_timeout_factor will be at least 1, so on x86
 (jiffies - time_stamp) should be
 over 1000, but here it looks to be only around 300.

 Could you please check the HZ value of your platform?

 Sure, adapter->tx_timeout_factor * HZ = 0xfa (250 decimal).
 That data came from a customer using kernel-xen, so HZ is 250.
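
The check can be reproduced with the numbers from the two dumps above. This
is a user-space sketch: after64() imitates the wraparound-safe comparison of
the kernel's time_after() on 64-bit values, and both function names are made
up for the example. The deltas are 0x102338dc1 - 0x102338ca6 = 283 jiffies
and 0x102350b07 - 0x102350986 = 385 jiffies, which at HZ=250 is roughly 1.1
and 1.5 seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b", in the spirit of the kernel's
 * time_after(); modelled on 64-bit counters here. */
bool after64(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0;
}

/* The hang test quoted above, with HZ passed in as a parameter. */
bool tx_entry_timed_out(uint64_t jiffies, uint64_t time_stamp,
                        unsigned timeout_factor, unsigned hz)
{
    return after64(jiffies, time_stamp + (uint64_t)timeout_factor * hz);
}
```

Both events exceed the 250-jiffy limit with a factor of 1, whereas with
HZ=1000 neither would have been flagged, which explains why a delta of
around 300 looked too small at first.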

 Here is the debugging patch used:
 http://people.redhat.com/~fleitner/linux-kernel-test.patch

 The idea was to capture all the relevant values at the time
 of the problem. (The print_hang_task is scheduled and sometimes
 it shows timestamp=0, TDH=TDT because the packet is already sent)

 This is the full output with debugging patch applied:
 Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit 
 Hang:
 Oct 11 02:03:52 kernel:   TDH25
 Oct 11 02:03:52 kernel:   TDT26
 Oct 11 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-21 Thread Flavio Leitner
On Fri, 21 Oct 2011 14:15:12 +0800
Michael Wang wang...@linux.vnet.ibm.com wrote:

 On 10/19/2011 08:16 PM, Flavio Leitner wrote:
  On Wed, 19 Oct 2011 12:49:48 +0800
  wangyunwang...@linux.vnet.ibm.com  wrote:
 
  Hi, Flavio
 
  I am new to join the community, work on e1000e driver currently,
  And I found a thing strange in this issue, please check below.
 
  Thanks,
  Michael Wang
 
  On 10/18/2011 10:42 PM, Flavio Leitner wrote:
  On Mon, 17 Oct 2011 11:48:22 -0700
  Jesse Brandeburgjesse.brandeb...@intel.com   wrote:
 
  On Fri, 14 Oct 2011 10:04:26 -0700
  Flavio Leitnerf...@redhat.com   wrote:
 
  Hi,
 
  I got few reports so far that 82571EB models are having the
  Detected Hardware Unit Hang issue after upgrading the kernel.
 
  Further debugging with an instrumented kernel revealed that the
  socket buffer time stamp matches with the last time e1000_xmit_frame()
  was called. Also that the time stamp of e1000_clean_tx_irq() last run
  is prior to the one in socket buffer.
 
  However, ~1 second later, an interrupt is fired and the old entry
  is found. Sometimes, the scheduled print_hang_task dumps the
  information _after_ the old entry is sent (shows empty ring),
  indicating that the HW TX unit isn't really stuck and apparently
  just missed the signal to initiate the transmission.
 
  Order of events:
 (1) skb is pushed down
 (2) e1000_xmit_frame() is called
 (3) ring is filled with one entry
 (4) TDT is updated
  (5) nothing happens for little more than 1 second
 (6) interrupt is fired
 (7) e1000_clean_tx_irq() is called
 (8) finds the entry not ready with an old time stamp,
 schedules print_hang_task and stops the TX queue.
 (9) print_hang_task runs, dump the info but the old entry is now sent
  (10) apparently the TX queue is back.
  Flavio, thanks for the detailed info, please be sure to supply us the
  bugzilla number.
 
  It was buried in the end of the first email:
  https://bugzilla.redhat.com/show_bug.cgi?id=746272
 
  TDH is probably not moving due to the writeback threshold settings in
  TXDCTL.  netperf UDP_RR test is likely a good way to test this.
 
  Yeah, makes sense. I haven't heard about new events after had removed
  the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
  same hardware and I haven't reproduced the issue in-house yet with another
  82571EB. See below about interface statistics from sar.
 
 
  I don't think the sequence is quite what you said.  We are going to
  work with the hardware team to get a sequence that works right, and we
  should have a fix for you soon.
  Yeah, the sequence might not be exact, but gives us a good idea of
  what could be happening.
 
  There are two events right after another:
 
  Oct  9 05:45:23  kernel:   TDH                  48
  Oct  9 05:45:23  kernel:   TDT                  49
  Oct  9 05:45:23  kernel:   next_to_use          49
  Oct  9 05:45:23  kernel:   next_to_clean        48
  Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
  Oct  9 05:45:23  kernel:   time_stamp           102338ca6
  Oct  9 05:45:23  kernel:   next_to_watch        48
  Oct  9 05:45:23  kernel:   jiffies              102338dc1
  Oct  9 05:45:23  kernel:   next_to_watch.status 0
  Oct  9 05:45:23  kernel: MAC Status             80383
  Oct  9 05:45:23  kernel: PHY Status             792d
  Oct  9 05:45:23  kernel: PHY 1000BASE-T Status  3800
  Oct  9 05:45:23  kernel: PHY Extended Status    3000
  Oct  9 05:45:23  kernel: PCI Status             10
  Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
  Oct  9 05:51:54  kernel:   TDH                  55
  Oct  9 05:51:54  kernel:   TDT                  56
  Oct  9 05:51:54  kernel:   next_to_use          56
  Oct  9 05:51:54  kernel:   next_to_clean        55
  Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
  Oct  9 05:51:54  kernel:   time_stamp           102350986
  Oct  9 05:51:54  kernel:   next_to_watch        55
  Oct  9 05:51:54  kernel:   jiffies              102350b07
  Oct  9 05:51:54  kernel:   next_to_watch.status 0
  Oct  9 05:51:54  kernel: MAC Status             80383
  Oct  9 05:51:54  kernel: PHY Status             792d
  Oct  9 05:51:54  kernel: PHY 1000BASE-T Status  3800
  Oct  9 05:51:54  kernel: PHY Extended Status    3000
  Oct  9 05:51:54  kernel: PCI Status             10
 
  I see the judgement of hang is:
 
  time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
  (adapter->tx_timeout_factor * HZ))
 
  which means the hang happened when current jiffies minus buffer's time
  stamp is over
  (adapter->tx_timeout_factor * HZ).
 
  And I see the tx_timeout_factor will at least be 1, so on x86 the
  (jiffies-time_stamp) should
  over 1000, but here looks only around 300.
 
  Could you please check the HZ number of your platform?
 
  sure, adapter->tx_timeout_factor * HZ = 0xfa/250d
  That data came from a customer using kernel-xen, so HZ is 250.
 
  Here is the debugging patch used:
  http://people.redhat.com/~fleitner/linux-kernel-test.patch
 
  The idea was to capture all the relevant values at the time
  of the problem. (The print_hang_task is scheduled and sometimes
  it shows timestamp=0, TDH=TDT because the packet is already sent)
 
  

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-19 Thread Flavio Leitner
On Wed, 19 Oct 2011 12:49:48 +0800
wangyun wang...@linux.vnet.ibm.com wrote:

 Hi, Flavio
 
 I am new to join the community, work on e1000e driver currently,
 And I found a thing strange in this issue, please check below.
 
 Thanks,
 Michael Wang
 
 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
  On Mon, 17 Oct 2011 11:48:22 -0700
  Jesse Brandeburgjesse.brandeb...@intel.com  wrote:
 
  On Fri, 14 Oct 2011 10:04:26 -0700
  Flavio Leitnerf...@redhat.com  wrote:
 
  Hi,
 
  I got few reports so far that 82571EB models are having the
  Detected Hardware Unit Hang issue after upgrading the kernel.
 
  Further debugging with an instrumented kernel revealed that the
  socket buffer time stamp matches with the last time e1000_xmit_frame()
  was called. Also that the time stamp of e1000_clean_tx_irq() last run
  is prior to the one in socket buffer.
 
  However, ~1 second later, an interrupt is fired and the old entry
  is found. Sometimes, the scheduled print_hang_task dumps the
  information _after_ the old entry is sent (shows empty ring),
  indicating that the HW TX unit isn't really stuck and apparently
  just missed the signal to initiate the transmission.
 
  Order of events:
(1) skb is pushed down
(2) e1000_xmit_frame() is called
(3) ring is filled with one entry
(4) TDT is updated
  (5) nothing happens for little more than 1 second
(6) interrupt is fired
(7) e1000_clean_tx_irq() is called
(8) finds the entry not ready with an old time stamp,
schedules print_hang_task and stops the TX queue.
(9) print_hang_task runs, dump the info but the old entry is now sent
  (10) apparently the TX queue is back.
  Flavio, thanks for the detailed info, please be sure to supply us the
  bugzilla number.
 
  It was buried in the end of the first email:
  https://bugzilla.redhat.com/show_bug.cgi?id=746272
 
  TDH is probably not moving due to the writeback threshold settings in
  TXDCTL.  netperf UDP_RR test is likely a good way to test this.
 
  Yeah, makes sense. I haven't heard about new events after had removed
  the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
  same hardware and I haven't reproduced the issue in-house yet with another
  82571EB. See below about interface statistics from sar.
 
 
  I don't think the sequence is quite what you said.  We are going to
  work with the hardware team to get a sequence that works right, and we
  should have a fix for you soon.
  Yeah, the sequence might not be exact, but gives us a good idea of
  what could be happening.
 
  There are two events right after another:
 
  Oct  9 05:45:23  kernel:   TDH                  48
  Oct  9 05:45:23  kernel:   TDT                  49
  Oct  9 05:45:23  kernel:   next_to_use          49
  Oct  9 05:45:23  kernel:   next_to_clean        48
  Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
  Oct  9 05:45:23  kernel:   time_stamp           102338ca6
  Oct  9 05:45:23  kernel:   next_to_watch        48
  Oct  9 05:45:23  kernel:   jiffies              102338dc1
  Oct  9 05:45:23  kernel:   next_to_watch.status 0
  Oct  9 05:45:23  kernel: MAC Status             80383
  Oct  9 05:45:23  kernel: PHY Status             792d
  Oct  9 05:45:23  kernel: PHY 1000BASE-T Status  3800
  Oct  9 05:45:23  kernel: PHY Extended Status    3000
  Oct  9 05:45:23  kernel: PCI Status             10
  Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
  Oct  9 05:51:54  kernel:   TDH                  55
  Oct  9 05:51:54  kernel:   TDT                  56
  Oct  9 05:51:54  kernel:   next_to_use          56
  Oct  9 05:51:54  kernel:   next_to_clean        55
  Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
  Oct  9 05:51:54  kernel:   time_stamp           102350986
  Oct  9 05:51:54  kernel:   next_to_watch        55
  Oct  9 05:51:54  kernel:   jiffies              102350b07
  Oct  9 05:51:54  kernel:   next_to_watch.status 0
  Oct  9 05:51:54  kernel: MAC Status             80383
  Oct  9 05:51:54  kernel: PHY Status             792d
  Oct  9 05:51:54  kernel: PHY 1000BASE-T Status  3800
  Oct  9 05:51:54  kernel: PHY Extended Status    3000
  Oct  9 05:51:54  kernel: PCI Status             10
 
 I see the judgement of hang is:
 
 time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
 (adapter->tx_timeout_factor * HZ))
 
 which means the hang happened when current jiffies minus buffer's time 
 stamp is over
 (adapter->tx_timeout_factor * HZ).
 
 And I see the tx_timeout_factor will at least be 1, so on x86 the 
 (jiffies-time_stamp) should
 over 1000, but here looks only around 300.
 
 Could you please check the HZ number of your platform?
 

Sure: adapter->tx_timeout_factor * HZ = 0xfa (250 decimal).
That data came from a customer running kernel-xen, so HZ is 250.

Here is the debugging patch used:
http://people.redhat.com/~fleitner/linux-kernel-test.patch

The idea was to capture all the relevant values at the time
of the problem. (print_hang_task is scheduled rather than run
immediately, so it sometimes shows timestamp=0 and TDH=TDT because
the packet has already been sent.)
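
To make the arithmetic explicit, here is a minimal userspace sketch of the
hang condition applied to the two logged events. It is not the driver code:
jiffies_after() is only modelled on the kernel's time_after(), and
tx_entry_timed_out() and check_logged_events() are hypothetical helper names.
It assumes tx_timeout_factor is 1 and that the jiffies and time_stamp values
in the dump are hexadecimal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is later than b", modelled on the kernel's time_after(). */
static inline bool jiffies_after(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0;
}

/* Hypothetical helper: the hang condition from the thread, with HZ as a parameter. */
static inline bool tx_entry_timed_out(uint64_t now, uint64_t time_stamp,
				      unsigned int timeout_factor,
				      unsigned int hz)
{
	return jiffies_after(now, time_stamp + (uint64_t)timeout_factor * hz);
}

/* Replays the two logged events (assumed hex values, timeout factor of 1). */
static inline void check_logged_events(void)
{
	/* 05:45:23: 0x102338dc1 - 0x102338ca6 = 283 jiffies */
	assert(tx_entry_timed_out(0x102338dc1ULL, 0x102338ca6ULL, 1, 250));
	/* 05:51:54: 0x102350b07 - 0x102350986 = 385 jiffies */
	assert(tx_entry_timed_out(0x102350b07ULL, 0x102350986ULL, 1, 250));
	/* With HZ = 1000 neither delta would have crossed the threshold. */
	assert(!tx_entry_timed_out(0x102338dc1ULL, 0x102338ca6ULL, 1, 1000));
	assert(!tx_entry_timed_out(0x102350b07ULL, 0x102350986ULL, 1, 1000));
}
```

With HZ = 250 the deltas of 283 and 385 jiffies correspond to roughly 1.1 and
1.5 seconds, which is consistent with the "little more than 1 second" stall
described in the order of events.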

This is the full output with debugging patch applied:
Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct 11 02:03:52 kernel:   TDH

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-18 Thread Flavio Leitner
On Mon, 17 Oct 2011 11:48:22 -0700
Jesse Brandeburg jesse.brandeb...@intel.com wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitner f...@redhat.com wrote:
 
  
  Hi,
  
  I got few reports so far that 82571EB models are having the
  Detected Hardware Unit Hang issue after upgrading the kernel.
  
  Further debugging with an instrumented kernel revealed that the
  socket buffer time stamp matches with the last time e1000_xmit_frame()
  was called. Also that the time stamp of e1000_clean_tx_irq() last run
  is prior to the one in socket buffer.
  
  However, ~1 second later, an interrupt is fired and the old entry
  is found. Sometimes, the scheduled print_hang_task dumps the
  information _after_ the old entry is sent (shows empty ring),
  indicating that the HW TX unit isn't really stuck and apparently
  just missed the signal to initiate the transmission.
  
  Order of events:
   (1) skb is pushed down
   (2) e1000_xmit_frame() is called
   (3) ring is filled with one entry
   (4) TDT is updated
  (5) nothing happens for little more than 1 second
   (6) interrupt is fired
   (7) e1000_clean_tx_irq() is called
   (8) finds the entry not ready with an old time stamp,
   schedules print_hang_task and stops the TX queue.
   (9) print_hang_task runs, dump the info but the old entry is now sent
  (10) apparently the TX queue is back.
 
 Flavio, thanks for the detailed info, please be sure to supply us the
 bugzilla number.
 

It was buried at the end of the first email:
https://bugzilla.redhat.com/show_bug.cgi?id=746272

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.


Yeah, makes sense. I haven't heard of any new events since the
FLAG2_DMA_BURST flag was removed.  Unfortunately, I don't have access to the
exact same hardware, and I haven't yet reproduced the issue in-house with
another 82571EB. See below for interface statistics from sar.


 I don't think the sequence is quite what you said.  We are going to
 work with the hardware team to get a sequence that works right, and we
 should have a fix for you soon.

Yeah, the sequence might not be exact, but it gives us a good idea of
what could be happening.

Here are two consecutive events:

Oct  9 05:45:23  kernel:   TDH                  48
Oct  9 05:45:23  kernel:   TDT                  49
Oct  9 05:45:23  kernel:   next_to_use          49
Oct  9 05:45:23  kernel:   next_to_clean        48
Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
Oct  9 05:45:23  kernel:   time_stamp           102338ca6
Oct  9 05:45:23  kernel:   next_to_watch        48
Oct  9 05:45:23  kernel:   jiffies              102338dc1
Oct  9 05:45:23  kernel:   next_to_watch.status 0
Oct  9 05:45:23  kernel: MAC Status             80383
Oct  9 05:45:23  kernel: PHY Status             792d
Oct  9 05:45:23  kernel: PHY 1000BASE-T Status  3800
Oct  9 05:45:23  kernel: PHY Extended Status    3000
Oct  9 05:45:23  kernel: PCI Status             10
Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
Oct  9 05:51:54  kernel:   TDH                  55
Oct  9 05:51:54  kernel:   TDT                  56
Oct  9 05:51:54  kernel:   next_to_use          56
Oct  9 05:51:54  kernel:   next_to_clean        55
Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
Oct  9 05:51:54  kernel:   time_stamp           102350986
Oct  9 05:51:54  kernel:   next_to_watch        55
Oct  9 05:51:54  kernel:   jiffies              102350b07
Oct  9 05:51:54  kernel:   next_to_watch.status 0
Oct  9 05:51:54  kernel: MAC Status             80383
Oct  9 05:51:54  kernel: PHY Status             792d
Oct  9 05:51:54  kernel: PHY 1000BASE-T Status  3800
Oct  9 05:51:54  kernel: PHY Extended Status    3000
Oct  9 05:51:54  kernel: PCI Status             10
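
One way to read the two dumps: in both, TDT is exactly one ahead of TDH, so a
single descriptor had been handed to the hardware and not yet consumed. The
sketch below computes that count. tx_ring_pending() is a hypothetical helper,
the ring size of 256 is an assumption, and the index values are assumed to be
hexadecimal as printed.

```c
#include <stdint.h>

/*
 * Descriptors queued by software (tail) but not yet consumed by hardware
 * (head), i.e. (TDT - TDH) modulo the ring size.
 */
static inline uint32_t tx_ring_pending(uint32_t tdh, uint32_t tdt,
				       uint32_t ring_count)
{
	return (tdt >= tdh) ? tdt - tdh : ring_count - tdh + tdt;
}
```

For the logged values, tx_ring_pending(0x48, 0x49, 256) and
tx_ring_pending(0x55, 0x56, 256) both give 1, which fits an almost idle
interface where one lone packet sits in the ring.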

This is the sar report; the interface was idle.
00:00:01  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
05:40:01  eth7      1.13     0.03   944.69     4.14     0.00     0.00      0.87
05:50:01  eth7      1.25     0.03   952.37     4.13     0.00     0.00      0.87
06:00:01  eth7      1.14     0.03   947.26     4.14     0.00     0.00      0.87

00:00:01  IFACE  rxerr/s  txerr/s  coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
05:40:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00
05:50:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00
06:00:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00

ethtool -i eth7:
driver: e1000e
version: 1.3.10-k2
firmware-version: 5.12-2
bus-info: :22:00.1
22:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
22:00.1 0200: 8086:10bc (rev 06)
(the rest of the lspci is on the first email, 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-18 Thread wangyun
Hi, Flavio

I am new to the community and am currently working on the e1000e driver.
I found something strange in this issue; please see below.

Thanks,
Michael Wang

On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburgjesse.brandeb...@intel.com  wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitnerf...@redhat.com  wrote:

 Hi,

 I got few reports so far that 82571EB models are having the
 Detected Hardware Unit Hang issue after upgrading the kernel.

 Further debugging with an instrumented kernel revealed that the
 socket buffer time stamp matches with the last time e1000_xmit_frame()
 was called. Also that the time stamp of e1000_clean_tx_irq() last run
 is prior to the one in socket buffer.

 However, ~1 second later, an interrupt is fired and the old entry
 is found. Sometimes, the scheduled print_hang_task dumps the
 information _after_ the old entry is sent (shows empty ring),
 indicating that the HW TX unit isn't really stuck and apparently
 just missed the signal to initiate the transmission.

 Order of events:
   (1) skb is pushed down
   (2) e1000_xmit_frame() is called
   (3) ring is filled with one entry
   (4) TDT is updated
 (5) nothing happens for little more than 1 second
   (6) interrupt is fired
   (7) e1000_clean_tx_irq() is called
   (8) finds the entry not ready with an old time stamp,
   schedules print_hang_task and stops the TX queue.
   (9) print_hang_task runs, dump the info but the old entry is now sent
 (10) apparently the TX queue is back.
 Flavio, thanks for the detailed info, please be sure to supply us the
 bugzilla number.

 It was buried in the end of the first email:
 https://bugzilla.redhat.com/show_bug.cgi?id=746272

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard about new events after had removed
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
 same hardware and I haven't reproduced the issue in-house yet with another
 82571EB. See below about interface statistics from sar.


 I don't think the sequence is quite what you said.  We are going to
 work with the hardware team to get a sequence that works right, and we
 should have a fix for you soon.
 Yeah, the sequence might not be exact, but gives us a good idea of
 what could be happening.

 There are two events right after another:

 Oct  9 05:45:23  kernel:   TDH                  48
 Oct  9 05:45:23  kernel:   TDT                  49
 Oct  9 05:45:23  kernel:   next_to_use          49
 Oct  9 05:45:23  kernel:   next_to_clean        48
 Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
 Oct  9 05:45:23  kernel:   time_stamp           102338ca6
 Oct  9 05:45:23  kernel:   next_to_watch        48
 Oct  9 05:45:23  kernel:   jiffies              102338dc1
 Oct  9 05:45:23  kernel:   next_to_watch.status 0
 Oct  9 05:45:23  kernel: MAC Status             80383
 Oct  9 05:45:23  kernel: PHY Status             792d
 Oct  9 05:45:23  kernel: PHY 1000BASE-T Status  3800
 Oct  9 05:45:23  kernel: PHY Extended Status    3000
 Oct  9 05:45:23  kernel: PCI Status             10
 Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
 Oct  9 05:51:54  kernel:   TDH                  55
 Oct  9 05:51:54  kernel:   TDT                  56
 Oct  9 05:51:54  kernel:   next_to_use          56
 Oct  9 05:51:54  kernel:   next_to_clean        55
 Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
 Oct  9 05:51:54  kernel:   time_stamp           102350986
 Oct  9 05:51:54  kernel:   next_to_watch        55
 Oct  9 05:51:54  kernel:   jiffies              102350b07
 Oct  9 05:51:54  kernel:   next_to_watch.status 0
 Oct  9 05:51:54  kernel: MAC Status             80383
 Oct  9 05:51:54  kernel: PHY Status             792d
 Oct  9 05:51:54  kernel: PHY 1000BASE-T Status  3800
 Oct  9 05:51:54  kernel: PHY Extended Status    3000
 Oct  9 05:51:54  kernel: PCI Status             10

As I read it, the hang check is:

time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
           (adapter->tx_timeout_factor * HZ))

which means a hang is reported once the current jiffies value exceeds the
buffer's time stamp by more than (adapter->tx_timeout_factor * HZ).

tx_timeout_factor is at least 1, so on x86, assuming HZ is 1000,
(jiffies - time_stamp) should be over 1000, but here it is only around 300.

Could you please check the HZ value on your platform?

 This is the sar report, the interface was idling.
 00:00:01  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
 05:40:01  eth7      1.13     0.03   944.69     4.14     0.00     0.00      0.87
 05:50:01  eth7      1.25     0.03   952.37     4.13     0.00     0.00      0.87
 06:00:01  eth7      1.14     0.03   947.26     4.14     0.00     0.00      0.87

 00:00:01  IFACE  rxerr/s  txerr/s  coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
 05:40:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00      0.00
 05:50:01  eth7      0.00     0.00    0.00      0.00      0.00      0.00      0.00      0.00

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-17 Thread Jesse Brandeburg
On Fri, 14 Oct 2011 10:04:26 -0700
Flavio Leitner f...@redhat.com wrote:

 
 Hi,
 
 I got few reports so far that 82571EB models are having the
 Detected Hardware Unit Hang issue after upgrading the kernel.
 
 Further debugging with an instrumented kernel revealed that the
 socket buffer time stamp matches with the last time e1000_xmit_frame()
 was called. Also that the time stamp of e1000_clean_tx_irq() last run
 is prior to the one in socket buffer.
 
 However, ~1 second later, an interrupt is fired and the old entry
 is found. Sometimes, the scheduled print_hang_task dumps the
 information _after_ the old entry is sent (shows empty ring),
 indicating that the HW TX unit isn't really stuck and apparently
 just missed the signal to initiate the transmission.
 
 Order of events:
  (1) skb is pushed down
  (2) e1000_xmit_frame() is called
  (3) ring is filled with one entry
  (4) TDT is updated
 (5) nothing happens for little more than 1 second
  (6) interrupt is fired
  (7) e1000_clean_tx_irq() is called
  (8) finds the entry not ready with an old time stamp,
  schedules print_hang_task and stops the TX queue.
  (9) print_hang_task runs, dump the info but the old entry is now sent
 (10) apparently the TX queue is back.

Flavio, thanks for the detailed info, please be sure to supply us the
bugzilla number.

TDH is probably not moving due to the writeback threshold settings in
TXDCTL.  netperf UDP_RR test is likely a good way to test this.

I don't think the sequence is quite what you said.  We are going to
work with the hardware team to get a sequence that works right, and we
should have a fix for you soon.
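
For readers following along, a rough sketch of why a writeback threshold can
leave a lone descriptor looking stuck. The field positions below are assumed
from the 8257x TXDCTL register description and should be checked against the
datasheet. The deferral rule is a simplified model that assumes
descriptor-granularity thresholds and ignores the TIDV/TADV timers and
explicit flushes. The helper names and the sample value are illustrative, not
taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed TXDCTL field layout; verify against the 8257x datasheet. */
#define TXDCTL_PTHRESH_MASK  0x0000003Fu /* prefetch threshold, bits 5:0 */
#define TXDCTL_HTHRESH_MASK  0x00003F00u /* host threshold, bits 13:8 */
#define TXDCTL_WTHRESH_MASK  0x003F0000u /* writeback threshold, bits 21:16 */
#define TXDCTL_WTHRESH_SHIFT 16

/* Extract the writeback threshold field. */
static inline uint32_t txdctl_wthresh(uint32_t txdctl)
{
	return (txdctl & TXDCTL_WTHRESH_MASK) >> TXDCTL_WTHRESH_SHIFT;
}

/*
 * Simplified model: with a threshold above 1, fewer completed descriptors
 * than the threshold are not written back until something else (a timer
 * or a flush) forces it.
 */
static inline bool writeback_may_be_deferred(uint32_t txdctl,
					     uint32_t completed)
{
	uint32_t wthresh = txdctl_wthresh(txdctl);

	return wthresh > 1 && completed < wthresh;
}
```

Under this model, an illustrative register value of 0x0105011F (WTHRESH of 5)
with one completed descriptor would defer the writeback, which is the
situation an idle interface sending a single packet would hit.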

 
 The following commit seems to be related to the symptoms seen above:
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3a3b75860527a11ba5035c6aa576079245d09e2a
 
  From: Jesse Brandeburg jesse.brandeb...@intel.com
  Date: Wed, 29 Sep 2010 21:38:49 + (+)
  Subject: e1000e: use hardware writeback batching
  X-Git-Tag: v2.6.37-rc1~147^2~299
  X-Git-Url:
 http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3a3b75860527a11ba5035c6aa576079245d09e2a
  
 
  e1000e: use hardware writeback batching
 
  Most e1000e parts support batching writebacks.  The problem with this is
  that when some of the TADV or TIDV timers are not set, Tx can sit forever.
 
  This is solved in this patch with write flushes using the Flush Partial
  Descriptors (FPD) bit in TIDV and RDTR.
 
  This improves bus utilization and removes partial writes on e1000e,
  particularly from 82571 parts in S5500 chipset based machines.
 
  Only ES2LAN and 82571/2 parts are included in this optimization, to reduce
  testing load.
 
 We have modified the instrumented kernel to include the following patch
 disabling writeback batching feature to narrow down the problem:
 
 --- debug/drivers/net/e1000e/82571.c.orig  2011-10-11 14:00:44.0 -0300
 +++ debug/drivers/net/e1000e/82571.c   2011-10-11 15:02:51.0 -0300
 @@ -2028,8 +2028,7 @@ struct e1000_info e1000_82571_info = {
  | FLAG_RESET_OVERWRITES_LAA /* errata */
  | FLAG_TARC_SPEED_MODE_BIT /* errata */
  | FLAG_APME_CHECK_PORT_B,
 -  .flags2 = FLAG2_DISABLE_ASPM_L1 /* errata 13 */
 -| FLAG2_DMA_BURST,
 +  .flags2 = FLAG2_DISABLE_ASPM_L1, /* errata 13 */
.pba= 38,
.max_hw_frame_size  = DEFAULT_JUMBO,
 
 
 and the customer confirmed that the issue has disappeared since then.
 
 Board info:
 1e:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
 Controller (Copper) (rev 06)
 
 1e:00.0 0200: 8086:10bc (rev 06)
 Subsystem: 103c:704b
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
 Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
 TAbort-
 MAbort- SERR- PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin B routed to IRQ 224
 Region 0: Memory at fd4e (32-bit, non-prefetchable) [size=128K]
 Region 1: Memory at fd40 (32-bit, non-prefetchable) [size=512K]
 Region 2: I/O ports at 7000 [size=32]
 Capabilities: [c8] Power Management version 2
 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
 PME(D0+,D1-,D2-,D3hot+,D3cold+)
 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
 Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
 Address: fee0  Data: 4073
 Capabilities: [e0] Express (v1) Endpoint, MSI 00
 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns,
 L1 64us
 ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
 DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+
 Unsupported-