Re: 82571EB: Detected Hardware Unit Hang

2012-11-20 Thread Joe Jin

On 11/20/12 16:59, Dave, Tushar N wrote:
> Have you powered off the system completely after modifying the EEPROM? If
> not, please do so.

Hi Tushar,

That does not seem to work for me. Could you please check what is wrong with
my steps?

Original eeprom dump:

# ethtool -e eth3 | head -8
Offset  Values
------  ------
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
                    ^
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

# lspci -s :52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
<--snip-->
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, 
L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
^
<--snip-->
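
The two MaxPayload figures in the lspci output above are not the same kind of value: in PCIe terms, DevCap reports the largest payload the device supports, while DevCtl holds the value currently programmed. A minimal Python sketch, assuming the text layout quoted above (the helper name `max_payload_values` and the sample string are illustrative and not part of any tool), pulls both numbers out so they can be compared:

```python
import re

# Excerpt of the `lspci -vvv` output quoted above, with wrapped lines joined.
LSPCI_EXCERPT = """\
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
"""


def max_payload_values(lspci_text):
    """Return (DevCap, DevCtl) MaxPayload in bytes.

    Hypothetical helper: it assumes the first "MaxPayload N bytes" match
    belongs to DevCap and the second to DevCtl, as in the excerpt.
    """
    found = [int(n) for n in re.findall(r"MaxPayload (\d+) bytes", lspci_text)]
    if len(found) < 2:
        raise ValueError("expected DevCap and DevCtl MaxPayload entries")
    return found[0], found[1]


supported, programmed = max_payload_values(LSPCI_EXCERPT)
print(f"DevCap supports {supported} bytes, DevCtl is set to {programmed} bytes")
```

For this excerpt the device advertises 256 bytes but is currently programmed for 128 bytes.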

# ethtool eth3
Settings for eth3:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x0007 (7)
Link detected: yes

# ethtool -E eth3 magic 0x10a48086 offset 0x34 value 0xa7
# ethtool -e eth3 | head -8
Offset  Values
------  ------
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
                    ^ <== a6 --> a7
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

# reboot

# ethtool -e eth3 | head -8
Offset  Values
------  ------
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1 
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

# lspci -s :52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
<--snip-->
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, 
L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
^
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- 
TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 
<4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
<--snip-->

#  ethtool -E eth3 magic 0x10a48086 offset 0x35 value 0x17

# ethtool -e eth3 | head -8
Offset  Values
------  ------
0x0000  00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 17 03 84 83 07 00 00 03 c3 02 06
                       ^ <== 07 --> 17
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

# reboot

# ethtool -e eth3 | head -8
Offset  Values
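
To relate the two `ethtool -E` writes above to "bit 8 of word 0x1A", the byte pair at offsets 0x34/0x35 can be read as one 16-bit word. This is a minimal sketch, assuming the low byte is stored first (which matches reading `a6 07` as 0x07a6 elsewhere in this thread); the helper names are illustrative:

```python
def word_from_bytes(low_byte, high_byte):
    """Combine two EEPROM bytes into a 16-bit word (low byte first, assumed)."""
    return low_byte | (high_byte << 8)


def changed_bits(before, after):
    """List the bit positions (0..15) that differ between two words."""
    diff = before ^ after
    return [bit for bit in range(16) if diff & (1 << bit)]


original = word_from_bytes(0xA6, 0x07)      # bytes 0x34/0x35 in the original dump
after_first = word_from_bytes(0xA7, 0x07)   # after "offset 0x34 value 0xa7"
after_second = word_from_bytes(0xA6, 0x17)  # dump shown after "offset 0x35 value 0x17"

print(hex(original))                         # 0x7a6
print(changed_bits(original, after_first))   # [0]
print(changed_bits(original, after_second))  # [12]
print((original >> 8) & 1)                   # 1
```

Under that byte-order assumption, the first write toggled bit 0 and the second toggled bit 12; bit 8 was already 1 in the original dump and neither write changed it.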

RE: 82571EB: Detected Hardware Unit Hang

2012-11-20 Thread Dave, Tushar N

>-----Original Message-----
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Sunday, November 18, 2012 9:38 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 11/16/12 04:26, Dave, Tushar N wrote:
>>> Would you please help me find the offset of the max payload size in the
>>> EEPROM? I'd like to try modifying it with ethtool.
>>
>> It is defined using bit 8 of word 0x1A.
>> Bit value 0 = 128B , bit value 1 = 256B
>
>Hi Tushar,
>
>I checked one of my servers, where the Max Payload Size is 128:
>
># lspci -vvv -s 52:00.1
>52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>Controller (rev 06)
>Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
>Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>ParErr+ Stepping- SERR- FastB2B- DisINTx+
>Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
><TAbort- <MAbort- >SERR- <PERR- INTx-
>Latency: 0, Cache Line Size: 64 bytes
>Interrupt: pin B routed to IRQ 266
>Region 0: Memory at dfea (32-bit, non-prefetchable)
>[size=128K]
>Region 1: Memory at dfe8 (32-bit, non-prefetchable)
>[size=128K]
>Region 2: I/O ports at 6020 [size=32]
>[virtual] Expansion ROM at d812 [disabled] [size=128K]
>Capabilities: [c8] Power Management version 2
>Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
>,D3hot+,D3cold-)
>Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>Address: fee0  Data: 409a
>Capabilities: [e0] Express (v1) Endpoint, MSI 00
>DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
><512ns, L1 <64us
>ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
>Unsupported+
>RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>MaxPayload 128 bytes, MaxReadReq 4096 bytes
>DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr-
>TransPend-
>LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s,
>Latency L0 <4us, L1 <64us
>ClockPM- Surprise- LLActRep- BwNot-
>LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
>CommClk+
>ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
>DLActive- BWMgmt- ABWMgmt-
>Capabilities: [100 v1] Advanced Error Reporting
>UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+
>RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
>CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
>NonFatalErr-
>CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+
>NonFatalErr-
>AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap-
>ChkEn-
>Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-
>86
>Kernel driver in use: e1000e
>Kernel modules: e1000e
>
>And eeprom dump as below:
>
>Offset  Values
>------  ------
>0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
>0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
>0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
>0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
>0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
>0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>0x0060  00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff
>
>
>If I understand correctly, the value of word 0x1A is 0x07a6, so bit 8 is 1,
>but my NIC's MPS is 128 bytes. Am I missing something?

Have you powered off the system completely after modifying the EEPROM? If not,
please do so.
-Tushar 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 82571EB: Detected Hardware Unit Hang

2012-11-18 Thread Joe Jin

On 11/16/12 04:26, Dave, Tushar N wrote:
>> Would you please help me find the offset of the max payload size in the
>> EEPROM? I'd like to try modifying it with ethtool.
>
> It is defined using bit 8 of word 0x1A.
> Bit value 0 = 128B, bit value 1 = 256B

Hi Tushar,

I checked one of my servers, where the Max Payload Size is 128:

# lspci -vvv -s 52:00.1
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 266
Region 0: Memory at dfea (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at dfe8 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 6020 [size=32]
[virtual] Expansion ROM at d812 [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee0  Data: 409a
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns,
L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- 
TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0
<4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ 
RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-86
Kernel driver in use: e1000e
Kernel modules: e1000e

And eeprom dump as below:

Offset  Values
------  ------
0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1 
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0060  00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff 


If I understand correctly, the value of word 0x1A is 0x07a6, so bit 8 is 1,
but my NIC's MPS is 128 bytes. Am I missing something?

Thanks,
Joe
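
The reading of word 0x1A in the message above can be reproduced from the dump text. This is a sketch only, assuming 16 bytes per row starting at offset 0 and little-endian 16-bit words; `parse_dump` and `read_word` are illustrative names, not ethtool features:

```python
# First four rows of the EEPROM dump quoted in the message above.
DUMP = """\
0x0000  00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010  57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020  08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
"""


def parse_dump(text):
    """Collect the hex bytes of every row that starts with an 0x offset."""
    data = bytearray()
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("0x"):
            data.extend(int(field, 16) for field in fields[1:])
    return bytes(data)


def read_word(data, word_index):
    """Read a 16-bit word; word N is assumed to occupy bytes 2N and 2N+1."""
    return data[2 * word_index] | (data[2 * word_index + 1] << 8)


eeprom = parse_dump(DUMP)
word = read_word(eeprom, 0x1A)
bit8 = (word >> 8) & 1
print(hex(word), bit8)  # 0x7a6 1
# Mapping from the thread: bit value 0 = 128B, bit value 1 = 256B.
print("EEPROM setting:", "256B" if bit8 else "128B")
```

On this dump the bit decodes to 256B, the same figure lspci shows under DevCap, while DevCtl shows 128 bytes.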


RE: 82571EB: Detected Hardware Unit Hang

2012-11-15 Thread Dave, Tushar N

>-----Original Message-----
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, November 14, 2012 4:33 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 11/14/12 11:45, Dave, Tushar N wrote:
>>> -----Original Message-----
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Tuesday, November 13, 2012 6:48 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org; Mary Mcgrath
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 11/09/12 04:35, Dave, Tushar N wrote:
>>>> All devices in the path from the root complex to the 82571 should have the
>>>> *same* max payload size; otherwise it can cause a hang.
>>>> Can you double check this?
>>>
>>> Hi Tushar,
>>>
>>> I checked with the hardware vendor and they said there is no way to modify
>>> the max payload size from the BIOS. Can I modify it from the driver side?
>>
>> If you want to change the value for the 82571 device you can do it from the
>> EEPROM, but for other upstream devices I am not sure. I will check with my
>> team.
>
>Hi Tushar,
>
>Would you please help me find the offset of the max payload size in the
>EEPROM? I'd like to try modifying it with ethtool.

It is defined using bit 8 of word 0x1A.
Bit value 0 = 128B, bit value 1 = 256B

-Tushar
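
Since `ethtool -e` and `ethtool -E` address the EEPROM by byte offset, "bit 8 of word 0x1A" has to be translated into a byte offset and a bit within that byte before writing. A minimal sketch, assuming each 16-bit word is stored low byte first; the function name is hypothetical:

```python
from typing import Tuple


def eeprom_bit_location(word_index: int, bit: int) -> Tuple[int, int]:
    """Map (word index, bit 0..15) to (byte offset, mask within that byte).

    Assumes word N occupies bytes 2N (bits 0..7) and 2N+1 (bits 8..15).
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit must be in the range 0..15")
    byte_offset = word_index * 2 + bit // 8
    mask = 1 << (bit % 8)
    return byte_offset, mask


offset, mask = eeprom_bit_location(0x1A, 8)
print(hex(offset), hex(mask))  # 0x35 0x1
```

Under that assumption the bit lives in the byte at offset 0x35, as its least significant bit.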

Re: 82571EB: Detected Hardware Unit Hang

2012-11-14 Thread Joe Jin

On 11/14/12 11:45, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Tuesday, November 13, 2012 6:48 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org; Mary Mcgrath
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 11/09/12 04:35, Dave, Tushar N wrote:
>>> All devices in the path from the root complex to the 82571 should have the
>>> *same* max payload size; otherwise it can cause a hang.
>>> Can you double check this?
>>
>> Hi Tushar,
>>
>> I checked with the hardware vendor and they said there is no way to modify
>> the max payload size from the BIOS. Can I modify it from the driver side?
> 
> If you want to change value for 82571 device you can do it from eeprom but 
> for other upstream devices I am not sure. I will check with my team.

Hi Tushar,

Would you please help me find the offset of the max payload size in the eeprom?
I'd like to try modifying it with ethtool.

Thanks in advance,
Joe

RE: 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Tuesday, November 13, 2012 6:48 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 11/09/12 04:35, Dave, Tushar N wrote:
>> All devices in path from root complex to 82571, should have *same* max
>payload size otherwise it can cause hang.
>> Can you double check this?
>
>Hi Tushar,
>
>Checked with hardware vendor and they said no way to modify the max
>payload size from BIOS, can I modify it from driver side?

If you want to change value for 82571 device you can do it from eeprom but for 
other upstream devices I am not sure. I will check with my team.

-Tushar

RE: 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Dave, Tushar N

>-Original Message-
>From: Li Yu [mailto:raise.s...@gmail.com]
>Sent: Tuesday, November 13, 2012 7:37 PM
>To: Dave, Tushar N
>Cc: Joe Jin; e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 2012-11-09 04:35, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: netdev-ow...@vger.kernel.org
>>> [mailto:netdev-ow...@vger.kernel.org]
>>> On Behalf Of Joe Jin
>>> Sent: Wednesday, November 07, 2012 10:25 PM
>>> To: e1000-de...@lists.sf.net
>>> Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org; Mary
>>> Mcgrath
>>> Subject: 82571EB: Detected Hardware Unit Hang
>>>
>>> Hi list,
>>>
>>> IHAC reported "82571EB Detected Hardware Unit Hang" on HP ProLiant
>>> DL360 G6, and have to reboot the server to recover:
>>>
>>> e1000e :06:00.1: eth3: Detected Hardware Unit Hang:
>>>   TDH                  <1a>
>>>   TDT                  <1a>
>>>   next_to_use          <1a>
>>>   next_to_clean        <18>
>>> buffer_info[next_to_clean]:
>>>   time_stamp           <10047a74e>
>>>   next_to_watch        <18>
>>>   jiffies              <10047a88c>
>>>   next_to_watch.status <1>
>>> MAC Status             <80383>
>>> PHY Status             <792d>
>>> PHY 1000BASE-T Status  <3800>
>>> PHY Extended Status    <3000>
>>> PCI Status             <10>
>>>
>>> With newer kernel 2.0.0.1 the issue still reproducible.
>>>
>>> Device info:
>>> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (Copper) (rev 06)
>>> 06:00.1 0200: 8086:10bc (rev 06)
>>>
>>> I compared lspci output before and after the issue, different as below:
>>> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (Copper) (rev 06)
>>> Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
>>> Gigabit Server Adapter
>>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
>>> Stepping- SERR- FastB2B- DisINTx-
>>> -   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>> +   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>> +<TAbort- <MAbort- >SERR- <PERR- INTx+
>>
>> Are you sure this is not similar issue as before that you reported.
>> i.e.
>> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270
>>> M2, just copy a big file (>500M) from another server will hit it at
>once.
>>
>> All devices in path from root complex to 82571, should have *same* max
>payload size otherwise it can cause hang.
>> Can you double check this?
>>
>
>We also found such hang problem on 82599EB (ixgbe driver) in RHEL6.3
>kernel, we ever tried to upgrade to latest version (3.8.21 or 3.10.17),
>but it still happens.
>
>Is it probably also due to wrong "max payload size" set in BIOS?
>
It may or may not be. I suggest starting a separate thread for that issue, as
these two devices are significantly different.

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Li Yu

On 2012-11-09 04:35, Dave, Tushar N wrote:
>> -Original Message-
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of Joe Jin
>> Sent: Wednesday, November 07, 2012 10:25 PM
>> To: e1000-de...@lists.sf.net
>> Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org; Mary Mcgrath
>> Subject: 82571EB: Detected Hardware Unit Hang
>>
>> Hi list,
>>
>> IHAC reported "82571EB Detected Hardware Unit Hang" on HP ProLiant DL360
>> G6, and have to reboot the server to recover:
>>
>> e1000e :06:00.1: eth3: Detected Hardware Unit Hang:
>>   TDH                  <1a>
>>   TDT                  <1a>
>>   next_to_use          <1a>
>>   next_to_clean        <18>
>> buffer_info[next_to_clean]:
>>   time_stamp           <10047a74e>
>>   next_to_watch        <18>
>>   jiffies              <10047a88c>
>>   next_to_watch.status <1>
>> MAC Status             <80383>
>> PHY Status             <792d>
>> PHY 1000BASE-T Status  <3800>
>> PHY Extended Status    <3000>
>> PCI Status             <10>
>>
>> With newer kernel 2.0.0.1 the issue still reproducible.
>>
>> Device info:
>> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>> Controller (Copper) (rev 06)
>> 06:00.1 0200: 8086:10bc (rev 06)
>>
>> I compared lspci output before and after the issue, different as below:
>> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>> Controller (Copper) (rev 06)
>> Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
>> Gigabit Server Adapter
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
>> Stepping- SERR- FastB2B- DisINTx-
>> -   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>> +   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> +<TAbort- <MAbort- >SERR- <PERR- INTx+
>
> Are you sure this is not similar issue as before that you reported.
> i.e.
> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
>> just copy a big file (>500M) from another server will hit it at once.
>
> All devices in path from root complex to 82571, should have *same* max payload
> size otherwise it can cause hang.
> Can you double check this?

We also found such a hang on 82599EB (ixgbe driver) with the RHEL6.3
kernel. We tried upgrading to the latest version (3.8.21 or 3.10.17),
but it still happens.

Could it also be due to a wrong "max payload size" set in the BIOS?

Thanks

Yu

> -Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-11-13 Thread Joe Jin

On 11/09/12 04:35, Dave, Tushar N wrote:
> All devices in path from root complex to 82571, should have *same* max 
> payload size otherwise it can cause hang. 
> Can you double check this?

Hi Tushar,

I checked with the hardware vendor, and they said there is no way to modify the
max payload size from the BIOS. Can I modify it from the driver side?

Thanks,
Joe

Re: 82571EB: Detected Hardware Unit Hang

2012-11-08 Thread Joe Jin

On 11/09/12 04:35, Dave, Tushar N wrote:
> Are you sure this is not similar issue as before that you reported.
> i.e. 

Tushar,

Thanks for your quick response. I'll check with the customer whether they can
modify the max payload size from the BIOS; this time the issue hit an HP server.

Thanks again,
Joe

> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> > I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when 
>> > doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2, 
>> > just copy a big file (>500M) from another server will hit it at once.
> All devices in path from root complex to 82571, should have *same* max 
> payload size otherwise it can cause hang. 
> Can you double check this?
> 


-- 
Oracle <http://www.oracle.com>
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 

RE: 82571EB: Detected Hardware Unit Hang

2012-11-08 Thread Dave, Tushar N

>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>On Behalf Of Joe Jin
>Sent: Wednesday, November 07, 2012 10:25 PM
>To: e1000-de...@lists.sf.net
>Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org; Mary Mcgrath
>Subject: 82571EB: Detected Hardware Unit Hang
>
>Hi list,
>
>IHAC reported "82571EB Detected Hardware Unit Hang" on HP ProLiant DL360
>G6, and have to reboot the server to recover:
>
>e1000e :06:00.1: eth3: Detected Hardware Unit Hang:
>  TDH                  <1a>
>  TDT                  <1a>
>  next_to_use          <1a>
>  next_to_clean        <18>
>buffer_info[next_to_clean]:
>  time_stamp           <10047a74e>
>  next_to_watch        <18>
>  jiffies              <10047a88c>
>  next_to_watch.status <1>
>MAC Status             <80383>
>PHY Status             <792d>
>PHY 1000BASE-T Status  <3800>
>PHY Extended Status    <3000>
>PCI Status             <10>
>
>With newer kernel 2.0.0.1 the issue still reproducible.
>
>Device info:
>06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>Controller (Copper) (rev 06)
>06:00.1 0200: 8086:10bc (rev 06)
>
>I compared lspci output before and after the issue, different as below:
> 06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>Controller (Copper) (rev 06)
>   Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port
>Gigabit Server Adapter
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
>Stepping- SERR- FastB2B- DisINTx-
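>-  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
><TAbort- <MAbort- >SERR- <PERR- INTx-
>+  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>+<TAbort- <MAbort- >SERR- <PERR- INTx+

Are you sure this is not similar issue as before that you reported.
i.e. 
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when 
> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2, 
> just copy a big file (>500M) from another server will hit it at once.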

All devices in the path from the root complex to the 82571 should have the
*same* max payload size; otherwise it can cause a hang.
Can you double-check this?

-Tushar
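
One way to double-check the rule above is to compare the MaxPayload value programmed in each device's DevCtl register, as printed by lspci -vvv. The sketch below is not from the thread: the sample text only imitates the lspci layout, the bus addresses and values are made up, and the helper names programmed_mps and mismatched are hypothetical.

```python
import re

# Illustrative excerpt in the style of `lspci -vvv`; addresses and values are
# invented for this sketch.
SAMPLE = """\
00:07.0 PCI bridge: example root port
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable- Non-Fatal- Fatal- Unsupported-
\t\t\tMaxPayload 256 bytes, MaxReadReq 128 bytes
06:00.1 Ethernet controller: example 82571EB port
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
\t\t\tMaxPayload 128 bytes, MaxReadReq 4096 bytes
"""


def programmed_mps(lspci_text):
    """Map each device address to the MaxPayload value found under DevCtl."""
    result = {}
    device = None
    in_devctl = False
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device section.
            device = line.split()[0]
            in_devctl = False
            continue
        stripped = line.strip()
        if stripped.startswith("DevCtl:"):
            in_devctl = True
        elif re.match(r"[A-Za-z0-9]+:", stripped):
            # Any other "Name:" field (DevCap:, DevSta:, ...) ends DevCtl.
            in_devctl = False
        match = re.search(r"MaxPayload (\d+) bytes", stripped)
        if in_devctl and match and device is not None:
            result[device] = int(match.group(1))
    return result


def mismatched(path, mps):
    """Return devices on `path` whose MPS differs from the first device's."""
    expected = mps[path[0]]
    return [dev for dev in path if mps[dev] != expected]


mps = programmed_mps(SAMPLE)
print(mps)
print(mismatched(["00:07.0", "06:00.1"], mps))
```

The path list has to be supplied by hand here; on a real system it would be the chain of bridges between the root port and the NIC.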

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Jon Mason

On Mon, Jul 16, 2012 at 9:08 AM, Henrique de Moraes Holschuh
 wrote:
> On Mon, 16 Jul 2012, Ben Hutchings wrote:
>> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>> > On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>> > > Somehow setting max payload to 256 from BIOS does not set this value for 
>> > > all devices. I believe this is a BIOS bug.
>> >
>> > And preferably, Linux should complain about it.  Since we know it is going
>> > to cause problems, and since we know it does happen, we should be raising a
>> > ruckus about it in the kernel log (and probably fixing it to min(path) 
>> > while
>> > at it)...
>> >
>> > Is this something that should be raised as a feature request with the
>> > PCI/PCIe subsystem?
>>
>> The feature is there, but we ended up with:
>>
>> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
>> Author: Jon Mason 
>> Date:   Mon Oct 3 09:50:20 2011 -0500
>>
>> PCI: Disable MPS configuration by default
>>
>> But you are welcome to share use of the fixup_mpss_256() quirk.
>
> Meh.  I'd be happy with a warning if MPSS decreases when walking up to
> the tree root... i.e. a warning if any child has a MPSS larger than the
> parent.

You can add "pci=pcie_bus_safe" to the kernel params and it should
resolve your issue.

> --
>   "One disk to rule them all, One disk to find them. One disk to bring
>   them all and in the darkness grind them. In the Land of Redmond
>   where the shadows lie." -- The Silicon Valley Tarot
>   Henrique Holschuh
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
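
As a rough illustration of what a "safe" MPS policy amounts to: every device in a hierarchy is programmed with the largest payload size that all of them support, which is the minimum of their supported values. This is only a sketch of the idea, not the kernel's implementation; the device addresses, the values, and the helper name safe_mps are hypothetical.

```python
def safe_mps(supported_by_device):
    """Largest payload size that every device in one hierarchy can accept."""
    return min(supported_by_device.values())


# Hypothetical DevCap MaxPayload values (bytes) under a single root port.
supported = {"00:07.0": 256, "05:00.0": 128, "06:00.1": 256}

common = safe_mps(supported)
programmed = {device: common for device in supported}
print(programmed)
```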

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Jon Mason

On Mon, Jul 16, 2012 at 8:47 AM, Ben Hutchings
<bhutchi...@solarflare.com> wrote:
> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>> > Somehow setting max payload to 256 from BIOS does not set this value for 
>> > all devices. I believe this is a BIOS bug.
>>
>> And preferably, Linux should complain about it.  Since we know it is going
>> to cause problems, and since we know it does happen, we should be raising a
>> ruckus about it in the kernel log (and probably fixing it to min(path) while
>> at it)...
>>
>> Is this something that should be raised as a feature request with the
>> PCI/PCIe subsystem?
>
> The feature is there, but we ended up with:
>
> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
> Author: Jon Mason 
> Date:   Mon Oct 3 09:50:20 2011 -0500
>
> PCI: Disable MPS configuration by default

With the quirk, it should work now if pcie_bus_config is set to
PCIE_BUS_SAFE.  When that patch was pushed it was too late in the
release to fix it and see if there were any other ones out there (or
incur the wrath of Linus).  If you are brave enough, you can enable it
by default again and see if there are any other quirks out there. ;-)

> But you are welcome to share use of the fixup_mpss_256() quirk.
>
> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Henrique de Moraes Holschuh

On Mon, 16 Jul 2012, Ben Hutchings wrote:
> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
> > On Sun, 15 Jul 2012, Dave, Tushar N wrote:
> > > Somehow setting max payload to 256 from BIOS does not set this value for 
> > > all devices. I believe this is a BIOS bug.
> > 
> > And preferably, Linux should complain about it.  Since we know it is going
> > to cause problems, and since we know it does happen, we should be raising a
> > ruckus about it in the kernel log (and probably fixing it to min(path) while
> > at it)...
> > 
> > Is this something that should be raised as a feature request with the
> > PCI/PCIe subsystem?
> 
> The feature is there, but we ended up with:
> 
> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
> Author: Jon Mason 
> Date:   Mon Oct 3 09:50:20 2011 -0500
> 
> PCI: Disable MPS configuration by default
> 
> But you are welcome to share use of the fixup_mpss_256() quirk.

Meh.  I'd be happy with a warning if MPSS decreases when walking up to
the tree root... i.e. a warning if any child has a MPSS larger than the
parent.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
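
The warning proposed above could look roughly like the following walk over the device tree, which reports every child whose payload size is larger than its parent's. This is a sketch only: the topology, the values, and the helper name mps_warnings are hypothetical, and it says nothing about how the kernel would actually emit such a message.

```python
def mps_warnings(parent_of, mps):
    """List a warning for each child whose MPS is larger than its parent's."""
    warnings = []
    for child, parent in sorted(parent_of.items()):
        if parent is not None and mps[child] > mps[parent]:
            warnings.append(
                f"{child}: MPS {mps[child]} exceeds parent {parent} MPS {mps[parent]}"
            )
    return warnings


# Hypothetical topology: root port -> switch port -> NIC function.
parent_of = {"00:07.0": None, "05:00.0": "00:07.0", "06:00.1": "05:00.0"}
mps = {"00:07.0": 128, "05:00.0": 128, "06:00.1": 256}

for message in mps_warnings(parent_of, mps):
    print(message)
```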

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Ben Hutchings

On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
> > Somehow setting max payload to 256 from BIOS does not set this value for 
> > all devices. I believe this is a BIOS bug.
> 
> And preferably, Linux should complain about it.  Since we know it is going
> to cause problems, and since we know it does happen, we should be raising a
> ruckus about it in the kernel log (and probably fixing it to min(path) while
> at it)...
> 
> Is this something that should be raised as a feature request with the
> PCI/PCIe subsystem?

The feature is there, but we ended up with:

commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
Author: Jon Mason <ma...@myri.com>
Date:   Mon Oct 3 09:50:20 2011 -0500

PCI: Disable MPS configuration by default

But you are welcome to share use of the fixup_mpss_256() quirk.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Henrique de Moraes Holschuh

On Mon, 16 Jul 2012, Ben Hutchings wrote:
> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
> > On Sun, 15 Jul 2012, Dave, Tushar N wrote:
> > > Somehow setting max payload to 256 from BIOS does not set this value for
> > > all devices. I believe this is a BIOS bug.
> >
> > And preferably, Linux should complain about it.  Since we know it is going
> > to cause problems, and since we know it does happen, we should be raising a
> > ruckus about it in the kernel log (and probably fixing it to min(path) while
> > at it)...
> >
> > Is this something that should be raised as a feature request with the
> > PCI/PCIe subsystem?
>
> The feature is there, but we ended up with:
>
> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
> Author: Jon Mason <ma...@myri.com>
> Date:   Mon Oct 3 09:50:20 2011 -0500
>
> PCI: Disable MPS configuration by default
>
> But you are welcome to share use of the fixup_mpss_256() quirk.

Meh.  I'd be happy with a warning if MPSS decreases when walking up to
the tree root... i.e. a warning if any child has a MPSS larger than the
parent.
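
As a rough illustration of the check described above (this is not kernel code, and the tree layout, device names and payload sizes are made up for the example), a walk over a PCIe hierarchy that reports every child whose configured payload size is larger than its parent's could look like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PcieNode:
    """One device in a hypothetical PCIe tree (illustrative model only)."""
    name: str
    mps: int  # configured Max Payload Size in bytes
    children: List["PcieNode"] = field(default_factory=list)


def find_mps_warnings(node: PcieNode,
                      warnings: Optional[List[str]] = None) -> List[str]:
    """Collect a message for every child whose MPS exceeds its parent's."""
    if warnings is None:
        warnings = []
    for child in node.children:
        if child.mps > node.mps:
            warnings.append(
                f"{child.name}: MPS {child.mps} exceeds parent "
                f"{node.name} MPS {node.mps}"
            )
        find_mps_warnings(child, warnings)
    return warnings


# Hypothetical topology: the root port is at 128 bytes, everything below at 256.
tree = PcieNode("root-port", 128, [
    PcieNode("switch", 256, [PcieNode("82571EB", 256)]),
])

for message in find_mps_warnings(tree):
    print(message)
```

Only the parent/child pair that actually disagrees is reported, so the output points at the link where the mismatch starts.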

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Jon Mason

On Mon, Jul 16, 2012 at 8:47 AM, Ben Hutchings
<bhutchi...@solarflare.com> wrote:
> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>>> Somehow setting max payload to 256 from BIOS does not set this value for
>>> all devices. I believe this is a BIOS bug.
>>
>> And preferably, Linux should complain about it.  Since we know it is going
>> to cause problems, and since we know it does happen, we should be raising a
>> ruckus about it in the kernel log (and probably fixing it to min(path) while
>> at it)...
>>
>> Is this something that should be raised as a feature request with the
>> PCI/PCIe subsystem?
>
> The feature is there, but we ended up with:
>
> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
> Author: Jon Mason <ma...@myri.com>
> Date:   Mon Oct 3 09:50:20 2011 -0500
>
> PCI: Disable MPS configuration by default

With the quirk, it should work now if pcie_bus_config is set to
PCIE_BUS_SAFE.  When that patch was pushed it was too late in the
release to fix it and see if there were any other ones out there (or
incur the wrath of Linus).  If you are brave enough, you can enable it
by default again and see if there are any other quirks out there. ;-)
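
For readers unfamiliar with the "safe" policy mentioned above: the idea is to program every device with the largest payload size that all devices in the hierarchy support, i.e. the minimum of their advertised capabilities. The sketch below only illustrates that arithmetic, together with the 3-bit register encoding (size = 128 << field). The device names and capability values are hypothetical and nothing here is taken from the kernel source.

```python
from typing import Dict


def mps_field(size_bytes: int) -> int:
    """Encode a payload size as the 3-bit register field (size = 128 << field)."""
    is_power_of_two = size_bytes & (size_bytes - 1) == 0
    if not 128 <= size_bytes <= 4096 or not is_power_of_two:
        raise ValueError(f"invalid payload size: {size_bytes}")
    return size_bytes.bit_length() - 8


def safe_mps(supported: Dict[str, int]) -> int:
    """Largest payload size that every listed device supports."""
    if not supported:
        raise ValueError("no devices given")
    return min(supported.values())


# Hypothetical DevCap MaxPayload values (bytes) for devices in one hierarchy.
supported = {"root port": 256, "bridge": 128, "82571EB": 256}

common = safe_mps(supported)
print(f"common MPS: {common} bytes, register field value: {mps_field(common)}")
```

With these made-up capabilities the bridge limits the whole hierarchy, so every device would be set to 128 bytes.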

> But you are welcome to share use of the fixup_mpss_256() quirk.
>
> Ben.

Re: 82571EB: Detected Hardware Unit Hang

2012-07-16 Thread Jon Mason

On Mon, Jul 16, 2012 at 9:08 AM, Henrique de Moraes Holschuh
<h...@hmh.eng.br> wrote:
> On Mon, 16 Jul 2012, Ben Hutchings wrote:
>> On Sun, 2012-07-15 at 10:35 -0300, Henrique de Moraes Holschuh wrote:
>>> On Sun, 15 Jul 2012, Dave, Tushar N wrote:
>>>> Somehow setting max payload to 256 from BIOS does not set this value for
>>>> all devices. I believe this is a BIOS bug.
>>>
>>> And preferably, Linux should complain about it.  Since we know it is going
>>> to cause problems, and since we know it does happen, we should be raising a
>>> ruckus about it in the kernel log (and probably fixing it to min(path) while
>>> at it)...
>>>
>>> Is this something that should be raised as a feature request with the
>>> PCI/PCIe subsystem?
>>
>> The feature is there, but we ended up with:
>>
>> commit 5f39e6705faade2e89d119958a8c51b9b6e2c53c
>> Author: Jon Mason <ma...@myri.com>
>> Date:   Mon Oct 3 09:50:20 2011 -0500
>>
>> PCI: Disable MPS configuration by default
>>
>> But you are welcome to share use of the fixup_mpss_256() quirk.
>
> Meh.  I'd be happy with a warning if MPSS decreases when walking up to
> the tree root... i.e. a warning if any child has a MPSS larger than the
> parent.

You can add pci=pcie_bus_safe to the kernel params and it should
resolve your issue.
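
To confirm that a boot actually picked up the option, one possibility is to look for it in the kernel command line. The helper below is a small sketch that assumes the usual comma-separated form of the `pci=` parameter; the function names are invented for this example.

```python
from typing import List


def pci_options(cmdline: str) -> List[str]:
    """Return all comma-separated options passed via pci=... tokens."""
    options: List[str] = []
    for token in cmdline.split():
        if token.startswith("pci="):
            options.extend(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def has_pcie_bus_safe(cmdline: str) -> bool:
    """True if pcie_bus_safe was requested on the given command line."""
    return "pcie_bus_safe" in pci_options(cmdline)


# Example command line; the root= and ro values are placeholders.
example = "ro root=/dev/sda1 pci=nomsi,pcie_bus_safe quiet"
print(has_pcie_bus_safe(example))
```

On a running Linux system the same check can be applied to the contents of /proc/cmdline, for example `has_pcie_bus_safe(open("/proc/cmdline").read())`.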

Re: 82571EB: Detected Hardware Unit Hang

2012-07-15 Thread Henrique de Moraes Holschuh

On Sun, 15 Jul 2012, Dave, Tushar N wrote:
> Somehow setting max payload to 256 from BIOS does not set this value for all 
> devices. I believe this is a BIOS bug.

And preferably, Linux should complain about it.  Since we know it is going
to cause problems, and since we know it does happen, we should be raising a
ruckus about it in the kernel log (and probably fixing it to min(path) while
at it)...

Is this something that should be raised as a feature request with the
PCI/PCIe subsystem?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

Re: 82571EB: Detected Hardware Unit Hang

2012-07-14 Thread Joe Jin

On 07/15/12 11:42, Dave, Tushar N wrote:
>> -Original Message-
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Thursday, July 12, 2012 9:34 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/13/12 12:10, Dave, Tushar N wrote:
>>>> -Original Message-
>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>> Sent: Thursday, July 12, 2012 4:46 PM
>>>> To: Dave, Tushar N
>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>> ker...@vger.kernel.org
>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>
>>> Thanks for sending full dmesg log. I am still investigating. I think
>> this issue can occur if two PCIe link partner *i.e pcie bridge and pcie
>> device do not have same max payload size.
>>> I need 2 more info.
>>> 1) PBA number of the card.
>>
>> This is a remote server and I could not get this.
>>
>>> 2) full lspci -vvv output of entire system 'after you have changed max
>> payload size to 128'.
> 
> Somehow setting max payload to 256 from BIOS does not set this value for all 
> devices. I believe this is a BIOS bug.
> All devices in path from root complex to 82571, should have same max payload 
> size otherwise it can cause hang. When you set max payload to 128 from BIOS, 
> all device in path from root complex to 82571 got assigned same max payload 
> size. This resolves the issue.
> 
> I hope this helps.

Tushar,

Thanks a lot for your help; I will send this to the hardware engineer.

Regards,
Joe

RE: 82571EB: Detected Hardware Unit Hang

2012-07-14 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Thursday, July 12, 2012 9:34 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/13/12 12:10, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Thursday, July 12, 2012 4:46 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>> Thanks for sending full dmesg log. I am still investigating. I think
>this issue can occur if two PCIe link partner *i.e pcie bridge and pcie
>device do not have same max payload size.
>> I need 2 more info.
>> 1) PBA number of the card.
>
>This is a remote server and I could not get this.
>
>> 2) full lspci -vvv output of entire system 'after you have changed max
>payload size to 128'.

Somehow, setting max payload to 256 in the BIOS does not set this value for all
devices. I believe this is a BIOS bug.
All devices in the path from the root complex to the 82571 should have the same
max payload size; otherwise it can cause a hang. When you set max payload to
128 in the BIOS, all devices in the path from the root complex to the 82571 were
assigned the same max payload size. This resolves the issue.
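
One way to check this on a given machine is to pull the MaxPayload values out of `lspci -vvv` output and compare them. The parser below is a sketch under a few assumptions: it relies on the line layout seen in the dumps in this thread (a device header at column 0, a `DevCap:` line, and a `MaxPayload ... MaxReadReq` continuation line under `DevCtl:`), and the bridge entry in the sample text is hypothetical.

```python
import re
from typing import Dict

HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
DEVCAP = re.compile(r"DevCap:\s*MaxPayload (\d+) bytes")
DEVCTL = re.compile(r"MaxPayload (\d+) bytes, MaxReadReq")


def parse_max_payload(lspci_text: str) -> Dict[str, Dict[str, int]]:
    """Map each device address to its supported and configured MaxPayload."""
    result: Dict[str, Dict[str, int]] = {}
    current = None
    for line in lspci_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            result[current] = {}
            continue
        if current is None:
            continue
        cap = DEVCAP.search(line)
        if cap:
            result[current]["supported"] = int(cap.group(1))
            continue
        ctl = DEVCTL.search(line)
        if ctl:
            result[current]["configured"] = int(ctl.group(1))
    return result


# The 05:00.0 values come from the thread; the 00:07.0 bridge is made up.
SAMPLE = """\
00:07.0 PCI bridge: Example root port (hypothetical entry)
\tCapabilities: [90] Express (v2) Root Port (Slot+), MSI 00
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0
\t\tDevCtl:\tReport errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
\t\t\tMaxPayload 256 bytes, MaxReadReq 128 bytes
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller
\tCapabilities: [e0] Express (v1) Endpoint, MSI 00
\t\tDevCap:\tMaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
\t\tDevCtl:\tReport errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
\t\t\tMaxPayload 128 bytes, MaxReadReq 512 bytes
"""

devices = parse_max_payload(SAMPLE)
configured = {addr: info["configured"]
              for addr, info in devices.items() if "configured" in info}
if len(set(configured.values())) > 1:
    print("MaxPayload mismatch:", configured)
```

This compares every PCIe device in the text, which is stricter than needed; restricting the comparison to the devices on the path from the root complex to the NIC would require the bus topology (for example from `lspci -t`).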

I hope this helps.

-Tushar

RE: 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Thursday, July 12, 2012 4:46 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
Thanks for sending the full dmesg log. I am still investigating. I think this
issue can occur if two PCIe link partners (i.e. the PCIe bridge and the PCIe
device) do not have the same max payload size.
I need two more pieces of information:
1) The PBA number of the card.
2) Full lspci -vvv output of the entire system 'after you have changed max
payload size to 128'.

Thanks.

-Tushar

RE: 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Thursday, July 12, 2012 12:11 AM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 14:41, Dave, Tushar N wrote:
>>> On 07/12/12 13:57, Dave, Tushar N wrote:
>>>>> -Original Message-
>>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>>> Sent: Wednesday, July 11, 2012 8:13 PM
>>>>> To: Dave, Tushar N
>>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>>> ker...@vger.kernel.org
>>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>>
>>>>> On 07/12/12 11:07, Dave, Tushar N wrote:
>>>>>>> -Original Message-
>>>>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>>>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>>>>>> To: Dave, Tushar N
>>>>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>>>>> ker...@vger.kernel.org
>>>>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>>>>
>>>>>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>>>>>> What is the exact error messages in BIOS log?
>>>>>>>
>>>>>>> Error message from BIOS event log:
>>>>>>> 07/12/12 05:54:00
>>>>>>>PCI Express Non-Fatal Error
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joe
>>>>> Hi Tushar,
>>>>>
>>>>> Please find eeprom from attachment.
>>>>
>>>> Do you have lspci -vvv dump of entire system before and after issue
>>> occurs? If you have can you send it to me?
>>>>
>>>
>> Sorry but I meant the full lspci -vvv of *entire system* before and
>after issue occurs and not of 82571 only.
>>
>
>Before:
>===
>00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
>   Subsystem: Oracle Corporation Device 5352

Joe, thanks for all the data.
You said that you changed the max payload size and the issue stopped occurring.
How did you change it? Where did you make that change: in the BIOS, in the
EEPROM, or in PCIe config space?
Also, please send me the full dmesg of the entire system after you change the
max payload size.

Thanks.

RE: 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Dave, Tushar N

>On 07/12/12 13:57, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Wednesday, July 11, 2012 8:13 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 07/12/12 11:07, Dave, Tushar N wrote:
>>>>> -Original Message-
>>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>>>> To: Dave, Tushar N
>>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>>> ker...@vger.kernel.org
>>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>>
>>>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>>>> What is the exact error messages in BIOS log?
>>>>>
>>>>> Error message from BIOS event log:
>>>>> 07/12/12 05:54:00
>>>>>PCI Express Non-Fatal Error
>>>>>
>>>>> Thanks,
>>>>> Joe
>>> Hi Tushar,
>>>
>>> Please find eeprom from attachment.
>>
>> Do you have lspci -vvv dump of entire system before and after issue
>occurs? If you have can you send it to me?
>>
>
Sorry, but I meant the full lspci -vvv of the *entire system* before and after
the issue occurs, not of the 82571 only.

Re: 82571EB: Detected Hardware Unit Hang

2012-07-12 Thread Joe Jin

On 07/12/12 13:57, Dave, Tushar N wrote:
>> -Original Message-
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Wednesday, July 11, 2012 8:13 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/12/12 11:07, Dave, Tushar N wrote:
>>>> -Original Message-
>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>>> To: Dave, Tushar N
>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>> ker...@vger.kernel.org
>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>
>>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>>> What is the exact error messages in BIOS log?
>>>>
>>>> Error message from BIOS event log:
>>>> 07/12/12 05:54:00
>>>>    PCI Express Non-Fatal Error
>>>>
>>>> Thanks,
>>>> Joe
>> Hi Tushar,
>>
>> Please find eeprom from attachment.
>
> Do you have lspci -vvv dump of entire system before and after issue occurs?
> If you have can you send it to me?
>

Before:
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP 
Low Profile Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin B routed to IRQ 80
Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at dc00 [size=32]
Expansion ROM at fbda [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee33000  Data: 407c
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, 
L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ 
TransPend-
LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 
<4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
Kernel driver in use: e1000e
Kernel modules: e1000e


After:
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP 
Low Profile Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin B routed to IRQ 80
Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at dc00 [size=32]
Expansion ROM at fbda [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee33000  Data: 407c
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, July 11, 2012 8:13 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 11:07, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Wednesday, July 11, 2012 7:58 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>>> What is the exact error messages in BIOS log?
>>>
>>> Error message from BIOS event log:
>>> 07/12/12 05:54:00
>>>PCI Express Non-Fatal Error
>>>
>>> Thanks,
>>> Joe
>Hi Tushar,
>
>Please find eeprom from attachment.

Do you have an lspci -vvv dump of the entire system before and after the issue
occurs? If so, can you send it to me?

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 11:07, Dave, Tushar N wrote:
>> -Original Message-
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Wednesday, July 11, 2012 7:58 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/12/12 10:52, Dave, Tushar N wrote:
>>> What is the exact error messages in BIOS log?
>>
>> Error message from BIOS event log:
>> 07/12/12 05:54:00
>>PCI Express Non-Fatal Error
>>
>> Thanks,
>> Joe
> 
> Thanks.  Well, I will check with team tomorrow if this  (max payload size) 
> can be treated as solution to this issue. 
> We can know more about what exact non-fatal error occurred if we capture bus 
> trace.
> We should check the eeprom on this device to make sure they are up-to-date.
> Send me the full eeprom dump in a file and I will confirm with team that it 
> is up-to-date.
> Thanks for your work.
> 

Hi Tushar,

Please find the eeprom dump in the attachment.

Thanks a lot for your help,
Joe
Offset  Values
------  ------
0x0000  00 15 17 b9 77 9c 24 05 ff ff a2 50 ff ff ff ff 
0x0010  01 d9 04 97 2f 24 bc 11 8e 10 bc 10 86 80 65 b1 
0x0020  08 00 bc 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 1e 64 21 40 00 01 48 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0060  00 01 00 40 2a 12 07 40 00 01 00 40 ff ff ff ff 
0x0070  ff ff ff ff ff ff ff ff ff ff 97 01 ff ff 4b e8 
0x0080  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0090  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00a0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00b0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00c0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00d0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00e0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00f0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0100  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0110  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0120  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0130  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0140  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0150  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0160  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0170  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0180  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0190  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01a0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01b0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01c0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01d0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01e0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 6f 
0x01f0  87 04 00 00 00 00 00 00 00 00 00 00 ff ff ff 16 
0x0200  03 00 22 00 00 00 07 30 00 e5 49 00 df 61 15 34 
0x0210  00 36 81 2f 04 50 00 3b 15 34 00 36 04 60 00 49 
0x0220  15 34 00 36 04 70 00 9a 15 34 00 36 04 80 00 27 
0x0230  15 34 00 36 05 40 00 c1 47 02 00 14 00 10 04 24 
0x0240  00 e1 00 14 00 10 38 02 00 15 3f 04 5b 2f 3b 04 
0x0250  1b 00 32 04 87 00 3f 04 70 2f 30 04 a4 a8 3f 04 
0x0260  90 2f 30 04 c0 0e 3f 04 11 20 31 04 20 04 3f 04 
0x0270  00 00 20 04 40 01 3f 04 7a 18 1a 04 00 08 3f 04 
0x0280  30 1f 30 04 06 16 35 04 2a 01 3e 04 67 00 3f 04 
0x0290  54 1f 34 04 65 00 35 04 2a 00 36 04 2a 00 3f 04 
0x02a0  72 1f 32 04 b0 3f 36 04 ff c0 37 04 ec 1d 38 04 
0x02b0  ef f9 39 04 10 02 3c 06 00 0c 3f 04 95 18 35 04 
0x02c0  03 00 3f 04 96 17 36 04 08 00 3f 04 98 1f 38 04 
0x02d0  08 d0 3f 04 00 00 20 04 40 13 3f 04 5b 2f 3b 04 
0x02e0  18 90 32 04 00 00 3f 04 70 2f 30 04 e4 29 3f 04 
0x02f0  90 2f 30 04 c0 06 3f 04 11 20 31 04 00 04 30 04 
0x0300  b0 10 3f 04 b1 2f 31 04 24 8d 32 04 f0 f8 3f 04 
0x0310  dc 20 3c 04 00 00 3d 04 0a 00 3e 04 d3 00 3f 04 
0x0320  b4 28 34 04 ce 04 3f 04 00 00 20 04 40 13 69 53 
0x0330  e0 05 01 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0340  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0350  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0360  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0370  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0380  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0390  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
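
For readers who want to sanity-check a dump like the one above, here is a minimal Python sketch that parses `ethtool -e` style lines and reads the first six bytes as the MAC address. It assumes the usual Intel e1000-family layout in which the EEPROM starts with the Ethernet address. The helper names `parse_eeprom_dump` and `mac_from_eeprom` are made up for illustration, and only the first two dump lines are embedded (with the first offset written as 0x0000, which is an assumption about the garbled original).

```python
import re

# First two lines of the dump above. The offset of the first line is written
# as 0x0000 here; that is an assumption, the archived text shows only "0x".
DUMP = """\
0x0000  00 15 17 b9 77 9c 24 05 ff ff a2 50 ff ff ff ff
0x0010  01 d9 04 97 2f 24 bc 11 8e 10 bc 10 86 80 65 b1
"""


def parse_eeprom_dump(text):
    """Collect the data bytes from `ethtool -e` style hex lines."""
    data = bytearray()
    for line in text.splitlines():
        # An offset column ("0x...") followed by space-separated hex bytes.
        m = re.match(r"^0x[0-9a-fA-F]*\s+((?:[0-9a-fA-F]{2}\s*)+)$", line)
        if m:
            data.extend(int(b, 16) for b in m.group(1).split())
    return bytes(data)


def mac_from_eeprom(data):
    """Format the first six EEPROM bytes as a MAC address (assumed layout)."""
    return ":".join(f"{b:02x}" for b in data[:6])


eeprom = parse_eeprom_dump(DUMP)
print(mac_from_eeprom(eeprom))
```

The result, 00:15:17:b9:77:9c, lines up with the Device Serial Number 00-15-17-ff-ff-b9-77-9c that lspci reports for this adapter elsewhere in the thread.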

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, July 11, 2012 7:58 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 10:52, Dave, Tushar N wrote:
>> What is the exact error messages in BIOS log?
>
>Error message from BIOS event log:
>07/12/12 05:54:00
>PCI Express Non-Fatal Error
>
>Thanks,
>Joe

Thanks.  I will check with the team tomorrow whether this (max payload size) 
can be treated as a solution to this issue. 
We can learn more about exactly which non-fatal error occurred if we capture a 
bus trace.
We should also check the eeprom on this device to make sure it is up-to-date.
Send me the full eeprom dump in a file and I will confirm with the team that it 
is up-to-date.
Thanks for your work.

-Tushar
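
As background on the max payload size setting: in the PCIe Device Control register the Max_Payload_Size field sits in bits 7:5 and Max_Read_Request_Size in bits 14:12, and each field encodes 128 << n bytes. A rough sketch of that decoding follows. The register value 0x2810 is not read from hardware; it is reconstructed here from the DevCtl flags lspci printed for this port, and `decode_devctl` is a hypothetical helper name.

```python
def decode_devctl(devctl):
    """Return (max_payload_bytes, max_read_request_bytes) from Device Control."""
    mps_field = (devctl >> 5) & 0x7     # Max_Payload_Size, bits 7:5
    mrrs_field = (devctl >> 12) & 0x7   # Max_Read_Request_Size, bits 14:12
    return 128 << mps_field, 128 << mrrs_field


# Assumed value: RlxdOrd+ (bit 4), NoSnoop+ (bit 11), MPS field 0, MRRS field 2.
DEVCTL_EXAMPLE = 0x2810

mps, mrrs = decode_devctl(DEVCTL_EXAMPLE)
print(f"MaxPayload {mps} bytes, MaxReadReq {mrrs} bytes")
```

With the MPS field set to 1 instead of 0, the same helper reports 256 bytes, which is the largest payload the DevCap line says this device supports.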

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 10:52, Dave, Tushar N wrote:
> What is the exact error messages in BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
PCI Express Non-Fatal Error

Thanks,
Joe

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, July 11, 2012 7:23 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/12/12 02:51, Dave, Tushar N wrote:
>>
>> Joe,
>>
>> I see couple of errors in lspci output.
>> Device capability status register shows UnCorrectable PCIe error. This
>means there is certainly something went wrong. The only way to recover
>from Uncorrectable errors is reset.
>>
>>  DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+
>TransPend-
>>
>> Also AER sections in lspci output shows PCIe completion timeout.
>>
>>  Capabilities: [100 v1] Advanced Error Reporting
>>  UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>>
>> I suggest you should load AER driver and check for any error messages in
>log. Also please check any error message reported by system in BIOS log.
>Are there any machine check errors?
>>
>> When did you notice this issue? have 82571 ever been working before on
>this server?
>>
>> One more thing, Cache line size 256 is little unusual( I never seen this
>value before, mostly it's 64). Does BIOS settings have been changed? Are
>you using default BIOS setting?
>>
>
>I checked BIOS's log found the fault from the device, I changed "PCI-E
>Payload Size"
>from 256(default) to 128, now the device works.
>
>I compared lspci output found Address for data of MSI Capabilities's be
>changed:
>
>Old:
>Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>Address: fee21000  Data: 40cb
>
>New:
>Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>Address: fee24000  Data: 405c
>
>Mostly like it's a BIOS bug? please comments.
>
>Thanks,
>Joe

What are the exact error messages in the BIOS log?


Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 02:51, Dave, Tushar N wrote:
> 
> Joe,
> 
> I see couple of errors in lspci output.
> Device capability status register shows UnCorrectable PCIe error. This means 
> there is certainly something went wrong. The only way to recover from 
> Uncorrectable errors is reset.
>
>   DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
> 
> Also AER sections in lspci output shows PCIe completion timeout.
>   
>   Capabilities: [100 v1] Advanced Error Reporting
>   UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
> RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
> 
> I suggest you should load AER driver and check for any error messages in log. 
> Also please check any error message reported by system in BIOS log. Are there 
> any machine check errors? 
> 
> When did you notice this issue? have 82571 ever been working before on this 
> server?
> 
> One more thing, Cache line size 256 is little unusual( I never seen this 
> value before, mostly it's 64). Does BIOS settings have been changed? Are you 
> using default BIOS setting?
> 

I checked the BIOS log and found that the fault came from the device. I changed 
"PCI-E Payload Size" from 256 (default) to 128, and now the device works.

I compared the lspci output and found that the address and data of the MSI 
capability have changed:

Old:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee21000  Data: 40cb

New:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee24000  Data: 405c

Is this most likely a BIOS bug? Please comment.

Thanks,
Joe
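
One way to read the two MSI values: on x86 without interrupt remapping, the MSI address carries the destination APIC ID in bits 19:12 and the data word carries the interrupt vector in bits 7:0 and the delivery mode in bits 10:8. Whether that format applies to this server is an assumption, not something the thread confirms. A short sketch under that assumption, with `decode_msi` as an illustrative name:

```python
def decode_msi(address, data):
    """Split an x86 MSI address/data pair into its routing fields.

    Assumes the xAPIC compatibility format (no interrupt remapping).
    """
    return {
        "dest_apic_id": (address >> 12) & 0xFF,  # address bits 19:12
        "vector": data & 0xFF,                   # data bits 7:0
        "delivery_mode": (data >> 8) & 0x7,      # data bits 10:8, 0 = fixed
    }


old = decode_msi(0xFEE21000, 0x40CB)  # values from the "Old" lspci output
new = decode_msi(0xFEE24000, 0x405C)  # values from the "New" lspci output

for label, fields in (("old", old), ("new", new)):
    print(label, {k: hex(v) for k, v in fields.items()})
```

If that format applies here, the two boots differ only in target APIC ID (0x21 vs 0x24) and vector (0xcb vs 0x5c). The kernel assigns both when it sets up the interrupt, so the difference by itself would not necessarily point at the BIOS.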


RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Tuesday, July 10, 2012 10:03 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 12:05, Dave, Tushar N wrote:
>> When you said you had this issue with RHEL5 and RHEL6 drivers, have you
>install RHEl5/6 kernel and reproduced it? If so I think I should install
>RHEL6 and try reproduce it locally!
>>
>Yes I reproduced this on both RHEL5 and RHEL6.
>
>So far I tried to scp big file (~1GB) will hit it at once.
>
>Thanks,
>Joe

Joe,

I see a couple of errors in the lspci output.
The Device Status register shows an uncorrectable PCIe error. This means 
something certainly went wrong. The only way to recover from 
uncorrectable errors is a reset.

DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-

Also, the AER section of the lspci output shows a PCIe completion timeout.

Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-

I suggest you load the AER driver and check for any error messages in the log. 
Also, please check for any error messages the system reported in the BIOS log. 
Are there any machine check errors? 

When did you first notice this issue? Has the 82571 ever worked on this 
server?

One more thing: a cache line size of 256 is a little unusual (I have never seen 
this value before; mostly it's 64). Have the BIOS settings been changed? Are you 
using the default BIOS settings?

Thanks.

-Tushar
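
The `+`/`-` suffixes in the DevSta and UESta lines are easy to misread by eye, so here is a small Python sketch that lists which flags are set. The two input strings are copied from the lspci dump Joe posted in this thread, with the wrapped UESta line joined. `parse_flags` and `set_flags` are illustrative helper names, not part of any tool.

```python
import re

# Lines taken from the lspci -vvv output for 05:00.0 in this thread.
DEVSTA = "DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-"
UESTA = ("UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- "
         "RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-")


def parse_flags(line):
    """Return (register_name, {flag: bool}) for an lspci 'Name: A+ B-' line."""
    name, _, rest = line.partition(":")
    flags = {
        m.group(1): m.group(2) == "+"
        for m in re.finditer(r"(\w+)([+-])(?=\s|$)", rest)
    }
    return name.strip(), flags


def set_flags(line):
    """List only the flags marked '+', in the order lspci printed them."""
    _, flags = parse_flags(line)
    return [flag for flag, is_set in flags.items() if is_set]


print("DevSta:", set_flags(DEVSTA))
print("UESta: ", set_flags(UESTA))
```

For these inputs the set flags are UncorrErr, UnsuppReq and AuxPwr in DevSta (AuxPwr is a power status bit rather than an error) and CmpltTO, MalfTLP and UnsupReq in UESta.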


Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/11/12 15:50, Dave, Tushar N wrote:
> Device status and AER sections show some errors that looks little suspicious 
> to me but I'm not too sure. I will get back tomorrow.
> 

Thanks a lot, Tushar!

Joe


-- 
Oracle 
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 



RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, July 11, 2012 12:39 AM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 15:37, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Wednesday, July 11, 2012 12:18 AM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 07/11/12 15:11, Dave, Tushar N wrote:
>>>>> -Original Message-
>>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>>> Sent: Tuesday, July 10, 2012 10:03 PM
>>>>> To: Dave, Tushar N
>>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>>> ker...@vger.kernel.org
>>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>>
>>>>> On 07/11/12 12:05, Dave, Tushar N wrote:
>>>>>> When you said you had this issue with RHEL5 and RHEL6 drivers,
>>>>>> have you
>>>>> install RHEl5/6 kernel and reproduced it? If so I think I should
>>>>> install
>>>>> RHEL6 and try reproduce it locally!
>>>>>>
>>>>> Yes I reproduced this on both RHEL5 and RHEL6.
>>>>>
>>>>> So far I tried to scp big file (~1GB) will hit it at once.
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>
>>>> Joe,
>>>> Can you please send lspci -vvv output for failing port before issue
>>> occurs.
>>>> Thanks.
>>>>
>>> # lspci -s 05:00.0 -vvv
>>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (Copper) (rev 06)
>>> Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
>>> UTP Low Profile Adapter
>>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>>> Stepping- SERR- FastB2B- DisINTx+
>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>> Latency: 0, Cache Line Size: 256 bytes
>>> Interrupt: pin B routed to IRQ 80
>>> Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
>>> Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
>>> Region 2: I/O ports at dc00 [size=32]
>>> Expansion ROM at fbda [disabled] [size=128K]
>>> Capabilities: [c8] Power Management version 2
>>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
>>> ,D3hot+,D3cold+)
>>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>> Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>> Address: fee21000  Data: 40cb
>>> Capabilities: [e0] Express (v1) Endpoint, MSI 00
>>> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
>>> <512ns, L1 <64us
>>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>>> DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
>>> Unsupported-
>>> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>>> DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
>>> TransPend-
>>> LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
>>> Latency L0 <4us, L1 <64us
>>> ClockPM- Surprise- LLActRep- BwNot-
>>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
>>> CommClk-
>>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>> LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
>>> DLActive- BWMgmt- ABWMgmt-
>>> Capabilities: [100 v1] Advanced Error Reporting
>>> UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
>>> RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>>> UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>> RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>>> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
>>> UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>> CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>> CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>> AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
>>> ChkEn-
>>> Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
>>> Kernel driver in use: e1000e
>>> Kernel modules: e1000e
>>>
>>>
>>> Thanks,
>>> Joe
>>
>> was this lspci output taken on freshly booted system?
>>
>
>Yes, any issue do you find?
>
>Thanks,
>Joe
>

The device status and AER sections show some errors that look a little 
suspicious to me, but I'm not too sure. I will get back to you tomorrow.

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/11/12 15:37, Dave, Tushar N wrote:
>> -Original Message-
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Wednesday, July 11, 2012 12:18 AM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/11/12 15:11, Dave, Tushar N wrote:
>>>> -Original Message-
>>>> From: Joe Jin [mailto:joe@oracle.com]
>>>> Sent: Tuesday, July 10, 2012 10:03 PM
>>>> To: Dave, Tushar N
>>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>>> ker...@vger.kernel.org
>>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>>
>>>> On 07/11/12 12:05, Dave, Tushar N wrote:
>>>>> When you said you had this issue with RHEL5 and RHEL6 drivers, have
>>>>> you
>>>> install RHEl5/6 kernel and reproduced it? If so I think I should
>>>> install
>>>> RHEL6 and try reproduce it locally!
>>>>>
>>>> Yes I reproduced this on both RHEL5 and RHEL6.
>>>>
>>>> So far I tried to scp big file (~1GB) will hit it at once.
>>>>
>>>> Thanks,
>>>> Joe
>>>
>>> Joe,
>>> Can you please send lspci -vvv output for failing port before issue
>> occurs.
>>> Thanks.
>>>
>> # lspci -s 05:00.0 -vvv
>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>> Controller (Copper) (rev 06)
>>  Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
>> UTP Low Profile Adapter
>>  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx+
>>  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>  Latency: 0, Cache Line Size: 256 bytes
>>  Interrupt: pin B routed to IRQ 80
>>  Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
>>  Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
>>  Region 2: I/O ports at dc00 [size=32]
>>  Expansion ROM at fbda [disabled] [size=128K]
>>  Capabilities: [c8] Power Management version 2
>>  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
>> ,D3hot+,D3cold+)
>>  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>>  Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>  Address: fee21000  Data: 40cb
>>  Capabilities: [e0] Express (v1) Endpoint, MSI 00
>>  DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
>> <512ns, L1 <64us
>>  ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>>  DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
>> Unsupported-
>>  RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>  MaxPayload 128 bytes, MaxReadReq 512 bytes
>>  DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
>> TransPend-
>>  LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
>> Latency L0 <4us, L1 <64us
>>  ClockPM- Surprise- LLActRep- BwNot-
>>  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
>> CommClk-
>>  ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>  LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
>> DLActive- BWMgmt- ABWMgmt-
>>  Capabilities: [100 v1] Advanced Error Reporting
>>  UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>>  UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>>  UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
>> UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>  CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>  CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>>  AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
>> ChkEn-
>>  Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
>>  Kernel driver in use: e1000e
>>  Kernel modules: e1000e
>>
>>
>> Thanks,
>> Joe
> 
> was this lspci output taken on freshly booted system?
> 

Yes. Did you find any issue?

Thanks,
Joe



RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Wednesday, July 11, 2012 12:18 AM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 15:11, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: Joe Jin [mailto:joe@oracle.com]
>>> Sent: Tuesday, July 10, 2012 10:03 PM
>>> To: Dave, Tushar N
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> On 07/11/12 12:05, Dave, Tushar N wrote:
>>>> When you said you had this issue with RHEL5 and RHEL6 drivers, have
>>>> you
>>> install RHEl5/6 kernel and reproduced it? If so I think I should
>>> install
>>> RHEL6 and try reproduce it locally!
>>>>
>>> Yes I reproduced this on both RHEL5 and RHEL6.
>>>
>>> So far I tried to scp big file (~1GB) will hit it at once.
>>>
>>> Thanks,
>>> Joe
>>
>> Joe,
>> Can you please send lspci -vvv output for failing port before issue
>occurs.
>> Thanks.
>>
># lspci -s 05:00.0 -vvv
>05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
>Controller (Copper) (rev 06)
>   Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
>UTP Low Profile Adapter
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>Stepping- SERR- FastB2B- DisINTx+
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
><TAbort- <MAbort- >SERR- <PERR- INTx-
>   Latency: 0, Cache Line Size: 256 bytes
>   Interrupt: pin B routed to IRQ 80
>   Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
>   Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
>   Region 2: I/O ports at dc00 [size=32]
>   Expansion ROM at fbda [disabled] [size=128K]
>   Capabilities: [c8] Power Management version 2
>   Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
>,D3hot+,D3cold+)
>   Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>   Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>   Address: fee21000  Data: 40cb
>   Capabilities: [e0] Express (v1) Endpoint, MSI 00
>   DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
><512ns, L1 <64us
>   ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
>   DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
>Unsupported-
>   RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>   MaxPayload 128 bytes, MaxReadReq 512 bytes
>   DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
>TransPend-
>   LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
>Latency L0 <4us, L1 <64us
>   ClockPM- Surprise- LLActRep- BwNot-
>   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
>CommClk-
>   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>   LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
>DLActive- BWMgmt- ABWMgmt-
>   Capabilities: [100 v1] Advanced Error Reporting
>   UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>   UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>   UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
>UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>   CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>   CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>   AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
>ChkEn-
>   Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
>   Kernel driver in use: e1000e
>   Kernel modules: e1000e
>
>
>Thanks,
>Joe

Was this lspci output taken on a freshly booted system?

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/11/12 15:11, Dave, Tushar N wrote:
>> -Original Message-
>> From: Joe Jin [mailto:joe@oracle.com]
>> Sent: Tuesday, July 10, 2012 10:03 PM
>> To: Dave, Tushar N
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 07/11/12 12:05, Dave, Tushar N wrote:
>>> When you said you had this issue with RHEL5 and RHEL6 drivers, have you
>> install RHEl5/6 kernel and reproduced it? If so I think I should install
>> RHEL6 and try reproduce it locally!
>>>
>> Yes I reproduced this on both RHEL5 and RHEL6.
>>
>> So far I tried to scp big file (~1GB) will hit it at once.
>>
>> Thanks,
>> Joe
> 
> Joe,
> Can you please send lspci -vvv output for failing port before issue occurs.
> Thanks.
> 
# lspci -s 05:00.0 -vvv
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
Controller (Copper) (rev 06)
Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet UTP 
Low Profile Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
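Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin B routed to IRQ 80
Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at dc00 [size=32]
Expansion ROM at fbda [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee21000  Data: 40cb
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, 
L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ 
TransPend-
LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 
<4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
Kernel driver in use: e1000e
Kernel modules: e1000e


Thanks,
Joe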

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Tuesday, July 10, 2012 10:03 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 12:05, Dave, Tushar N wrote:
>> When you said you had this issue with RHEL5 and RHEL6 drivers, have you
>install RHEl5/6 kernel and reproduced it? If so I think I should install
>RHEL6 and try reproduce it locally!
>>
>Yes I reproduced this on both RHEL5 and RHEL6.
>
>So far I tried to scp big file (~1GB) will hit it at once.
>
>Thanks,
>Joe

Joe,
Can you please send the lspci -vvv output for the failing port from before the 
issue occurs?
Thanks.

-Tushar

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 12:18 AM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 15:11, Dave, Tushar N wrote:
 -Original Message-
 From: Joe Jin [mailto:joe@oracle.com]
 Sent: Tuesday, July 10, 2012 10:03 PM
 To: Dave, Tushar N
 Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
 ker...@vger.kernel.org
 Subject: Re: 82571EB: Detected Hardware Unit Hang

 On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, have
 you
 install RHEl5/6 kernel and reproduced it? If so I think I should
 install
 RHEL6 and try reproduce it locally!

 Yes I reproduced this on both RHEL5 and RHEL6.

 So far I tried to scp big file (~1GB) will hit it at once.

 Thanks,
 Joe

 Joe,
 Can you please send lspci -vvv output for failing port before issue
occurs.
 Thanks.

# lspci -s 05:00.0 -vvv
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (Copper) (rev 06)
   Subsystem: Oracle Corporation x4 PCI-Express Quad Gigabit Ethernet
UTP Low Profile Adapter
   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort-
TAbort- MAbort- SERR- PERR- INTx-
   Latency: 0, Cache Line Size: 256 bytes
   Interrupt: pin B routed to IRQ 80
   Region 0: Memory at fbde (32-bit, non-prefetchable) [size=128K]
   Region 1: Memory at fbdc (32-bit, non-prefetchable) [size=128K]
   Region 2: I/O ports at dc00 [size=32]
   Expansion ROM at fbda [disabled] [size=128K]
   Capabilities: [c8] Power Management version 2
   Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-
,D3hot+,D3cold+)
   Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
   Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
   Address: fee21000  Data: 40cb
   Capabilities: [e0] Express (v1) Endpoint, MSI 00
   DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
512ns, L1 64us
   ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
   DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
   RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
   MaxPayload 128 bytes, MaxReadReq 512 bytes
   DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+
TransPend-
   LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L0s,
Latency L0 4us, L1 64us
   ClockPM- Surprise- LLActRep- BwNot-
   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain-
CommClk-
   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
   Capabilities: [100 v1] Advanced Error Reporting
   UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
   UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
   UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
   CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
   CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
   AERCap: First Error Pointer: 12, GenCap- CGenEn- ChkCap-
ChkEn-
   Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-b9-77-9c
   Kernel driver in use: e1000e
   Kernel modules: e1000e


Thanks,
Joe

Was this lspci output taken on a freshly booted system?

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/11/12 15:37, Dave, Tushar N wrote:
> was this lspci output taken on freshly booted system?
>

Yes. Did you find any issue?

Thanks,
Joe



RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 12:39 AM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 15:37, Dave, Tushar N wrote:

 was this lspci output taken on freshly booted system?


Yes, any issue do you find?

Thanks,
Joe


The Device Status and AER sections show some errors that look a little
suspicious to me, but I'm not too sure. I will get back to you tomorrow.

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/11/12 15:50, Dave, Tushar N wrote:
> Device status and AER sections show some errors that looks little suspicious 
> to me but I'm not too sure. I will get back tomorrow.
>

Thanks a lot, Tushar!

Joe


-- 
Oracle http://www.oracle.com
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 



RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 10:03 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 12:05, Dave, Tushar N wrote:
 When you said you had this issue with RHEL5 and RHEL6 drivers, have you
install RHEl5/6 kernel and reproduced it? If so I think I should install
RHEL6 and try reproduce it locally!

Yes I reproduced this on both RHEL5 and RHEL6.

So far I tried to scp big file (~1GB) will hit it at once.

Thanks,
Joe

Joe,

I see a couple of errors in the lspci output.
The Device Status register (in the PCI Express capability) shows an
uncorrectable PCIe error. This means something has certainly gone wrong. The
only way to recover from uncorrectable errors is a reset.

DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-

Also, the AER section in the lspci output shows a PCIe completion timeout.

Capabilities: [100 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-

I suggest you load the AER driver and check the log for any error messages.
Please also check the BIOS log for any error messages reported by the system.
Are there any machine check errors?

When did you first notice this issue? Has the 82571 ever worked on this
server?

One more thing: a cache line size of 256 is a little unusual (I have never
seen this value before; mostly it's 64). Have the BIOS settings been changed?
Are you using the default BIOS settings?

Thanks.

-Tushar
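
For readers following the analysis above, here is a minimal sketch of how the
flags in question can be pulled out of lspci text. It is not part of lspci or
the e1000e driver; the helper names (parse_flags, set_flags, unmasked_errors)
are made up for illustration, and the sample lines are copied from the lspci
output for 05:00.0 posted in this thread, with wrapped lines joined. In lspci
output a trailing + means the bit is set and - means it is clear, and a bit
set in UEMsk means that error is masked from reporting.

```python
# Illustrative only: the helper names below are hypothetical, not an lspci API.


def parse_flags(line):
    """Map each 'Name+' / 'Name-' token in an lspci line to True / False."""
    flags = {}
    for token in line.split():
        if len(token) > 1 and token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def set_flags(line):
    """Return the names of the flags that are set (+), in order of appearance."""
    return [name for name, is_set in parse_flags(line).items() if is_set]


def unmasked_errors(status_line, mask_line):
    """Errors that are set in the status line and not masked in the mask line."""
    mask = parse_flags(mask_line)
    return [name for name in set_flags(status_line) if not mask.get(name, False)]


# Sample lines taken from the lspci -vvv output in this thread.
DEVSTA = "DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-"
UESTA = ("UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- "
         "RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-")
UEMSK = ("UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- "
         "RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-")

if __name__ == "__main__":
    print("DevSta set:", set_flags(DEVSTA))
    print("UESta set:", set_flags(UESTA))
    print("UESta set and unmasked:", unmasked_errors(UESTA, UEMSK))
```

On these sample lines the sketch reports UncorrErr, UnsuppReq and AuxPwr for
DevSta, and CmpltTO and MalfTLP as the uncorrectable errors that are set and
not masked.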


Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 02:51, Dave, Tushar N wrote:
 
> Joe,
>
> I see couple of errors in lspci output.
> Device capability status register shows UnCorrectable PCIe error. This means 
> there is certainly something went wrong. The only way to recover from 
> Uncorrectable errors is reset.
>
>   DevSta: CorrErr- *UncorrErr+ FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
>
> Also AER sections in lspci output shows PCIe completion timeout.
>
>   Capabilities: [100 v1] Advanced Error Reporting
>   UESta:  DLP- SDES- TLP- FCP- *CmpltTO+ CmpltAbrt- UnxCmplt- 
> RxOF- MalfTLP+ ECRC- UnsupReq+ ACSViol-
>
> I suggest you should load AER driver and check for any error messages in log. 
> Also please check any error message reported by system in BIOS log. Are there 
> any machine check errors? 
>
> When did you notice this issue? have 82571 ever been working before on this 
> server?
>
> One more thing, Cache line size 256 is little unusual( I never seen this 
> value before, mostly it's 64). Does BIOS settings have been changed? Are you 
> using default BIOS setting?
>

I checked the BIOS log and found a fault reported for the device. I changed
the PCI-E Payload Size from 256 (default) to 128, and now the device works.

Comparing the lspci output, I found that the Address and Data of the MSI
capability have changed:

Old:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee21000  Data: 40cb

New:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee24000  Data: 405c

Most likely it's a BIOS bug? Please comment.

Thanks,
Joe
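
As a rough illustration of the setting being changed here, the sketch below
reads the two MaxPayload values out of lspci -vvv text: the size the device
advertises (DevCap) and the size currently programmed (DevCtl). The function
name max_payload is hypothetical, and the sample is the Express capability
block from the lspci output earlier in this thread, with wrapped lines joined.
It only parses text and does not change any device setting.

```python
import re

# Illustrative only: max_payload is a hypothetical helper, not an lspci API.


def max_payload(lspci_text):
    """Return {'DevCap': bytes, 'DevCtl': bytes} found in lspci -vvv output."""
    sizes = {}
    section = None
    for line in lspci_text.splitlines():
        # A line such as "DevCap: ..." or "DevCtl: ..." starts a new section;
        # continuation lines (no leading "Name:") stay in the current one.
        header = re.match(r"\s*(\w+):", line)
        if header:
            section = header.group(1)
        payload = re.search(r"MaxPayload (\d+) bytes", line)
        if payload and section in ("DevCap", "DevCtl"):
            sizes[section] = int(payload.group(1))
    return sizes


# Express capability block as posted in this thread (wrapped lines joined).
SAMPLE = """\
Capabilities: [e0] Express (v1) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 512ns, L1 64us
        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
        MaxPayload 128 bytes, MaxReadReq 512 bytes
"""

if __name__ == "__main__":
    print(max_payload(SAMPLE))
```

Running the same parse on lspci output saved before and after a BIOS change
would be one way to confirm which payload size is actually in effect.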


RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 7:23 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

I checked BIOS's log found the fault from the device, I changed PCI-E
Payload Size
from 256(default) to 128, now the device works.

I compared lspci output found Address for data of MSI Capabilities's be
changed:

Old:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee21000  Data: 40cb

New:
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee24000  Data: 405c

Mostly like it's a BIOS bug? please comments.

Thanks,
Joe

What are the exact error messages in the BIOS log?


Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 10:52, Dave, Tushar N wrote:
> What is the exact error messages in BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
PCI Express Non-Fatal Error

Thanks,
Joe

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 7:58 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/12/12 10:52, Dave, Tushar N wrote:
 What is the exact error messages in BIOS log?

Error message from BIOS event log:
07/12/12 05:54:00
PCI Express Non-Fatal Error

Thanks,
Joe

Thanks. I will check with the team tomorrow whether this (the max payload
size change) can be treated as a solution to this issue.
We could learn more about exactly which non-fatal error occurred if we
captured a bus trace.
We should also check the EEPROM on this device to make sure it is up to date.
Send me the full EEPROM dump in a file and I will confirm with the team that
it is up to date.
Thanks for your work.

-Tushar

Re: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Joe Jin

On 07/12/12 11:07, Dave, Tushar N wrote:
> Thanks.  Well, I will check with team tomorrow if this  (max payload size) 
> can be treated as solution to this issue. 
> We can know more about what exact non-fatal error occurred if we capture bus 
> trace.
> We should check the eeprom on this device to make sure they are up-to-date.
> Send me the full eeprom dump in a file and I will confirm with team that it 
> is up-to-date.
> Thanks for your work.
>

Hi Tushar,

Please find the EEPROM dump in the attachment.

Thanks a lot for your help,
Joe
attachment: eeprom.raw

Offset  Values
------  ------
0x0000  00 15 17 b9 77 9c 24 05 ff ff a2 50 ff ff ff ff
0x0010  01 d9 04 97 2f 24 bc 11 8e 10 bc 10 86 80 65 b1 
0x0020  08 00 bc 10 00 58 00 00 01 50 00 00 00 00 00 01 
0x0030  f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06 
0x0040  08 00 f0 1e 64 21 40 00 01 48 00 00 00 00 00 00 
0x0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0060  00 01 00 40 2a 12 07 40 00 01 00 40 ff ff ff ff 
0x0070  ff ff ff ff ff ff ff ff ff ff 97 01 ff ff 4b e8 
0x0080  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0090  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00a0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00b0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00c0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00d0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00e0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x00f0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0100  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0110  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0120  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0130  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0140  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0150  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0160  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0170  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0180  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0190  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01a0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01b0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01c0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01d0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x01e0  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 6f 
0x01f0  87 04 00 00 00 00 00 00 00 00 00 00 ff ff ff 16 
0x0200  03 00 22 00 00 00 07 30 00 e5 49 00 df 61 15 34 
0x0210  00 36 81 2f 04 50 00 3b 15 34 00 36 04 60 00 49 
0x0220  15 34 00 36 04 70 00 9a 15 34 00 36 04 80 00 27 
0x0230  15 34 00 36 05 40 00 c1 47 02 00 14 00 10 04 24 
0x0240  00 e1 00 14 00 10 38 02 00 15 3f 04 5b 2f 3b 04 
0x0250  1b 00 32 04 87 00 3f 04 70 2f 30 04 a4 a8 3f 04 
0x0260  90 2f 30 04 c0 0e 3f 04 11 20 31 04 20 04 3f 04 
0x0270  00 00 20 04 40 01 3f 04 7a 18 1a 04 00 08 3f 04 
0x0280  30 1f 30 04 06 16 35 04 2a 01 3e 04 67 00 3f 04 
0x0290  54 1f 34 04 65 00 35 04 2a 00 36 04 2a 00 3f 04 
0x02a0  72 1f 32 04 b0 3f 36 04 ff c0 37 04 ec 1d 38 04 
0x02b0  ef f9 39 04 10 02 3c 06 00 0c 3f 04 95 18 35 04 
0x02c0  03 00 3f 04 96 17 36 04 08 00 3f 04 98 1f 38 04 
0x02d0  08 d0 3f 04 00 00 20 04 40 13 3f 04 5b 2f 3b 04 
0x02e0  18 90 32 04 00 00 3f 04 70 2f 30 04 e4 29 3f 04 
0x02f0  90 2f 30 04 c0 06 3f 04 11 20 31 04 00 04 30 04 
0x0300  b0 10 3f 04 b1 2f 31 04 24 8d 32 04 f0 f8 3f 04 
0x0310  dc 20 3c 04 00 00 3d 04 0a 00 3e 04 d3 00 3f 04 
0x0320  b4 28 34 04 ce 04 3f 04 00 00 20 04 40 13 69 53 
0x0330  e0 05 01 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0340  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0350  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0360  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0370  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0380  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0390  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x03a0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x03b0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x03c0  00 00 00 00 00 00 00 00
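
A small sketch for reading a dump like the one above, in case it helps anyone
cross-check it. It assumes the `ethtool -e` text layout shown here (an offset
column followed by hex bytes) and, as far as I know, that e1000-family NICs
keep the MAC address in the first six EEPROM bytes. The function names
(parse_dump, mac_address, word_le) are hypothetical, and the sample is just
the first two rows of the dump, with the first offset written as 0x0000.

```python
# Illustrative only: helper names are hypothetical, not an ethtool API.


def parse_dump(text):
    """Collect the hex bytes from `ethtool -e` style rows into a bytes object."""
    data = bytearray()
    for line in text.splitlines():
        tokens = line.split()
        # Data rows start with an offset such as 0x0010; skip header rows.
        if not tokens or not tokens[0].startswith("0x"):
            continue
        data.extend(int(token, 16) for token in tokens[1:])
    return bytes(data)


def mac_address(data):
    """Format the first six bytes as a colon-separated MAC address."""
    return ":".join(f"{byte:02x}" for byte in data[:6])


def word_le(data, offset):
    """Read a 16-bit little-endian word starting at the given byte offset."""
    return data[offset] | (data[offset + 1] << 8)


# First two rows of the dump posted in this message.
SAMPLE = """\
Offset  Values
------  ------
0x0000  00 15 17 b9 77 9c 24 05 ff ff a2 50 ff ff ff ff
0x0010  01 d9 04 97 2f 24 bc 11 8e 10 bc 10 86 80 65 b1
"""

if __name__ == "__main__":
    eeprom = parse_dump(SAMPLE)
    print(mac_address(eeprom))
    print(hex(word_le(eeprom, 0x1c)))
```

The MAC bytes 00:15:17:b9:77:9c line up with the Device Serial Number
00-15-17-ff-ff-b9-77-9c in the lspci output, and the bytes 86 80 at offset
0x1c read as 0x8086, Intel's PCI vendor ID.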

RE: 82571EB: Detected Hardware Unit Hang

2012-07-11 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Wednesday, July 11, 2012 8:13 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

Hi Tushar,

Please find eeprom from attachment.

Do you have an lspci -vvv dump of the entire system from before and after the
issue occurs? If so, can you send it to me?


Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 12:05, Dave, Tushar N wrote:
> When you said you had this issue with RHEL5 and RHEL6 drivers, have you 
> install RHEl5/6 kernel and reproduced it? If so I think I should install 
> RHEL6 and try reproduce it locally!
> 
Yes I reproduced this on both RHEL5 and RHEL6.

So far, trying to scp a big file (~1GB) hits it at once.

Thanks,
Joe

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N

>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Tuesday, July 10, 2012 8:29 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 11:22, Dave, Tushar N wrote:
>> Thanks for info. I see that hang occurs right when HW processing first
>TX descriptor with TSO.
>> Would you be able to reproduce issue with TSO off?  Disable TSO by
>'ethtool -K ethx tso off'
>> Let all debug enabled as it is,  that will help us debug further if
>issue occurs with TSO off.
>
>Hi Tushar,
>
>Thanks for you quick reply but disabled tso no help for this issue:

Thanks for running a quick test. I don't see anything obviously wrong in the
descriptor dump.

When you said you had this issue with the RHEL5 and RHEL6 drivers, did you
install the RHEL5/6 kernel and reproduce it there? If so, I think I should
install RHEL6 and try to reproduce it locally!

-Tushar


>
># ethtool -k eth0
>Offload parameters for eth0:
>rx-checksumming: on
>tx-checksumming: on
>scatter-gather: on
>tcp segmentation offload: off
>udp fragmentation offload: off
>generic segmentation offload: on
>generic-receive-offload: on
>
>kernel log after disable tso:
>
>e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>  TDH  <1>
>  TDT  <4>
>  next_to_use  <4>
>  next_to_clean<1>
>buffer_info[next_to_clean]:
>  time_stamp   <103ae0aba>
>  next_to_watch<1>
>  jiffies  <103ae16a0>
>  next_to_watch.status <0>
>MAC Status <80387>
>PHY Status <792d>
>PHY 1000BASE-T Status  <3c00>
>PHY Extended Status<3000>
>PCI Status <10>
>e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>  TDH  <1>
>  TDT  <4>
>  next_to_use  <4>
>  next_to_clean<1>
>buffer_info[next_to_clean]:
>  time_stamp   <103ae0aba>
>  next_to_watch<1>
>  jiffies  <103ae2640>
>  next_to_watch.status <0>
>MAC Status <80387>
>PHY Status <792d>
>PHY 1000BASE-T Status  <3c00>
>PHY Extended Status<3000>
>PCI Status <10>
>e1000e :05:00.0: Net device Info
>e1000e: Device Name statetrans_start  last_rx
>e1000e: eth00003 000103AE128A 
>e1000e :05:00.0: Register Dump
>e1000e:  Register Name   Value
>e1000e: CTRL180c0241
>e1000e: STATUS  00080387
>e1000e: CTRL_EXT181400c0
>e1000e: ICR 0040
>e1000e: RCTL04048002
>e1000e: RDLEN   1000
>e1000e: RDH 0090
>e1000e: RDT 0080
>e1000e: RDTR0020
>e1000e: RXDCTL[0-1] 01040420 01040420
>e1000e: ERT 
>e1000e: RDBAL   23852000
>e1000e: RDBAH   000c
>e1000e: RDFH075a
>e1000e: RDFT0752
>e1000e: RDFHS   0758
>e1000e: RDFTS   0752
>e1000e: RDFPC   01b4
>e1000e: TCTL3003f00a
>e1000e: TDBAL   1210c000
>e1000e: TDBAH   000c
>e1000e: TDLEN   1000
>e1000e: TDH 0001
>e1000e: TDT 0004
>e1000e: TIDV0008
>e1000e: TXDCTL[0-1] 0145011f 0145011f
>e1000e: TADV0020
>e1000e: TARC[0-1]   07a00403 07400403
>e1000e: TDFH1308
>e1000e: TDFT1308
>e1000e: TDFHS   1308
>e1000e: TDFTS   1308
>e1000e: TDFPC   
>e1000e :05:00.0: Tx Ring Summary
>e1000e: Queue [NTU] [NTC] [bi(ntc)->dma  ] leng ntw timestamp
>e1000e:  0 4 1 000620800C02 002A   1 000103AE0ABA
>e1000e :05:00.0: Tx Ring Dump
>e1000e: Tl[desc] [address 63:0  ] [SpeCssSCmCsLen] [bi->dma   ]
>leng  ntw timestampbi->skb <-- Legacy format
>e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi->dma   ]
>leng  ntw timestampbi->skb <-- Ext Context format
>e1000e: Td[desc] [address 63:0  ] [VlaPoRSCm1Dlen] [bi->dma   ]
>leng  ntw timestampbi->skb <-- Ext Data format
>e1000e: Tl[0x000]000C1AA0F002 8B2A 
>002A0  (null)
>e1000e: Tl[0x001]000620800C02 8B2A 000620800C02
>002A1 000103AE0ABA 88061c6b6980 NTC
>e1000e: Tl[0x002]

Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 11:22, Dave, Tushar N wrote:
> Thanks for info. I see that hang occurs right when HW processing first TX 
> descriptor with TSO.
> Would you be able to reproduce issue with TSO off?  Disable TSO by 'ethtool 
> -K ethx tso off'
> Let all debug enabled as it is,  that will help us debug further if issue 
> occurs with TSO off.

Hi Tushar,

Thanks for your quick reply, but disabling TSO does not help with this issue:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on
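
A quick way to double-check the offload state from a script, sketched under
the assumption that `ethtool -k` prints one "name: on|off" pair per line as in
the output above. The function name parse_offloads is hypothetical and the
sample text is the output just shown.

```python
# Illustrative only: parse_offloads is a hypothetical helper, not an ethtool API.


def parse_offloads(text):
    """Turn `ethtool -k` style 'name: on|off' lines into {name: bool}."""
    offloads = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "Offload parameters for ethX:" header and anything else
        # that does not end in a plain on/off value.
        if sep and value in ("on", "off"):
            offloads[name.strip()] = (value == "on")
    return offloads


# Output as posted above.
SAMPLE = """\
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on
"""

if __name__ == "__main__":
    state = parse_offloads(SAMPLE)
    print("TSO enabled:", state["tcp segmentation offload"])
    print("GSO enabled:", state["generic segmentation offload"])
```

Note that in this output TSO is off while generic segmentation offload is
still on.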

Kernel log after disabling TSO:

e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH  <1>
  TDT  <4>
  next_to_use  <4>
  next_to_clean<1>
buffer_info[next_to_clean]:
  time_stamp   <103ae0aba>
  next_to_watch<1>
  jiffies  <103ae16a0>
  next_to_watch.status <0>
MAC Status <80387>
PHY Status <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status<3000>
PCI Status <10>
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH  <1>
  TDT  <4>
  next_to_use  <4>
  next_to_clean<1>
buffer_info[next_to_clean]:
  time_stamp   <103ae0aba>
  next_to_watch<1>
  jiffies  <103ae2640>
  next_to_watch.status <0>
MAC Status <80387>
PHY Status <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status<3000>
PCI Status <10>
e1000e :05:00.0: Net device Info
e1000e: Device Name statetrans_start  last_rx
e1000e: eth00003 000103AE128A 
e1000e :05:00.0: Register Dump
e1000e:  Register Name   Value
e1000e: CTRL180c0241
e1000e: STATUS  00080387
e1000e: CTRL_EXT181400c0
e1000e: ICR 0040
e1000e: RCTL04048002
e1000e: RDLEN   1000
e1000e: RDH 0090
e1000e: RDT 0080
e1000e: RDTR0020
e1000e: RXDCTL[0-1] 01040420 01040420
e1000e: ERT 
e1000e: RDBAL   23852000
e1000e: RDBAH   000c
e1000e: RDFH075a
e1000e: RDFT0752
e1000e: RDFHS   0758
e1000e: RDFTS   0752
e1000e: RDFPC   01b4
e1000e: TCTL3003f00a
e1000e: TDBAL   1210c000
e1000e: TDBAH   000c
e1000e: TDLEN   1000
e1000e: TDH 0001
e1000e: TDT 0004
e1000e: TIDV0008
e1000e: TXDCTL[0-1] 0145011f 0145011f
e1000e: TADV0020
e1000e: TARC[0-1]   07a00403 07400403
e1000e: TDFH1308
e1000e: TDFT1308
e1000e: TDFHS   1308
e1000e: TDFTS   1308
e1000e: TDFPC   
e1000e :05:00.0: Tx Ring Summary
e1000e: Queue [NTU] [NTC] [bi(ntc)->dma  ] leng ntw timestamp
e1000e:  0 4 1 000620800C02 002A   1 000103AE0ABA
e1000e :05:00.0: Tx Ring Dump
e1000e: Tl[desc] [address 63:0  ] [SpeCssSCmCsLen] [bi->dma   ] leng ntw timestamp bi->skb <-- Legacy format
e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi->dma   ] leng ntw timestamp bi->skb <-- Ext Context format
e1000e: Td[desc] [address 63:0  ] [VlaPoRSCm1Dlen] [bi->dma   ] leng ntw timestamp bi->skb <-- Ext Data format
e1000e: Tl[0x000] 000C1AA0F002 8B2A  002A 0  (null)
e1000e: Tl[0x001] 000620800C02 8B2A 000620800C02 002A 1 000103AE0ABA 88061c6b6980 NTC
e1000e: Tl[0x002] 00061E6DBC02 8B2A 00061E6DBC02 002A 2 000103AE0EA2 88061c6b6880
e1000e: Tl[0x003] 000620A6C402 8B2A 000620A6C402 002A 3 000103AE128A 8806230b4080
e1000e: Tl[0x004] 0  (null) NTU
e1000e: Tl[0x005] 0  (null)
e1000e: Tl[0x006] 0  (null)
e1000e: Tl[0x007] 0  (null)
e1000e: Tl[0x008] 0  (null)
e1000e: Tl[0x009] 0  (null)
e1000e: Tl[0x00A] 0  (null)
e1000e: Tl[0x00B] 0  (null)
e1000e: Tl[0x00C] 0  (null)
e1000e:
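
To make the numbers in the hang report above easier to read, here is a rough Python sketch of the arithmetic. It rests on assumptions that are not stated in the log: that the TDLEN value "1000" is hexadecimal bytes (0x1000), and that each TX descriptor occupies 16 bytes. The helper names are hypothetical, and converting jiffies to seconds would additionally require the kernel's HZ setting, which the log does not give.

```python
# Sketch only: helper names are hypothetical, not part of the e1000e driver.
TX_DESC_SIZE = 16  # assumed size in bytes of one e1000 TX descriptor


def ring_entries(tdlen_bytes: int) -> int:
    """Number of descriptors in the TX ring, derived from the TDLEN register."""
    return tdlen_bytes // TX_DESC_SIZE


def pending_descriptors(tdh: int, tdt: int, entries: int) -> int:
    """Descriptors handed to hardware (tail) but not yet fetched (head).

    The modulo handles the case where the tail has wrapped around the ring.
    """
    return (tdt - tdh) % entries


def stalled_jiffies(time_stamp: int, jiffies: int) -> int:
    """Jiffies elapsed since the oldest uncleaned buffer was queued."""
    return jiffies - time_stamp


if __name__ == "__main__":
    entries = ring_entries(0x1000)            # TDLEN from the register dump
    print(entries, pending_descriptors(0x1, 0x4, entries))
    print(stalled_jiffies(0x103AE0ABA, 0x103AE16A0))  # first hang report
    print(stalled_jiffies(0x103AE0ABA, 0x103AE2640))  # second hang report
```

Under those assumptions the ring holds 256 descriptors, three of them (0x001 to 0x003) are waiting on the hardware, and the oldest one has been stuck for 3046 jiffies at the first report and 7046 jiffies at the second.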

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N


>-Original Message-
>From: Joe Jin [mailto:joe@oracle.com]
>Sent: Tuesday, July 10, 2012 5:35 PM
>To: Dave, Tushar N
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 07/11/12 03:02, Dave, Tushar N wrote:
>>> -Original Message-
>>> From: netdev-ow...@vger.kernel.org
>>> [mailto:netdev-ow...@vger.kernel.org]
>>> On Behalf Of Joe Jin
>>> Sent: Tuesday, July 10, 2012 12:40 AM
>>> To: Joe Jin
>>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>> ker...@vger.kernel.org
>>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>>
>>> When I debug the driver I found before Detected HW hang, driver
>>> unable to clean and reclaim the resources:
>>>
>>> 1457 while ((eop_desc->upper.data &
>>> cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== at here upper.data always is
>0x300
>>> 1458(count < tx_ring->count)) {
>>> <--- snip --->
>>> 1487 }
>>>
>>>
>>> I checked all driver codes I did not found anywhere will set the
>>> upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by
>hardware?
>>
>> Yes upper.data (part of it is STATUS byte) is set by HW. Basically
>driver checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set
>that means HW has processed that descriptor and driver can now clean that
>descriptor.
>> With value 0x300 , DD bit is not set. That means HW has not processed
>that descriptor.
>
>Thanks for the clarify, might be firmware issue?
>>
>> How fast does tx hang reproduce? I suggest you to enable debug code in
>driver so when tx hang occurs it will dump the HW desc ring info into
>kernel log.
>
>Once I copy a file from other server, issue to be reproduced at once.
>I'll enable the debug to get more debug info.
>
>> You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
>> Once tx hang occurs please send me the full dmesg log.
>>
>> Does tx hang occur with in-kernel e1000e driver too?
>
>I tried several drivers included rhel5 the latest, Intel the latest,
>rhel6 the latest, issue see on all those drivers.

Also, after the issue occurs, please capture the output of lspci -vvv (run as root).

>
>Thanks,
>Joe
>>
>> Thanks.
>>
>> -Tushar
>>
>>
>>> If OS is 32bit system, what which happen?
>>
>>
>>>
>>> Thanks in advance,
>>> Joe
>>>
>>> On 07/09/12 16:51, Joe Jin wrote:
>>>> Hi list,
>>>>
>>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>>>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270
>>>> M2, just copy a big file (>500M) from another server will hit it at
>once.
>>>>
>>>> Would you please help on this?
>>>>
>>>> device info:
>>>> # lspci -s 05:00.0
>>>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>>> Ethernet Controller (Copper) (rev 06)
>>>>
>>>> # lspci -s 05:00.0 -n
>>>> 05:00.0 0200: 8086:10bc (rev 06)
>>>>
>>>> # ethtool -i eth0
>>>> driver: e1000e
>>>> version: 2.0.0-NAPI
>>>> firmware-version: 5.10-2
>>>> bus-info: :05:00.0
>>>>
>>>> # ethtool -k eth0
>>>> Offload parameters for eth0:
>>>> rx-checksumming: on
>>>> tx-checksumming: on
>>>> scatter-gather: on
>>>> tcp segmentation offload: on
>>>> udp fragmentation offload: off
>>>> generic segmentation offload: on
>>>> generic-receive-offload: on
>>>>
>>>> kernel log:
>>>> ---
>>>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>>>   TDH  <6c>
>>>>   TDT  <81>
>>>>   next_to_use  <81>
>>>>   next_to_clean<6b>
>>>> buffer_info[next_to_clean]:
>>>>   time_stamp   <fffc7a23>
>>>>   next_to_watch<71>
>>>>   jiffies  <fffc8c0c>
>>>>   next_to_watch.status <0>
>>>> MAC Status <80387>
>>>> PHY Status <792d>
>>>> PHY 1000BASE-T Status  <3c00>
>>>> PHY Extended Status<3000>
>>>> PCI Status <

Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 03:02, Dave, Tushar N wrote:
>> -Original Message-
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of Joe Jin
>> Sent: Tuesday, July 10, 2012 12:40 AM
>> To: Joe Jin
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> When I debug the driver I found before Detected HW hang, driver unable to
>> clean and reclaim the resources:
>>
>> 1457 while ((eop_desc->upper.data &
>> cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== at here upper.data always is 0x300
>> 1458(count < tx_ring->count)) {
>> <--- snip --->
>> 1487 }
>>
>>
>> I checked all driver codes I did not found anywhere will set the
>> upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?
> 
> Yes upper.data (part of it is STATUS byte) is set by HW. Basically driver 
> checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set that means 
> HW has processed that descriptor and driver can now clean that descriptor.
> With value 0x300 , DD bit is not set. That means HW has not processed that 
> descriptor.

Thanks for the clarification. Might this be a firmware issue?
> 
> How fast does tx hang reproduce? I suggest you to enable debug code in driver 
> so when tx hang occurs it will dump the HW desc ring info into kernel log.

As soon as I copy a file from another server, the issue reproduces.
I'll enable debugging to get more information.

> You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
> Once tx hang occurs please send me the full dmesg log.
> 
> Does tx hang occur with in-kernel e1000e driver too?

I tried several drivers, including the latest from RHEL5, Intel, and
RHEL6; the issue occurs with all of them.

Thanks,
Joe 
> 
> Thanks.
> 
> -Tushar
> 
> 
>> If OS is 32bit system, what which happen?
> 
> 
>>
>> Thanks in advance,
>> Joe
>>
>> On 07/09/12 16:51, Joe Jin wrote:
>>> Hi list,
>>>
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
>>> just copy a big file (>500M) from another server will hit it at once.
>>>
>>> Would you please help on this?
>>>
>>> device info:
>>> # lspci -s 05:00.0
>>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (Copper) (rev 06)
>>>
>>> # lspci -s 05:00.0 -n
>>> 05:00.0 0200: 8086:10bc (rev 06)
>>>
>>> # ethtool -i eth0
>>> driver: e1000e
>>> version: 2.0.0-NAPI
>>> firmware-version: 5.10-2
>>> bus-info: :05:00.0
>>>
>>> # ethtool -k eth0
>>> Offload parameters for eth0:
>>> rx-checksumming: on
>>> tx-checksumming: on
>>> scatter-gather: on
>>> tcp segmentation offload: on
>>> udp fragmentation offload: off
>>> generic segmentation offload: on
>>> generic-receive-offload: on
>>>
>>> kernel log:
>>> ---
>>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>>   TDH  <6c>
>>>   TDT  <81>
>>>   next_to_use  <81>
>>>   next_to_clean<6b>
>>> buffer_info[next_to_clean]:
>>>   time_stamp   <fffc7a23>
>>>   next_to_watch<71>
>>>   jiffies  <fffc8c0c>
>>>   next_to_watch.status <0>
>>> MAC Status <80387>
>>> PHY Status <792d>
>>> PHY 1000BASE-T Status  <3c00>
>>> PHY Extended Status<3000>
>>> PCI Status <10>
>>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>>   TDH  <6c>
>>>   TDT  <81>
>>>   next_to_use  <81>
>>>   next_to_clean<6b>
>>> buffer_info[next_to_clean]:
>>>   time_stamp   <fffc7a23>
>>>   next_to_watch<71>
>>>   jiffies  <fffc9bac>
>>>   next_to_watch.status <0>
>>> MAC Status <80387>
>>> PHY Status <792d>
>>> PHY 1000BASE-T Status  <3c00>
>>> PHY Extended Status<3000>
>>> PCI Status <10>
>>> [ cut here ]
>>> WARNING: at net/sched/sch_gener

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N


>-Original Message-
>From: Dave, Tushar N
>Sent: Tuesday, July 10, 2012 12:02 PM
>To: Joe Jin
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org; Dave, Tushar N
>Subject: RE: 82571EB: Detected Hardware Unit Hang
>
>>-Original Message-
>>From: netdev-ow...@vger.kernel.org
>>[mailto:netdev-ow...@vger.kernel.org]
>>On Behalf Of Joe Jin
>>Sent: Tuesday, July 10, 2012 12:40 AM
>>To: Joe Jin
>>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>>ker...@vger.kernel.org
>>Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>>When I debug the driver I found before Detected HW hang, driver unable
>>to clean and reclaim the resources:
>>
>>1457 while ((eop_desc->upper.data &
>>cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== at here upper.data always is
>0x300
>>1458(count < tx_ring->count)) {
>> <--- snip --->
>>1487 }
>>
>>
>>I checked all driver codes I did not found anywhere will set the
>>upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?
>
>Yes upper.data (part of it is STATUS byte) is set by HW. Basically driver
>checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set that
>means HW has processed that descriptor and driver can now clean that
>descriptor.
>With value 0x300 , DD bit is not set. That means HW has not processed that
>descriptor.
>
>How fast does tx hang reproduce? I suggest you to enable debug code in
>driver so when tx hang occurs it will dump the HW desc ring info into
>kernel log.
>You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
>Once tx hang occurs please send me the full dmesg log.
>
>Does tx hang occur with in-kernel e1000e driver too?
>
>Thanks.
>
>-Tushar
One change: please use "ethtool -s ethx msglvl 0x2c01" instead, to keep the default
'drv' msglvl enabled.
Confirm that the message level is set correctly by running 'ethtool ethx'.
The last few lines of the output will be:

Current message level: 0x2c01 (11265)
   drv tx_done rx_status hw
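
For reference, the message level is a bitmask, and the flag names ethtool prints can be recovered from the number. The following Python sketch decodes it; the bit assignments follow the kernel's NETIF_MSG_* definitions as I understand them, while the table and function names here are hypothetical and for illustration only.

```python
# Sketch only: NETIF_MSG_FLAGS and decode_msglvl are illustrative names.
# Bit values are believed to match the kernel's NETIF_MSG_* flags.
NETIF_MSG_FLAGS = [
    ("drv", 0x0001),
    ("probe", 0x0002),
    ("link", 0x0004),
    ("timer", 0x0008),
    ("ifdown", 0x0010),
    ("ifup", 0x0020),
    ("rx_err", 0x0040),
    ("tx_err", 0x0080),
    ("tx_queued", 0x0100),
    ("intr", 0x0200),
    ("tx_done", 0x0400),
    ("rx_status", 0x0800),
    ("pktdata", 0x1000),
    ("hw", 0x2000),
    ("wol", 0x4000),
]


def decode_msglvl(level: int) -> list[str]:
    """Return the names of all message-level flags set in `level`."""
    return [name for name, bit in NETIF_MSG_FLAGS if level & bit]


if __name__ == "__main__":
    print(decode_msglvl(0x2C01))  # level suggested in this message
    print(decode_msglvl(0x2C00))  # level suggested earlier, without 'drv'
```

Decoding 0x2c01 this way gives drv, tx_done, rx_status and hw, which agrees with the ethtool output shown above; 0x2c00 gives the same set without drv.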


>
>
>>If OS is 32bit system, what which happen?
>
>
>>
>>Thanks in advance,
>>Joe
>>
>>On 07/09/12 16:51, Joe Jin wrote:
>>> Hi list,
>>>
>>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270
>>> M2, just copy a big file (>500M) from another server will hit it at
>once.
>>>
>>> Would you please help on this?
>>>
>>> device info:
>>> # lspci -s 05:00.0
>>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (Copper) (rev 06)
>>>
>>> # lspci -s 05:00.0 -n
>>> 05:00.0 0200: 8086:10bc (rev 06)
>>>
>>> # ethtool -i eth0
>>> driver: e1000e
>>> version: 2.0.0-NAPI
>>> firmware-version: 5.10-2
>>> bus-info: :05:00.0
>>>
>>> # ethtool -k eth0
>>> Offload parameters for eth0:
>>> rx-checksumming: on
>>> tx-checksumming: on
>>> scatter-gather: on
>>> tcp segmentation offload: on
>>> udp fragmentation offload: off
>>> generic segmentation offload: on
>>> generic-receive-offload: on
>>>
>>> kernel log:
>>> ---
>>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>>   TDH  <6c>
>>>   TDT  <81>
>>>   next_to_use  <81>
>>>   next_to_clean<6b>
>>> buffer_info[next_to_clean]:
>>>   time_stamp   <fffc7a23>
>>>   next_to_watch<71>
>>>   jiffies  <fffc8c0c>
>>>   next_to_watch.status <0>
>>> MAC Status <80387>
>>> PHY Status <792d>
>>> PHY 1000BASE-T Status  <3c00>
>>> PHY Extended Status<3000>
>>> PCI Status <10>
>>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>>   TDH  <6c>
>>>   TDT  <81>
>>>   next_to_use  <81>
>>>   next_to_clean<6b>
>>> buffer_info[next_to_clean]:
>>>   time_stamp   <fffc7a23>
>>>   next_to_watch<71>
>>>   jiffies  <fffc9bac>
>>>   next_to_watch.status <0>
>>> MAC Status <80387>
>>> PHY Status <792d>

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N

>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>On Behalf Of Joe Jin
>Sent: Tuesday, July 10, 2012 12:40 AM
>To: Joe Jin
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>When I debug the driver I found before Detected HW hang, driver unable to
>clean and reclaim the resources:
>
>1457 while ((eop_desc->upper.data &
>cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== at here upper.data always is 0x300
>1458(count < tx_ring->count)) {
> <--- snip --->
>1487 }
>
>
>I checked all driver codes I did not found anywhere will set the
>upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?

Yes, upper.data (part of which is the STATUS byte) is set by HW. Basically, the driver 
checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set, 
HW has processed that descriptor and the driver can now clean it.
With the value 0x300, the DD bit is not set, which means HW has not processed that 
descriptor.
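
As a small illustration of that check, the Python sketch below tests the DD bit in a status word. The bit value 0x00000001 is the E1000_TXD_STAT_DD definition from the e1000e headers; the function name is hypothetical, and the byte-order conversion the driver performs with cpu_to_le32() is deliberately left out.

```python
# Sketch only: descriptor_done is an illustrative helper, not driver code.
E1000_TXD_STAT_DD = 0x00000001  # Descriptor Done bit in the TX status field


def descriptor_done(upper_data: int) -> bool:
    """Return True if hardware has set the Descriptor Done bit."""
    return bool(upper_data & E1000_TXD_STAT_DD)


if __name__ == "__main__":
    print(descriptor_done(0x300))  # value observed in the hang
    print(descriptor_done(0x301))  # same value with DD set
```

With 0x300 the low bit is clear, so the cleanup loop never advances past that descriptor; a value such as 0x301 would let the driver reclaim it.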

How quickly does the tx hang reproduce? I suggest you enable the debug code in the driver 
so that when a tx hang occurs it dumps the HW descriptor ring info into the kernel log.
You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
Once a tx hang occurs, please send me the full dmesg log.

Does the tx hang occur with the in-kernel e1000e driver too?

Thanks.

-Tushar


>If OS is 32bit system, what which happen?


>
>Thanks in advance,
>Joe
>
>On 07/09/12 16:51, Joe Jin wrote:
>> Hi list,
>>
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
>> just copy a big file (>500M) from another server will hit it at once.
>>
>> Would you please help on this?
>>
>> device info:
>> # lspci -s 05:00.0
>> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>> Ethernet Controller (Copper) (rev 06)
>>
>> # lspci -s 05:00.0 -n
>> 05:00.0 0200: 8086:10bc (rev 06)
>>
>> # ethtool -i eth0
>> driver: e1000e
>> version: 2.0.0-NAPI
>> firmware-version: 5.10-2
>> bus-info: :05:00.0
>>
>> # ethtool -k eth0
>> Offload parameters for eth0:
>> rx-checksumming: on
>> tx-checksumming: on
>> scatter-gather: on
>> tcp segmentation offload: on
>> udp fragmentation offload: off
>> generic segmentation offload: on
>> generic-receive-offload: on
>>
>> kernel log:
>> ---
>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>   TDH  <6c>
>>   TDT  <81>
>>   next_to_use  <81>
>>   next_to_clean<6b>
>> buffer_info[next_to_clean]:
>>   time_stamp   <fffc7a23>
>>   next_to_watch<71>
>>   jiffies  <fffc8c0c>
>>   next_to_watch.status <0>
>> MAC Status <80387>
>> PHY Status <792d>
>> PHY 1000BASE-T Status  <3c00>
>> PHY Extended Status<3000>
>> PCI Status <10>
>> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>>   TDH  <6c>
>>   TDT  <81>
>>   next_to_use  <81>
>>   next_to_clean<6b>
>> buffer_info[next_to_clean]:
>>   time_stamp   <fffc7a23>
>>   next_to_watch<71>
>>   jiffies  <fffc9bac>
>>   next_to_watch.status <0>
>> MAC Status <80387>
>> PHY Status <792d>
>> PHY 1000BASE-T Status  <3c00>
>> PHY Extended Status<3000>
>> PCI Status <10>
>> [ cut here ]
>> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
>> Hardware name: SUN FIRE X2270 M2 NETDEV WATCHDOG: eth0 (e1000e):
>> transmit queue 0 timed out Modules linked in: autofs4 hidp rfcomm
>> bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf
>> be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
>> ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3
>> mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc
>> acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U)
>> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
>> igb snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer
>> snd soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr
>> i7core_edac iTCO_vendor_support ioatdma gh

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Wyborny, Carolyn


>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>On Behalf Of Joe Jin
>Sent: Tuesday, July 10, 2012 12:40 AM
>To: Joe Jin
>Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>ker...@vger.kernel.org
>Subject: Re: 82571EB: Detected Hardware Unit Hang
[..]
>I checked all driver codes I did not found anywhere will set the
>upper.data with
>E1000_TXD_STAT_DD, I guess upper.data be set by hardware?

Yes, the hw sets this bit after transmit; it's how the driver knows to reclaim 
it.

>> Would you please help on this?

Yes, we'll attempt to reproduce it and get back to you with any more needed 
info.  I'm sorry you're having problems with our parts.

Thanks for the report,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

When I debugged the driver, I found that before the hang is detected, the
driver is unable to clean and reclaim the resources:

1457 while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&  
<== here upper.data is always 0x300
1458(count < tx_ring->count)) {
 <--- snip --->
1487 }


I checked all of the driver code and did not find anywhere that sets
upper.data with E1000_TXD_STAT_DD. I guess upper.data is set by hardware?
What would happen if the OS is a 32-bit system?

Thanks in advance,
Joe 

On 07/09/12 16:51, Joe Jin wrote:
> Hi list,
> 
> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
> scp test. this issue is easy do reproduced on SUN FIRE X2270 M2, just copy
> a big file (>500M) from another server will hit it at once. 
> 
> Would you please help on this?
> 
> device info:
> # lspci -s 05:00.0 
> 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet 
> Controller (Copper) (rev 06)
> 
> # lspci -s 05:00.0 -n
> 05:00.0 0200: 8086:10bc (rev 06)
> 
> # ethtool -i eth0
> driver: e1000e
> version: 2.0.0-NAPI
> firmware-version: 5.10-2
> bus-info: :05:00.0
> 
> # ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: on
> generic-receive-offload: on
> 
> kernel log:
> ---
> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>   TDH  <6c>
>   TDT  <81>
>   next_to_use  <81>
>   next_to_clean<6b>
> buffer_info[next_to_clean]:
>   time_stamp   <fffc7a23>
>   next_to_watch<71>
>   jiffies  <fffc8c0c>
>   next_to_watch.status <0>
> MAC Status <80387>
> PHY Status <792d>
> PHY 1000BASE-T Status  <3c00>
> PHY Extended Status<3000>
> PCI Status <10>
> e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
>   TDH  <6c>
>   TDT  <81>
>   next_to_use  <81>
>   next_to_clean<6b>
> buffer_info[next_to_clean]:
>   time_stamp   <fffc7a23>
>   next_to_watch<71>
>   jiffies  <fffc9bac>
>   next_to_watch.status <0>
> MAC Status <80387>
> PHY Status <792d>
> PHY 1000BASE-T Status  <3c00>
> PHY Extended Status<3000>
> PCI Status <10>
> [ cut here ]
> WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
> Hardware name: SUN FIRE X2270 M2
> NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
> Modules linked in: autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc 
> cpufreq_ondemand acpi_cpufreq mperf be2iscsi iscsi_boot_sysfs ib_iser rdma_cm 
> ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i 
> libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs 
> sbshc acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U) 
> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device igb 
> snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer snd 
> soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr i7core_edac 
> iTCO_vendor_support ioatdma ghes dca edac_core hed dm_snapshot dm_zero 
> dm_mirror dm_region_hash dm_log dm_mod usb_storage sd_mod crc_t10dif sg ahci 
> libahci ext3 jbd mbcache [last unloaded: microcode]
> Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
> Call Trace:
>  [<c07d9ac5>] ? dev_watchdog+0x225/0x230
>  [<c045ba61>] warn_slowpath_common+0x81/0xa0
>  [<c07d9ac5>] ? dev_watchdog+0x225/0x230
>  [<c045bb23>] warn_slowpath_fmt+0x33/0x40
>  [<c07d9ac5>] dev_watchdog+0x225/0x230
>  [<c07d98a0>] ? dev_activate+0xb0/0xb0
>  [<c0468e82>] call_timer_fn+0x32/0xf0
>  [<c04bceb0>] ? rcu_check_callbacks+0x80/0x80
>  [<c046a76d>] run_timer_softirq+0xed/0x1b0
>  [<c07d98a0>] ? dev_activate+0xb0/0xb0
>  [<c0461a81>] __do_softirq+0x91/0x1a0
>  [<c04619f0>] ? local_bh_enable+0x80/0x80
>  <IRQ>  [<c0462295>] ? irq_exit+0x95/0xa0
>  [<c087f8b8>] ? smp_apic_timer_interrupt+0x38/0x42
>  [<c08784f5>] ? apic_timer_interrupt+0x31/0x38
>  [<c046007b>] ? do_exit+0x11b/0x370
>  [<c065eae4>] ? intel_idle+0xa4/0x100
>  [<c078d9b9>] ? cpuidle_idle_call+0xb9/0x1e0
>  [<c0411d77>] ? cpu_idle+0x97/0xd0
>  [<c085cbbd>] ? rest_init+0x5d/0x70
>  [<c0b07a7a>] ? start_kernel+0x28a/0x340
>  [<c0b074b0>] ? obsolete_checksetup+0xb0/0xb0
>  [<c0b070a4>] ? i386_start_kernel+0x64/0xb0
> ---[ end trace 5502b55cd4d4e5cb ]---
> e1000e :05:00.0: eth0: Reset adapter
> e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> 
> Thanks,
> Joe
> 


-- 
Oracle 
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing 



RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N

-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Joe Jin
Sent: Tuesday, July 10, 2012 12:40 AM
To: Joe Jin
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

When I debug the driver I found before Detected HW hang, driver unable to
clean and reclaim the resources:

1457 while ((eop_desc->upper.data &
cpu_to_le32(E1000_TXD_STAT_DD)) &&  <== at here upper.data always is 0x300
1458(count < tx_ring->count)) {
 <--- snip --->
1487 }


I checked all driver codes I did not found anywhere will set the
upper.data with E1000_TXD_STAT_DD, I guess upper.data be set by hardware?

Yes upper.data (part of it is STATUS byte) is set by HW. Basically driver 
checks E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is set that means 
HW has processed that descriptor and driver can now clean that descriptor.
With value 0x300 , DD bit is not set. That means HW has not processed that 
descriptor.

How fast does tx hang reproduce? I suggest you to enable debug code in driver 
so when tx hang occurs it will dump the HW desc ring info into kernel log.
You can run "ethtool -s ethx msglvl 0x2c00" to enable debug.
Once tx hang occurs please send me the full dmesg log.

Does tx hang occur with in-kernel e1000e driver too?

Thanks.

-Tushar


If OS is 32bit system, what which happen?



Thanks in advance,
Joe

On 07/09/12 16:51, Joe Jin wrote:
 Hi list,

 I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
 doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
 just copy a big file (500M) from another server will hit it at once.

 Would you please help on this?

 device info:
 # lspci -s 05:00.0
 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
 Ethernet Controller (Copper) (rev 06)

 # lspci -s 05:00.0 -n
 05:00.0 0200: 8086:10bc (rev 06)

 # ethtool -i eth0
 driver: e1000e
 version: 2.0.0-NAPI
 firmware-version: 5.10-2
 bus-info: :05:00.0

 # ethtool -k eth0
 Offload parameters for eth0:
 rx-checksumming: on
 tx-checksumming: on
 scatter-gather: on
 tcp segmentation offload: on
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: on

 kernel log:
 ---
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH  6c
   TDT  81
   next_to_use  81
   next_to_clean6b
 buffer_info[next_to_clean]:
   time_stamp   fffc7a23
   next_to_watch71
   jiffies  fffc8c0c
   next_to_watch.status 0
 MAC Status 80387
 PHY Status 792d
 PHY 1000BASE-T Status  3c00
 PHY Extended Status3000
 PCI Status 10
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH  6c
   TDT  81
   next_to_use  81
   next_to_clean6b
 buffer_info[next_to_clean]:
   time_stamp   fffc7a23
   next_to_watch71
   jiffies  fffc9bac
   next_to_watch.status 0
 MAC Status 80387
 PHY Status 792d
 PHY 1000BASE-T Status  3c00
 PHY Extended Status3000
 PCI Status 10
 [ cut here ]
 WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
 Hardware name: SUN FIRE X2270 M2 NETDEV WATCHDOG: eth0 (e1000e):
 transmit queue 0 timed out Modules linked in: autofs4 hidp rfcomm
 bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf
 be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
 ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3
 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc
 acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U)
 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
 igb snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon snd_timer
 snd soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core pcspkr
 i7core_edac iTCO_vendor_support ioatdma ghes dca edac_core hed
 dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage
 sd_mod crc_t10dif sg ahci libahci ext3 jbd mbcache [last unloaded:
 microcode]
 Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1 Call
 Trace:
  [c07d9ac5] ? dev_watchdog+0x225/0x230  [c045ba61]
 warn_slowpath_common+0x81/0xa0  [c07d9ac5] ?
 dev_watchdog+0x225/0x230  [c045bb23] warn_slowpath_fmt+0x33/0x40
 [c07d9ac5] dev_watchdog+0x225/0x230  [c07d98a0] ?
 dev_activate+0xb0/0xb0  [c0468e82] call_timer_fn+0x32/0xf0
 [c04bceb0] ? rcu_check_callbacks+0x80/0x80  [c046a76d]
 run_timer_softirq+0xed/0x1b0  [c07d98a0] ? dev_activate+0xb0/0xb0
 [c0461a81] __do_softirq+0x91/0x1a0  [c04619f0] ?
 local_bh_enable+0x80/0x80  IRQ  [c0462295] ? irq_exit+0x95/0xa0
 [c087f8b8] ? smp_apic_timer_interrupt+0x38/0x42
  [c08784f5] ? apic_timer_interrupt+0x31/0x38  [c046007b] ?
 do_exit+0x11b/0x370  [c065eae4] ? intel_idle+0xa4/0x100

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N


-Original Message-
From: Dave, Tushar N
Sent: Tuesday, July 10, 2012 12:02 PM
To: Joe Jin
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org; Dave, Tushar N
Subject: RE: 82571EB: Detected Hardware Unit Hang

-Original Message-
From: netdev-ow...@vger.kernel.org
[mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Joe Jin
Sent: Tuesday, July 10, 2012 12:40 AM
To: Joe Jin
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

While debugging the driver, I found that before the "Detected Hardware
Unit Hang" message the driver is unable to clean and reclaim resources:

1457 while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&   <== here upper.data is always 0x300
1458        (count < tx_ring->count)) {
 --- snip ---
1487 }


I checked the whole driver and did not find any place that sets
E1000_TXD_STAT_DD in upper.data. I guess upper.data is set by hardware?

Yes, upper.data (part of it is the STATUS byte) is set by HW. Basically the
driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is
set, HW has processed that descriptor and the driver can now clean it.
With the value 0x300, the DD bit is not set, which means HW has not
processed that descriptor.
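
To illustrate the check described above, here is a minimal C sketch. It
assumes upper.data has already been converted to CPU byte order.
E1000_TXD_STAT_DD has the value 0x1 in the e1000e headers; tx_desc_done()
and reclaimable() are hypothetical helpers written for this note, not
driver functions, and the loop is a simplified model (the real clean loop
tests only the end-of-packet descriptor of each frame).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descriptor Done bit in the Tx descriptor status byte. */
#define E1000_TXD_STAT_DD 0x00000001u

/* Hypothetical helper: nonzero once hardware has written back DD.
 * upper_data is assumed to be in CPU byte order already. */
static int tx_desc_done(uint32_t upper_data)
{
    return (upper_data & E1000_TXD_STAT_DD) != 0;
}

/* Hypothetical, simplified model of the clean loop: count how many
 * descriptors, starting at index 0, could be reclaimed before reaching
 * one that hardware has not marked as done. */
static size_t reclaimable(const uint32_t *upper_data, size_t ring_count)
{
    size_t count = 0;

    assert(upper_data != NULL);
    while (count < ring_count && tx_desc_done(upper_data[count]))
        count++;
    return count;
}
```

With the value reported in this thread, tx_desc_done(0x300) is false, so a
loop of this shape stops at that descriptor and reclaims nothing beyond it.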

How quickly does the tx hang reproduce? I suggest enabling the debug code
in the driver so that when the tx hang occurs it dumps the HW descriptor
ring info into the kernel log.
You can run 'ethtool -s ethx msglvl 0x2c00' to enable debug.
Once the tx hang occurs, please send me the full dmesg log.

Does the tx hang occur with the in-kernel e1000e driver too?

Thanks.

-Tushar
One change: please use 'ethtool -s ethx msglvl 0x2c01' to keep the default
'drv' msglvl enabled.
Confirm that the message level is set correctly by running 'ethtool ethx'.
The last few lines will be:

Current message level: 0x2c01 (11265)
   drv tx_done rx_status hw
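
As a cross-check of the message level above, the following C sketch decodes
a msglvl bitmask into flag names. The bit values follow the NETIF_MSG_*
definitions in the kernel's netdevice.h and the names follow ethtool's
output; decode_msglvl() and the msg_flags table are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg_flag {
    uint32_t bit;
    const char *name;
};

/* NETIF_MSG_* bits, lowest bit first, with the names ethtool prints. */
static const struct msg_flag msg_flags[] = {
    { 0x0001, "drv" },     { 0x0002, "probe" },     { 0x0004, "link" },
    { 0x0008, "timer" },   { 0x0010, "ifdown" },    { 0x0020, "ifup" },
    { 0x0040, "rx_err" },  { 0x0080, "tx_err" },    { 0x0100, "tx_queued" },
    { 0x0200, "intr" },    { 0x0400, "tx_done" },   { 0x0800, "rx_status" },
    { 0x1000, "pktdata" }, { 0x2000, "hw" },        { 0x4000, "wol" },
};

/* Hypothetical helper: write the space-separated names of all flags set
 * in level into buf (len bytes, always NUL-terminated). */
static void decode_msglvl(uint32_t level, char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(msg_flags) / sizeof(msg_flags[0]); i++) {
        if (!(level & msg_flags[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, msg_flags[i].name, len - strlen(buf) - 1);
    }
}
```

Decoding 0x2c01 this way yields the four flags listed above; 0x2c00 yields
the same set without 'drv', which is why the second value was suggested.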




If the OS is a 32-bit system, what would happen?



Thanks in advance,
Joe

On 07/09/12 16:51, Joe Jin wrote:
 Hi list,

 I'm seeing a Unit Hang even with the latest e1000e driver (2.0.0) when
 running an scp test. The issue is easy to reproduce on a SUN FIRE X2270
 M2: just copying a big file (>500M) from another server hits it at once.

 Would you please help with this?

 device info:
 # lspci -s 05:00.0
 05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
 Ethernet Controller (Copper) (rev 06)

 # lspci -s 05:00.0 -n
 05:00.0 0200: 8086:10bc (rev 06)

 # ethtool -i eth0
 driver: e1000e
 version: 2.0.0-NAPI
 firmware-version: 5.10-2
 bus-info: :05:00.0

 # ethtool -k eth0
 Offload parameters for eth0:
 rx-checksumming: on
 tx-checksumming: on
 scatter-gather: on
 tcp segmentation offload: on
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: on

 kernel log:
 ---
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH                  <6c>
   TDT                  <81>
   next_to_use          <81>
   next_to_clean        <6b>
 buffer_info[next_to_clean]:
   time_stamp           <fffc7a23>
   next_to_watch        <71>
   jiffies              <fffc8c0c>
   next_to_watch.status <0>
 MAC Status             <80387>
 PHY Status             <792d>
 PHY 1000BASE-T Status  <3c00>
 PHY Extended Status    <3000>
 PCI Status             <10>
 e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
   TDH                  <6c>
   TDT                  <81>
   next_to_use          <81>
   next_to_clean        <6b>
 buffer_info[next_to_clean]:
   time_stamp           <fffc7a23>
   next_to_watch        <71>
   jiffies              <fffc9bac>
   next_to_watch.status <0>
 MAC Status             <80387>
 PHY Status             <792d>
 PHY 1000BASE-T Status  <3c00>
 PHY Extended Status    <3000>
 PCI Status             <10>
 [ cut here ]
 WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
 Hardware name: SUN FIRE X2270 M2 NETDEV WATCHDOG: eth0 (e1000e):
 transmit queue 0 timed out Modules linked in: autofs4 hidp rfcomm
 bluetooth rfkill lockd sunrpc cpufreq_ondemand acpi_cpufreq mperf
 be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
 ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3
 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc
 acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U)
 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
 igb snd_pcm_oss serio_raw snd_mixer_oss snd_pcm tpm_infineon
 snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_wdt i2c_core
 pcspkr i7core_edac iTCO_vendor_support ioatdma ghes dca edac_core hed
 dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
 usb_storage sd_mod crc_t10dif sg ahci libahci ext3 jbd mbcache [last
unloaded:
 microcode]
 Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1 Call
 Trace:
  [c07d9ac5] ? dev_watchdog+0x225/0x230  [c045ba61]
 warn_slowpath_common+0x81/0xa0  [c07d9ac5] ?
 dev_watchdog+0x225/0x230  [c045bb23

Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 03:02, Dave, Tushar N wrote:
>> -Original Message-
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of Joe Jin
>> Sent: Tuesday, July 10, 2012 12:40 AM
>> To: Joe Jin
>> Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> While debugging the driver, I found that before the "Detected Hardware
>> Unit Hang" message the driver is unable to clean and reclaim resources:
>>
>> 1457 while ((eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) &&   <== here upper.data is always 0x300
>> 1458        (count < tx_ring->count)) {
>>  --- snip ---
>> 1487 }
>>
>>
>> I checked the whole driver and did not find any place that sets
>> E1000_TXD_STAT_DD in upper.data. I guess upper.data is set by hardware?
>
> Yes, upper.data (part of it is the STATUS byte) is set by HW. Basically the
> driver checks the E1000_TXD_STAT_DD (Descriptor Done) bit. If this bit is
> set, HW has processed that descriptor and the driver can now clean it.
> With the value 0x300, the DD bit is not set, which means HW has not
> processed that descriptor.

Thanks for the clarification. Might this be a firmware issue?

> How quickly does the tx hang reproduce? I suggest enabling the debug code
> in the driver so that when the tx hang occurs it dumps the HW descriptor
> ring info into the kernel log.

As soon as I copy a file from another server, the issue reproduces.
I'll enable the debug output to get more information.

> You can run 'ethtool -s ethx msglvl 0x2c00' to enable debug.
> Once the tx hang occurs, please send me the full dmesg log.
>
> Does the tx hang occur with the in-kernel e1000e driver too?

I tried several drivers, including the latest RHEL5, the latest Intel, and
the latest RHEL6 drivers; the issue is seen with all of them.

Thanks,
Joe 
 

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N


-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 5:35 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

Also, after the issue occurs, please capture the output of 'lspci -vvv' (run as root).



Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 11:22, Dave, Tushar N wrote:
> Thanks for the info. I see that the hang occurs right when HW is processing
> the first TX descriptor with TSO.
> Can you reproduce the issue with TSO off? Disable TSO with
> 'ethtool -K ethx tso off'.
> Leave all debug enabled as it is; that will help us debug further if the
> issue occurs with TSO off.

Hi Tushar,

Thanks for your quick reply, but disabling TSO does not help with this issue:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: on

Kernel log after disabling TSO:

e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <1>
  TDT                  <4>
  next_to_use          <4>
  next_to_clean        <1>
buffer_info[next_to_clean]:
  time_stamp           <103ae0aba>
  next_to_watch        <1>
  jiffies              <103ae16a0>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <1>
  TDT                  <4>
  next_to_use          <4>
  next_to_clean        <1>
buffer_info[next_to_clean]:
  time_stamp           <103ae0aba>
  next_to_watch        <1>
  jiffies              <103ae2640>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e :05:00.0: Net device Info
e1000e: Device Name     state            trans_start      last_rx
e1000e: eth0            0003 000103AE128A 
e1000e :05:00.0: Register Dump
e1000e:  Register Name   Value
e1000e: CTRL            180c0241
e1000e: STATUS          00080387
e1000e: CTRL_EXT        181400c0
e1000e: ICR             0040
e1000e: RCTL            04048002
e1000e: RDLEN           1000
e1000e: RDH             0090
e1000e: RDT             0080
e1000e: RDTR            0020
e1000e: RXDCTL[0-1]     01040420 01040420
e1000e: ERT 
e1000e: RDBAL           23852000
e1000e: RDBAH           000c
e1000e: RDFH            075a
e1000e: RDFT            0752
e1000e: RDFHS           0758
e1000e: RDFTS           0752
e1000e: RDFPC           01b4
e1000e: TCTL            3003f00a
e1000e: TDBAL           1210c000
e1000e: TDBAH           000c
e1000e: TDLEN           1000
e1000e: TDH             0001
e1000e: TDT             0004
e1000e: TIDV            0008
e1000e: TXDCTL[0-1]     0145011f 0145011f
e1000e: TADV            0020
e1000e: TARC[0-1]       07a00403 07400403
e1000e: TDFH            1308
e1000e: TDFT            1308
e1000e: TDFHS           1308
e1000e: TDFTS           1308
e1000e: TDFPC   
e1000e :05:00.0: Tx Ring Summary
e1000e: Queue [NTU] [NTC] [bi(ntc)-dma  ] leng ntw timestamp
e1000e:  0 4 1 000620800C02 002A   1 000103AE0ABA
e1000e :05:00.0: Tx Ring Dump
e1000e: Tl[desc] [address 63:0  ] [SpeCssSCmCsLen] [bi-dma   ] leng  
ntw timestampbi-skb -- Legacy format
e1000e: Tc[desc] [Ce CoCsIpceCoS] [MssHlRSCm0Plen] [bi-dma   ] leng  
ntw timestampbi-skb -- Ext Context format
e1000e: Td[desc] [address 63:0  ] [VlaPoRSCm1Dlen] [bi-dma   ] leng  
ntw timestampbi-skb -- Ext Data format
e1000e: Tl[0x000]000C1AA0F002 8B2A  002A
0  (null)
e1000e: Tl[0x001]000620800C02 8B2A 000620800C02 002A
1 000103AE0ABA 88061c6b6980 NTC
e1000e: Tl[0x002]00061E6DBC02 8B2A 00061E6DBC02 002A
2 000103AE0EA2 88061c6b6880
e1000e: Tl[0x003]000620A6C402 8B2A 000620A6C402 002A
3 000103AE128A 8806230b4080
e1000e: Tl[0x004]   
0  (null) NTU
e1000e: Tl[0x005]   
0  (null)
e1000e: Tl[0x006]   
0  (null)
e1000e: Tl[0x007]   
0  (null)
e1000e: Tl[0x008]   
0  (null)
e1000e: Tl[0x009]   
0  (null)
e1000e: Tl[0x00A]   
0  (null)
e1000e: Tl[0x00B]   
0  (null)
e1000e: Tl[0x00C]   
0  (null)
e1000e: Tl[0x00D]

RE: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Dave, Tushar N

-Original Message-
From: Joe Jin [mailto:joe@oracle.com]
Sent: Tuesday, July 10, 2012 8:29 PM
To: Dave, Tushar N
Cc: e1000-de...@lists.sf.net; net...@vger.kernel.org; linux-
ker...@vger.kernel.org
Subject: Re: 82571EB: Detected Hardware Unit Hang

On 07/11/12 11:22, Dave, Tushar N wrote:
> Thanks for the info. I see that the hang occurs right when HW is processing
> the first TX descriptor with TSO.
> Can you reproduce the issue with TSO off? Disable TSO with
> 'ethtool -K ethx tso off'.
> Leave all debug enabled as it is; that will help us debug further if the
> issue occurs with TSO off.

Hi Tushar,

Thanks for your quick reply, but disabling TSO does not help with this issue:

Thanks for running a quick test. I don't see anything obviously wrong in the
descriptor dump.

When you said you had this issue with the RHEL5 and RHEL6 drivers, did you
install the RHEL5/6 kernels and reproduce it there? If so, I think I should
install RHEL6 and try to reproduce it locally!

-Tushar
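
For readers following the numbers in the TSO-off hang report and register
dump, the C sketch below shows one way to read them. It assumes the TDLEN
value printed as 1000 is hexadecimal (0x1000 bytes) and that each Tx
descriptor occupies 16 bytes; ring_count(), pending_descs() and
jiffies_since() are hypothetical helpers for this note, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size in bytes of one Tx descriptor. */
#define TX_DESC_SIZE 16u

/* Hypothetical helper: number of descriptors in a ring of tdlen bytes. */
static uint32_t ring_count(uint32_t tdlen)
{
    return tdlen / TX_DESC_SIZE;
}

/* Hypothetical helper: descriptors queued by the driver (up to the tail,
 * TDT) that hardware has not yet consumed (from the head, TDH), allowing
 * for the tail having wrapped past the end of the ring. */
static uint32_t pending_descs(uint32_t tdh, uint32_t tdt, uint32_t count)
{
    assert(count > 0 && tdh < count && tdt < count);
    return (tdt + count - tdh) % count;
}

/* Hypothetical helper: jiffies elapsed between the buffer's time_stamp
 * and the jiffies value printed in the report. */
static uint64_t jiffies_since(uint64_t now, uint64_t stamp)
{
    return now - stamp;
}
```

Under those assumptions the ring holds 256 descriptors, TDH 1 and TDT 4
leave three descriptors pending, and the first report was printed 3046
jiffies after the stuck buffer was queued. Converting that to seconds needs
the kernel's HZ value, which is not stated in this thread.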




Re: 82571EB: Detected Hardware Unit Hang

2012-07-10 Thread Joe Jin

On 07/11/12 12:05, Dave, Tushar N wrote:
> When you said you had this issue with the RHEL5 and RHEL6 drivers, did you
> install the RHEL5/6 kernels and reproduce it there? If so, I think I should
> install RHEL6 and try to reproduce it locally!

Yes, I reproduced this on both RHEL5 and RHEL6.

So far, every attempt to scp a big file (~1GB) hits it at once.

Thanks,
Joe
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 82571EB: Detected Hardware Unit Hang

2012-07-09 Thread Eric Dumazet

On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
> Hi list,
> 
> I'm seeing a Unit Hang even with the latest e1000e driver (2.0.0) when
> running an scp test. The issue is easy to reproduce on a SUN FIRE X2270 M2:
> just copying a big file (>500M) from another server hits it at once.
> 
> Would you please help with this?
> 

It's a known problem.

But apparently the Intel guys are not very responsive, as they have a
different patch from the following one:

http://permalink.gmane.org/gmane.linux.network/232669


We just have to wait until they push their alternative patch, eventually.

In the meantime, you can use Hiroaki SHIMODA's patch; it works.





Re: 82571EB: Detected Hardware Unit Hang

2012-07-09 Thread Joe Jin

On 07/09/12 17:21, Eric Dumazet wrote:
> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> Hi list,
>>
>> I'm seeing a Unit Hang even with the latest e1000e driver (2.0.0) when
>> running an scp test. The issue is easy to reproduce on a SUN FIRE X2270 M2:
>> just copying a big file (>500M) from another server hits it at once.
>>
>> Would you please help with this?
>>
>
> It's a known problem.
>
> But apparently the Intel guys are not very responsive, as they have a
> different patch from the following one:
>
> http://permalink.gmane.org/gmane.linux.network/232669

Eric,

Thanks for your reply, but this patch does not seem to help in my case;
with the patch applied I still hit the issue:

# dmesg
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd5344>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd5b14>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd62e4>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
e1000e :05:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <7e>
  next_to_use          <7e>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <fffd48dc>
  next_to_watch        <74>
  jiffies              <fffd6ab4>
  next_to_watch.status <0>
MAC Status             <80387>
PHY Status             <792d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
[ cut here ]
WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x225/0x230()
Hardware name: SUN FIRE X2270 M2
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc 
cpufreq_ondemand acpi_cpufreq mperf be2iscsi iscsi_boot_sysfs ib_iser rdma_cm 
ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i 
libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc 
acpi_pad acpi_ipmi ipmi_msghandler parport_pc lp parport e1000e(U) 
snd_seq_dummy snd_seq_oss snd_seq_midi_event igb snd_seq snd_seq_device 
serio_raw snd_pcm_oss snd_mixer_oss snd_pcm tpm_infineon snd_timer snd 
soundcore i7core_edac iTCO_wdt iTCO_vendor_support snd_page_alloc edac_core 
i2c_i801 ioatdma i2c_core pcspkr ghes dca hed dm_snapshot dm_zero dm_mirror 
dm_region_hash dm_log dm_mod usb_storage sd_mod crc_t10dif sg ahci libahci ext3 
jbd mbcache [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.39-200.24.1.el5uek #1
Call Trace:
 [c07d9ac5] ? dev_watchdog+0x225/0x230
 [c045ba61] warn_slowpath_common+0x81/0xa0
 [c07d9ac5] ? dev_watchdog+0x225/0x230
 [c045bb23] warn_slowpath_fmt+0x33/0x40
 [c07d9ac5] dev_watchdog+0x225/0x230
 [c07d98a0] ? dev_activate+0xb0/0xb0
 [c0468e82] call_timer_fn+0x32/0xf0
 [c046a76d] run_timer_softirq+0xed/0x1b0
 [c07d98a0] ? dev_activate+0xb0/0xb0
 [c0461a81] __do_softirq+0x91/0x1a0
 [c04619f0] ? local_bh_enable+0x80/0x80
 IRQ  [c0462295] ? irq_exit+0x95/0xa0
 [c087f8b8] ? smp_apic_timer_interrupt+0x38/0x42
 [c08784f5] ? apic_timer_interrupt+0x31/0x38
 [c046007b] ? do_exit+0x11b/0x370
 [c065eae4] ? intel_idle+0xa4/0x100
 [c078d9b9] ? cpuidle_idle_call+0xb9/0x1e0
 [c0411d77] ? cpu_idle+0x97/0xd0
 [c085cbbd] ? rest_init+0x5d/0x70
 [c0b07a7a] ? start_kernel+0x28a/0x340
 [c0b074b0] ? obsolete_checksetup+0xb0/0xb0
 [c0b070a4] ? i386_start_kernel+0x64/0xb0
---[ end trace 5d51553c2ad66677 ]---
e1000e :05:00.0: eth0: Reset adapter
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Any idea?

Thanks,
Joe

 
 
> We just have to wait until they push their alternative patch, eventually.
>
> In the meantime, you can use Hiroaki SHIMODA's patch; it works.
 
 
 

