Re: bnxt: card intermittently hanging and dropping link

2018-08-19 Thread Daniel Axtens
Hi Michael,

>>> The main issue is the TX timeout.
>>> .
>>>
>>>> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
>>>> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>>> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>>> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>>> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>>
>>> and it is not recovering.
>>>
>>> Please provide ethtool -i eth4 which will show the firmware version on
>>> the NIC.  Let's see if the firmware is too old.
>>
>> driver: bnxt_en
>> version: 1.8.0
>> firmware-version: 20.6.151.0/pkg 20.06.05.11
>
> I believe the firmware should be updated.  My colleague will contact
> you on how to proceed.

Thank you very much, I'll follow up with them off-list.

Regards,
Daniel


Re: bnxt: card intermittently hanging and dropping link

2018-08-16 Thread Michael Chan
On Thu, Aug 16, 2018 at 2:09 AM, Daniel Axtens wrote:
> Hi Michael,
>
>> The main issue is the TX timeout.
>> .
>>
>>> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
>>> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>>> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>>
>> and it is not recovering.
>>
>> Please provide ethtool -i eth4 which will show the firmware version on
>> the NIC.  Let's see if the firmware is too old.
>
> driver: bnxt_en
> version: 1.8.0
> firmware-version: 20.6.151.0/pkg 20.06.05.11

I believe the firmware should be updated.  My colleague will contact
you on how to proceed.

Thanks.


Re: bnxt: card intermittently hanging and dropping link

2018-08-16 Thread Daniel Axtens
Hi Michael,

> The main issue is the TX timeout.
> .
>
>> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
>> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
>> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
>
> and it is not recovering.
>
> Please provide ethtool -i eth4 which will show the firmware version on
> the NIC.  Let's see if the firmware is too old.

driver: bnxt_en
version: 1.8.0
firmware-version: 20.6.151.0/pkg 20.06.05.11
expansion-rom-version: 
bus-info: :3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Thanks!

Regards,
Daniel
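
For anyone collecting the same information from several hosts, the
`ethtool -i` output above is plain "key: value" text and is easy to parse.
The following is only a minimal sketch: the function name parse_ethtool_info
is hypothetical, and the sample text is copied from the output in this
message rather than read from a live NIC.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style "key: value" lines into a dict.

    Only the first colon on each line separates key from value, so
    values that themselves contain colons (such as bus-info) survive.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that carry no "key: value" pair
            info[key.strip()] = value.strip()
    return info


# Sample copied from the ethtool -i output quoted in this thread.
SAMPLE = """driver: bnxt_en
version: 1.8.0
firmware-version: 20.6.151.0/pkg 20.06.05.11
expansion-rom-version: 
bus-info: :3b:00.0
"""

info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["version"], info["firmware-version"])
```

On a real system the text could come from running `ethtool -i eth4` (for
example through subprocess.run) instead of the SAMPLE string.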


Re: bnxt: card intermittently hanging and dropping link

2018-08-16 Thread Michael Chan
On Wed, Aug 15, 2018 at 10:29 PM, Daniel Axtens wrote:

> [ 2682.911295] ------------[ cut here ]------------
> [ 2682.911319] NETDEV WATCHDOG: eth4 (bnxt_en): transmit queue 0 timed out

The main issue is the TX timeout.
.

> [ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
> [ 2683.782496] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
> [ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
> [ 2684.634557] bnxt_en :3b:00.0 eth4: Resp cmpl intr err msg: 0x51
> [ 2684.635120] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1

and it is not recovering.

Please provide ethtool -i eth4 which will show the firmware version on
the NIC.  Let's see if the firmware is too old.

Thanks.


bnxt: card intermittently hanging and dropping link

2018-08-15 Thread Daniel Axtens
Hi Michael,

I have some user reports of a Broadcom 57412 card intermittently
hanging and dropping the link.

The problem has been observed on a Dell server with an Ubuntu 4.13
kernel (bnxt_en version 1.7.0) and with an Ubuntu 4.15 kernel (bnxt_en
version 1.8.0). It seems to occur while mounting an XFS volume over
iSCSI, although running blkid on the partition succeeds, and it seems to
be a different volume each time.

I've included an excerpt from the kernel log below.

It seems that other people have reported this with a RHEL7 kernel - I
have no idea what driver version that would be running, or what workload
they were operating, but the warnings and messages look the same:
https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769/highlight/true
The forum poster reported that disconnecting the card from power for 5
minutes was sufficient to get things working again and I have asked our
user to test that.

Is this a known issue?

Regards,
Daniel

[ 2662.526151] scsi host14: iSCSI Initiator over TCP/IP
[ 2662.538110] scsi host15: iSCSI Initiator over TCP/IP
[ 2662.547350] scsi host16: iSCSI Initiator over TCP/IP
[ 2662.554660] scsi host17: iSCSI Initiator over TCP/IP
[ 2662.813860] scsi 15:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813888] scsi 14:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813972] scsi 16:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814322] sd 15:0:0:1: Attached scsi generic sg1 type 0
[ 2662.814553] sd 14:0:0:1: Attached scsi generic sg2 type 0
[ 2662.814554] scsi 17:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814612] sd 16:0:0:1: Attached scsi generic sg3 type 0
[ 2662.815081] sd 15:0:0:1: [sdb] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815195] sd 15:0:0:1: [sdb] Write Protect is off
[ 2662.815197] sd 15:0:0:1: [sdb] Mode Sense: 43 00 00 08
[ 2662.815229] sd 14:0:0:1: [sdc] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815292] sd 17:0:0:1: Attached scsi generic sg4 type 0
[ 2662.815342] sd 14:0:0:1: [sdc] Write Protect is off
[ 2662.815343] sd 14:0:0:1: [sdc] Mode Sense: 43 00 00 08
[ 2662.815419] sd 15:0:0:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815447] sd 16:0:0:1: [sdd] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815544] sd 14:0:0:1: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815614] sd 16:0:0:1: [sdd] Write Protect is off
[ 2662.815615] sd 16:0:0:1: [sdd] Mode Sense: 43 00 00 08
[ 2662.815882] sd 16:0:0:1: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.816188] sd 17:0:0:1: [sde] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.816298] sd 17:0:0:1: [sde] Write Protect is off
[ 2662.816300] sd 17:0:0:1: [sde] Mode Sense: 43 00 00 08
[ 2662.816502] sd 17:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.820080] sd 15:0:0:1: [sdb] Attached SCSI disk
[ 2662.820594] sd 14:0:0:1: [sdc] Attached SCSI disk
[ 2662.820995] sd 17:0:0:1: [sde] Attached SCSI disk
[ 2662.821176] sd 16:0:0:1: [sdd] Attached SCSI disk
[ 2662.913642] device-mapper: multipath round-robin: version 1.2.0 loaded
[ 2663.954001] XFS (dm-2): Mounting V5 Filesystem
[ 2673.186083]  connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186135]  connection3:0: detected conn error (1022)
[ 2673.186137]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186168]  connection2:0: detected conn error (1022)
[ 2673.186170]  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186211]  connection1:0: detected conn error (1022)
[ 2674.209870]  connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558463, last ping 4295559720, now 4295561024
[ 2674.209924]  connection4:0: detected conn error (1022)
[ 2678.560630]  session1: session recovery timed out after 5 secs
[ 2678.560641]  session2: session recovery timed out after 5 secs
[ 2678.560647]  session3: session recovery timed out after 5 secs
[ 2678.951453] device-mapper: multipath: Failing path 8:32.
[ 2678.951509] device-mapper: multipath: Failing path 8:48.
[ 2678.951548] device-mapper: multipath: Failing path 8:16.
[ 2679.584302]  session4: session recovery timed out after 5 secs
[ 2679.584313] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584356] sd 17:0:0:1: [sde] killing request
[ 2679.584362] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584392] sd 17:0:0:1: [sde] killing request
[ 2679.584401] sd 17:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT
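
To compare the order of events across affected hosts, a log like the one
above can be reduced to a short timeline. This is only a rough sketch: the
category names and the classify function are hypothetical, and the patterns
cover just the messages that appear in this thread, not everything bnxt_en
or open-iscsi can log.

```python
import re

# "[ 2673.186083] message" -> timestamp in seconds, message text
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Hypothetical labels mapped to message fragments seen in this thread.
PATTERNS = {
    "iscsi_ping_timeout": re.compile(r"connection\d+:\d+: ping timeout"),
    "multipath_fail": re.compile(r"multipath: Failing path"),
    "tx_timeout": re.compile(r"TX timeout detected"),
    "ring_free_failed": re.compile(r"hwrm_ring_free tx failed"),
}


def classify(log_text):
    """Return (timestamp, label) pairs for the lines matching PATTERNS."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        for label, pattern in PATTERNS.items():
            if pattern.search(msg):
                events.append((ts, label))
                break
    return events


# Lines copied from the log excerpts in this thread.
SAMPLE_LOG = """\
[ 2673.186083]  connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2678.951453] device-mapper: multipath: Failing path 8:32.
[ 2682.911693] bnxt_en :3b:00.0 eth4: TX timeout detected, starting reset task!
[ 2683.783061] bnxt_en :3b:00.0 eth4: hwrm_ring_free tx failed. rc:-1
"""

events = classify(SAMPLE_LOG)
for ts, label in events:
    print(f"{ts:12.6f}  {label}")
```

In the excerpts quoted in this thread, the iSCSI ping timeouts are logged
several seconds before the driver reports the TX timeout.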