Re: CURRENT: massive em0 NIC problems since IFLIB changes/introduction

2017-03-17 Thread O. Hartmann
On Fri, 17 Mar 2017 14:15:01 +0100
Alexander Leidinger wrote:

> Quoting "O. Hartmann" (from Fri, 17 Mar 2017 12:20:18 +0100):
> 
> > Since the introduction of the IFLIB changes, I have been seeing severe
> > problems on CURRENT.
> 
> I already reported something like this to sbruno@ and M. Macy (in copy).
> 
> > Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
> > 10:46:04 CET 2017  amd64), the problems on a workstation have become severe
> > within the past two days:
> >
> > For a couple of weeks, the em0 NIC (Intel i217-LM, see below) has been dying
> > under heavy I/O. I first noticed this when "rsync"ing poudriere repositories
> > to a remote NFSv4 (automounted) folder. The em0 device could be revived by an
> > ifconfig down/up procedure.
> > But not only the i217-LM chip is affected. On another box equipped with an
> > i350 dual port GBit NIC I observed similar behaviour under (artificially)
> > high I/O load (but I didn't investigate that further since it occurred very
> > seldom).
> 
> It's not only those chipsets.
> 
> It may be beneficial if you could provide the pciconf output for those  
> devices. Mine is:
> ---snip---
> em0@pci0:2:6:0: class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82541PI Gigabit Ethernet Controller'
> ---snip---
> 
> > Now, since around yesterday, the i217-LM dies without being revivable with
> > ifconfig down/up: after doing so, my FreeBSD CURRENT machine (Fujitsu Celsius
> > M740)
> 
> I don't know whether a simple down/up would help for the chip I see this
> issue with (it's a headless server in a remote datacenter). For the
> moment I'm using a workaround of something like "ping -C 1   
> || shutdown -r now" in crontab.
> 
> The system in question is at r314137.
> 
> > remains with a dead em0 device, reporting "no route" on some occasions but
> > stuck in the dead state. Every attempt to manually re-establish the route
> > fails; only rebooting the box gives some relief.
> >
> > On the console, I see some very strange reports:
> >
> > - ping suddenly reports no buffer space
> > - or I sometimes see massive occurrences of "em0: TX(0) desc avail = 1024, pidx
> >   = 0" on the console
> 
> I don't see this in messages or console log, but I see that ntpd can't  
> resolve hostnames in the logs.
> 
> > Either way, sending/receiving large files on an established GBit network
> > line, which can be saturated at approx. 100 MBytes/s, tends to make the NIC
> > fail.
> 
> I can report that an "svnlite update" of the FreeBSD src tree on the box
> is able to trigger the issue in my case.
> 
> I have to add that before the iflib changes I saw frequent
> em watchdog timeouts in the logs / dmesg. So for me we have two issues
> here:
>   - the hardware wasn't 100% supported before the iflib changes (it seems)
>   - with the iflib changes, some watchdog functionality /
>     automatic failure recovery has been lost
> 
> Bye,
> Alexander.
> 

In January (18.01.2017), I reported some strange behaviour of the same box to
Sean Bruno, along with some details of the hardware (I forgot to include them
in the email you're responding to, sorry), so here they are again:

[...]
Again, here is the pciconf output of the device: 

em0@pci0:0:25:0: class=0x02 card=0x11ed1734 chip=0x153a8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection I217-LM'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb30, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfb339000, size 4096, enabled
    bar   [18] = type I/O Port, range 32, base 0xf020, size 32, enabled
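
For comparing reports from different boxes, the first line of each pciconf entry can be decoded mechanically. The following is only a rough sketch: the function name and the returned keys are made up for illustration. It relies on the usual pciconf convention that chip= carries the device ID in the upper 16 bits and the vendor ID in the lower 16 bits, and that card= is laid out the same way for the subsystem IDs.

```python
import re
from typing import Dict

# "em0@pci0:0:25:0:" -> driver name plus domain:bus:slot:function
SELECTOR_RE = re.compile(r"^(\w+)@pci(\d+):(\d+):(\d+):(\d+):")
# "chip=0x153a8086" style key=value pairs
FIELD_RE = re.compile(r"(\w+)=0x([0-9a-fA-F]+)")


def parse_pciconf_header(line: str) -> Dict[str, object]:
    """Decode the first line of a pciconf device entry (hypothetical helper)."""
    match = SELECTOR_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a pciconf device line: {line!r}")
    name, domain, bus, slot, func = match.groups()
    fields = {key: int(value, 16) for key, value in FIELD_RE.findall(line)}
    chip = fields["chip"]
    card = fields["card"]
    return {
        "name": name,
        "location": tuple(int(n) for n in (domain, bus, slot, func)),
        "vendor_id": chip & 0xFFFF,     # lower 16 bits
        "device_id": chip >> 16,        # upper 16 bits
        "subvendor_id": card & 0xFFFF,
        "subdevice_id": card >> 16,
    }
```

Fed with the line above, this yields vendor ID 0x8086 (Intel) and device ID 0x153a for the I217-LM.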

[...]
The problem has become severe within the past two days. I ran CURRENT
buildworlds on a daily basis, did poudriere builds several times and tried to
sync them to the package repository server, and that failed dramatically, as
described above, starting yesterday.

-- 
O. Hartmann

I object to the use or transfer of my data for advertising purposes or for
market or opinion research (§ 28 Abs. 4 BDSG).




Re: CURRENT: massive em0 NIC problems since IFLIB changes/introduction

2017-03-17 Thread Alexander Leidinger
Quoting "O. Hartmann" (from Fri, 17 Mar 2017 12:20:18 +0100):



> Since the introduction of the IFLIB changes, I have been seeing severe
> problems on CURRENT.


I already reported something like this to sbruno@ and M. Macy (in copy).


> Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
> 10:46:04 CET 2017  amd64), the problems on a workstation have become severe
> within the past two days:
>
> For a couple of weeks, the em0 NIC (Intel i217-LM, see below) has been dying
> under heavy I/O. I first noticed this when "rsync"ing poudriere repositories
> to a remote NFSv4 (automounted) folder. The em0 device could be revived by an
> ifconfig down/up procedure.
> But not only the i217-LM chip is affected. On another box equipped with an
> i350 dual port GBit NIC I observed similar behaviour under (artificially)
> high I/O load (but I didn't investigate that further since it occurred very
> seldom).


It's not only those chipsets.

It may be beneficial if you could provide the pciconf output for those  
devices. Mine is:

---snip---
em0@pci0:2:6:0: class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82541PI Gigabit Ethernet Controller'
---snip---


> Now, since around yesterday, the i217-LM dies without being revivable with
> ifconfig down/up: after doing so, my FreeBSD CURRENT machine (Fujitsu Celsius
> M740)


I don't know whether a simple down/up would help for the chip I see this
issue with (it's a headless server in a remote datacenter). For the
moment I'm using a workaround of something like "ping -C 1   
|| shutdown -r now" in crontab.


The system in question is at r314137.
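
A slightly more careful variant of that crontab workaround could look roughly like the sketch below. It is an illustration only, not what is running on the box: it combines the ping check with the ifconfig down/up procedure mentioned in this thread before falling back to a reboot. PING_TARGET, INTERFACE, the function names and the settle delay are all hypothetical placeholders, and the ping call assumes the usual "-c 1" (send one packet) option.

```python
import subprocess
import time
from typing import Callable, Sequence

# Hypothetical placeholders; the mail above does not name the ping target.
PING_TARGET = "192.0.2.1"  # RFC 5737 documentation address, replace with a real host
INTERFACE = "em0"

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit status."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def link_alive(runner: Runner, target: str = PING_TARGET) -> bool:
    """Send one echo request; ping exits 0 only if a reply came back."""
    return runner(["ping", "-c", "1", target]) == 0


def watchdog(runner: Runner = run, settle: float = 10.0) -> str:
    """Check the link once and return the action taken:
    'ok', 'bounced' (down/up was enough) or 'reboot'."""
    if link_alive(runner):
        return "ok"
    # Cheap recovery first: the ifconfig down/up procedure from the thread.
    runner(["ifconfig", INTERFACE, "down"])
    runner(["ifconfig", INTERFACE, "up"])
    time.sleep(settle)  # give the link some time to come back
    if link_alive(runner):
        return "bounced"
    # Last resort, as in the crontab one-liner.
    runner(["shutdown", "-r", "now"])
    return "reboot"
```

Calling watchdog() from a small script run by cron as root every few minutes would be one way to use it. The runner argument exists only so the logic can be exercised without touching the real interface.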


> remains with a dead em0 device, reporting "no route" on some occasions but
> stuck in the dead state. Every attempt to manually re-establish the route
> fails; only rebooting the box gives some relief.
>
> On the console, I see some very strange reports:
>
> - ping suddenly reports no buffer space
> - or I sometimes see massive occurrences of "em0: TX(0) desc avail = 1024, pidx
>   = 0" on the console


I don't see this in messages or console log, but I see that ntpd can't  
resolve hostnames in the logs.



> Either way, sending/receiving large files on an established GBit network
> line, which can be saturated at approx. 100 MBytes/s, tends to make the NIC
> fail.


I can report that an "svnlite update" of the FreeBSD src tree on the box
is able to trigger the issue in my case.


I have to add that before the iflib changes I saw frequent
em watchdog timeouts in the logs / dmesg. So for me we have two issues
here:

 - the hardware wasn't 100% supported before the iflib changes (it seems)
 - with the iflib changes, some watchdog functionality /
   automatic failure recovery has been lost


Bye,
Alexander.

--
http://www.Leidinger.net  alexan...@leidinger.net : PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netch...@freebsd.org    : PGP 0x8F31830F9F2772BF




CURRENT: massive em0 NIC problems since IFLIB changes/introduction

2017-03-17 Thread O. Hartmann
Since the introduction of the IFLIB changes, I have been seeing severe
problems on CURRENT.

Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
10:46:04 CET 2017  amd64), the problems on a workstation have become severe
within the past two days:

For a couple of weeks, the em0 NIC (Intel i217-LM, see below) has been dying
under heavy I/O. I first noticed this when "rsync"ing poudriere repositories
to a remote NFSv4 (automounted) folder. The em0 device could be revived by an
ifconfig down/up procedure.
But not only the i217-LM chip is affected. On another box equipped with an
i350 dual port GBit NIC I observed similar behaviour under (artificially)
high I/O load (but I didn't investigate that further since it occurred very
seldom).

Now, since around yesterday, the i217-LM dies without being revivable with
ifconfig down/up: after doing so, my FreeBSD CURRENT machine (Fujitsu Celsius
M740) remains with a dead em0 device, reporting "no route" on some occasions
but stuck in the dead state. Every attempt to manually re-establish the route
fails; only rebooting the box gives some relief.

On the console, I see some very strange reports:

- ping suddenly reports no buffer space
- or I sometimes see massive occurrences of "em0: TX(0) desc avail = 1024, pidx
  = 0" on the console

Either way, sending/receiving large files on an established GBit network
line, which can be saturated at approx. 100 MBytes/s, tends to make the NIC
fail.

Since yesterday, it has been nearly impossible to transfer larger files in a
burst; the NIC dies rapidly and cannot be revived anymore except via reboot.
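
For reference, the "approx. 100 MBytes/s" figure can be put next to a back-of-the-envelope upper bound for a GBit link. The sketch below is only an estimate under stated assumptions that are not taken from the measurements above: a standard 1500-byte MTU, IPv4 and TCP headers without options, and full-size frames only. The names are made up for illustration.

```python
LINE_RATE_BPS = 1_000_000_000  # 1 GBit/s on the wire

MTU = 1500                     # assumed standard Ethernet MTU, in bytes
# Per-frame bytes on the wire besides the MTU-sized packet:
# Ethernet header (14) + FCS (4) + preamble/SFD (8) + inter-frame gap (12)
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12
# Assumed IPv4 (20) and TCP (20) headers without options
IP_TCP_HEADERS = 20 + 20


def max_tcp_payload_rate(line_rate_bps: int = LINE_RATE_BPS, mtu: int = MTU) -> float:
    """Upper bound for TCP payload throughput in bytes per second."""
    bytes_on_wire = mtu + ETHERNET_OVERHEAD   # what each frame really costs
    payload = mtu - IP_TCP_HEADERS            # useful data per frame
    return line_rate_bps / 8 * payload / bytes_on_wire


if __name__ == "__main__":
    print(f"{max_tcp_payload_rate() / 1e6:.1f} MBytes/s")
```

Under these assumptions the bound comes out a little below 119 MBytes/s, so a sustained transfer at around 100 MBytes/s already keeps the link close to saturation once NFS or rsync protocol overhead is added on top.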

Kind regards,

O. Hartmann
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"