Re: ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!

2019-05-15 Thread Predrag Punosevac
Stuart Henderson wrote:

> On 2019-05-15, Predrag Punosevac  wrote:
> > Hi,
> >
> > I am having an issue with a single 10 Gigabit interface on one of Intel
> > Xeon D-1541 network servers. Namely after the reboot the interface
> > appears to be down even with a static route
> >
> > phobos# ifconfig ix0
> > ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> > lladdr ac:1f:6b:19:f7:72
> > index 1 priority 0 llprio 3
> > groups: egress
> > media: Ethernet autoselect
> > status: no carrier
> > inet 128.2.204.160 netmask 0xfffffc00 broadcast 128.2.207.255
> >
> > The only thing I can see is 
> >
> > ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!  PHY will downshift to lower power state!
> 
> Looking at the driver it looks like this is a high temperature alarm
> coming from the transceiver (PHY) passed on by the nic. The driver
> attempts to power down the PHY in this condition, presumably to try to
> avoid damage.
> 
> Is the cooling in this system working correctly?
> 
> Do you still see it if you power it off for a while and let it cool
> down?
> 
> (10GBase-T is relatively power hungry.)


Hi Sten,

The network interface does come up after a cold boot (a complete power off,
not just the reboot command). I replaced the network cable and made sure that
the university network guys don't have any of the DHCP server "enterprise
features" on. Breaking into UEFI is not very useful, but I am logged into
IPMI on two identical SuperMicro X10SDV-TLN4F servers. One of them has
that problematic 

ix0 at pci3 dev 0 function 0 "Intel X552/X557-AT" rev 0x00: msi

interface

The only difference I see is that the problematic server runs 4 degrees
Celsius warmer (75 instead of 71), but according to the limits I should be
OK up to 95.
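
As a rough sanity check on those numbers, the headroom can be worked out
in a few lines of Python. This is only a sketch: the readings are the
figures quoted above, the names READINGS_C, LIMIT_C and headroom are made
up for illustration, and it assumes the 95 degree limit applies to the
same sensor that IPMI reports, which may not be the PHY's own sensor.

```python
# Readings in degrees Celsius, taken from the figures quoted above.
READINGS_C = {"problem server": 75.0, "healthy server": 71.0}

# Upper limit mentioned above. Assumption: it applies to the same sensor
# as the readings, which is not necessarily the PHY's internal sensor.
LIMIT_C = 95.0


def headroom(reading_c: float, limit_c: float = LIMIT_C) -> float:
    """Return how many degrees remain before the reading reaches the limit."""
    return limit_c - reading_c


for name, temp in READINGS_C.items():
    print(f"{name}: {temp:.0f} C, headroom {headroom(temp):.0f} C")
```

Both machines come out with 20 degrees or more to spare on that sensor,
so if the alarm is genuine it is presumably driven by a different
measurement.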

A little more searching reveals that I am not the only one who got hit with
this thing 

https://tinkertry.com/how-to-work-around-intermittent-intel-x557-network-outages-on-12-core-xeon-d

I will have to think it through before I decide what to do. 

Thanks for the heads up. 
Predrag

P.S. I was ready to fire up a Linux live CD in order to try to reproduce
the problem and see if the Intel guys have pushed some changes into the
Linux version of the driver which are not shared with this community.



Re: ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!

2019-05-15 Thread Stuart Henderson
On 2019-05-15, Predrag Punosevac  wrote:
> Hi,
>
> I am having an issue with a single 10 Gigabit interface on one of Intel
> Xeon D-1541 network servers. Namely after the reboot the interface
> appears to be down even with a static route
>
> phobos# ifconfig ix0
> ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> lladdr ac:1f:6b:19:f7:72
> index 1 priority 0 llprio 3
> groups: egress
> media: Ethernet autoselect
> status: no carrier
> inet 128.2.204.160 netmask 0xfffffc00 broadcast 128.2.207.255
>
> The only thing I can see is 
>
> ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!  PHY will downshift to lower power state!

Looking at the driver it looks like this is a high temperature alarm
coming from the transceiver (PHY) passed on by the nic. The driver
attempts to power down the PHY in this condition, presumably to try to
avoid damage.
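
To see which interfaces have logged this alarm, a saved copy of the
dmesg or log output can be scanned for the message. The following is a
minimal sketch, not anything taken from the driver: the function name
interfaces_with_phy_alarm and the sample text are made up for
illustration, and only the start of the message is matched.

```python
import re

# Start of the alarm line as reported in this thread. Only the prefix is
# matched, so line wrapping in the tail of the message does not matter.
ALARM = re.compile(r"^(?P<ifname>ix\d+): CRITICAL: EXTERNAL PHY OVER TEMP!!")


def interfaces_with_phy_alarm(log_text: str) -> list[str]:
    """Return interfaces that logged the alarm, in first-seen order."""
    seen: list[str] = []
    for line in log_text.splitlines():
        match = ALARM.match(line)
        if match and match.group("ifname") not in seen:
            seen.append(match.group("ifname"))
    return seen


# Hypothetical sample input built from the message quoted in this thread.
sample = (
    "ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!  "
    "PHY will downshift to lower power state!\n"
)
print(interfaces_with_phy_alarm(sample))
```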

Is the cooling in this system working correctly?

Do you still see it if you power it off for a while and let it cool down?

(10GBase-T is relatively power hungry.)




Re: ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!

2019-05-14 Thread Theo de Raadt
Predrag Punosevac  wrote:

> Hi,
> 
> I am having an issue with a single 10 Gigabit interface on one of Intel
> Xeon D-1541 network servers. Namely after the reboot the interface
> appears to be down even with a static route
> 
> phobos# ifconfig ix0
> ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> lladdr ac:1f:6b:19:f7:72
> index 1 priority 0 llprio 3
> groups: egress
> media: Ethernet autoselect
> status: no carrier
> inet 128.2.204.160 netmask 0xfffffc00 broadcast 128.2.207.255
> 
> The only thing I can see is 
> 
> ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!  PHY will downshift to lower power state!
> 
> both in dmesg, included at the end of this email, as well as in the log files.
> This appears to be a line from driver code committed a few years ago by 
> Mike Belopuhov
> 
> http://openbsd-archive.7691.n7.nabble.com/Intel-10GbE-ix-driver-update-Looking-for-tests-td308300.html

If you check the source code, the top of the file says:

  Copyright (c) 2001-2013, Intel Corporation

It's from their code, handling their part, and you have one of those.

You should probably ask Intel



ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!

2019-05-14 Thread Predrag Punosevac
Hi,

I am having an issue with a single 10 Gigabit interface on one of the Intel
Xeon D-1541 network servers. Namely, after a reboot the interface
appears to be down, even with a static route

phobos# ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
lladdr ac:1f:6b:19:f7:72
index 1 priority 0 llprio 3
groups: egress
media: Ethernet autoselect
status: no carrier
inet 128.2.204.160 netmask 0xfffffc00 broadcast 128.2.207.255
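
For anyone reading the output above, the flags word and the address line
can be cross-checked with a short Python sketch. The flag bit values are
the usual BSD <net/if.h> ones. The helper names decode_flags and
broadcast_for are made up for illustration, and the sketch assumes the
netmask is the /22 mask 0xfffffc00 implied by the address and broadcast
shown.

```python
import ipaddress

# Interface flag bits from BSD <net/if.h> (only the ones needed here).
IFF_BITS = {
    0x1: "UP",
    0x2: "BROADCAST",
    0x40: "RUNNING",
    0x800: "SIMPLEX",
    0x8000: "MULTICAST",
}


def decode_flags(flags: int) -> list[str]:
    """Return the names of the known flag bits set in flags."""
    return [name for bit, name in IFF_BITS.items() if flags & bit]


def broadcast_for(addr: str, netmask: int) -> ipaddress.IPv4Address:
    """Return the broadcast address for addr under a 32-bit netmask."""
    mask = ipaddress.IPv4Address(netmask)
    return ipaddress.ip_interface(f"{addr}/{mask}").network.broadcast_address


# ifconfig prints the flags word in hexadecimal.
print(decode_flags(0x8843))
print(broadcast_for("128.2.204.160", 0xFFFFFC00))
```

The flags decode to an interface that is administratively up and running,
and the computed broadcast matches 128.2.207.255, so the address
configuration itself looks consistent. The "no carrier" status is a link
state problem, not an addressing one.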

The only thing I can see is 

ix0: CRITICAL: EXTERNAL PHY OVER TEMP!!  PHY will downshift to lower power state!

both in dmesg, included at the end of this email, as well as in the log files.
This appears to be a line from driver code committed a few years ago by 
Mike Belopuhov

http://openbsd-archive.7691.n7.nabble.com/Intel-10GbE-ix-driver-update-Looking-for-tests-td308300.html

The server had an uptime of about a week before tonight's reboot, so I have no
reason to believe that the CAT 6 cable is bad, but I will replace it
tomorrow. I don't own the network equipment, but I have many servers
connected to the same university switch server rack, including identical 
Xeon D-1541 machines, and all appear to work flawlessly.

I am not a network engineer so I am quite bewildered by the whole
situation. Any hints? 

Predrag




OpenBSD 6.5 (GENERIC.MP) #0: Wed Apr 24 23:38:54 CEST 2019

r...@syspatch-65-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 17055797248 (16265MB)
avail mem = 16529260544 (15763MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xed9b0 (39 entries)
bios0: vendor American Megatrends Inc. version "1.2" date 04/21/2017
bios0: Silicon Mechanics 1U_SoC_D-1541
acpi0 at bios0: rev 2
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP APIC FPDT FIDT SPMI MCFG UEFI DBG2 HPET WDDT SSDT SSDT SSDT PRAD DMAR HEST BERT ERST EINJ
acpi0: wakeup devices IP2P(S4) EHC1(S4) EHC2(S4) RP01(S4) RP02(S4) RP03(S4) RP04(S4) RP05(S4) RP06(S4) RP07(S4) RP08(S4) BR1A(S4) BR1B(S4) BR2A(S4) BR2B(S4) BR2C(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz, 2100.25 MHz, 06-56-03
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 100MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz, 2100.01 MHz, 06-56-03
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz, 2100.01 MHz, 06-56-03
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 6 (application processor)
cpu3: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz, 2100.01 MHz, 06-56-03
cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,RDSEED,ADX,SMAP,PT,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 3, package 0
cpu4 at mainbus0: apid 8 (application processor)
cpu4: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz, 2100.01 MHz, 06-56-03