I've been looking through the diff on illumos-joyent and I fail to see anything obviously relevant. :/ If no one else is seeing this I'll try to get a build environment up so I can bisect this and figure it out...
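
To make comparing boots while bisecting less error-prone, the `dladm show-aggr -L` output quoted below could be checked mechanically. This is only a rough sketch: it assumes the human-readable column layout shown in this thread (eight whitespace-separated fields, `--` in the LINK column for continuation rows), and the function names are made up for illustration, not part of any existing tool.

```python
# Sketch: flag aggregation ports whose LACP negotiation has not completed,
# based on the text layout of `dladm show-aggr -L` as pasted in this thread.
# parse_show_aggr_lacp and unnegotiated_ports are hypothetical helper names.

SAMPLE = """\
LINK PORT AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0 igb0 yes yes yes yes no no
-- igb1 yes no no no yes no
-- igb2 yes yes yes yes no no
-- igb3 yes no no no yes no
"""


def parse_show_aggr_lacp(text):
    """Return one dict per port, keyed by the lower-cased column names."""
    rows = []
    header = None
    current_link = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "LINK":
            header = [name.lower() for name in fields]
            continue
        # Skip anything seen before the header or not matching its width.
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        # Continuation rows carry "--" instead of repeating the link name.
        if row["link"] == "--":
            row["link"] = current_link
        else:
            current_link = row["link"]
        rows.append(row)
    return rows


def unnegotiated_ports(rows):
    """Ports that are out of sync or running on defaulted partner info."""
    return [
        row["port"]
        for row in rows
        if row["sync"] != "yes" or row["defaulted"] == "yes"
    ]


if __name__ == "__main__":
    print(unnegotiated_ports(parse_show_aggr_lacp(SAMPLE)))
```

Fed the after-reboot output from the quoted mail, this reports igb1 and igb3, which matches what the firewall side saw. Saving the command output from each boot to a file and running it through the same two functions would give a quick pass/fail per build.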
//jb

2016-08-24 10:54 GMT+02:00 Jakob Borg <[email protected]>:
> Hi Robert,
>
> I didn't, but I can retry it now to reproduce, this server isn't
> particularly critical at the moment...
>
> In the working state:
>
> root@anto:~ # dladm show-aggr -L
> LINK   PORT  AGGREGATABLE  SYNC  COLL  DIST  DEFAULTED  EXPIRED
> aggr0  igb0  yes           yes   yes   yes   no         no
> --     igb1  yes           yes   yes   yes   no         no
> --     igb2  yes           yes   yes   yes   no         no
> --     igb3  yes           yes   yes   yes   no         no
>
> root@anto:~ # dladm show-aggr -x
> LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS          PORTSTATE
> aggr0  --    1000Mb  full    up     0:14:4f:e7:39:6  --
>        igb0  1000Mb  full    up     0:14:4f:e7:39:6  attached
>        igb1  1000Mb  full    up     0:14:4f:e7:39:7  attached
>        igb2  1000Mb  full    up     0:14:4f:e7:39:8  attached
>        igb3  1000Mb  full    up     0:14:4f:e7:39:9  attached
>
> After rebooting into 20160818T234814Z:
>
> root@anto:~ # dladm show-aggr -L
> LINK   PORT  AGGREGATABLE  SYNC  COLL  DIST  DEFAULTED  EXPIRED
> aggr0  igb0  yes           yes   yes   yes   no         no
> --     igb1  yes           no    no    no    yes        no
> --     igb2  yes           yes   yes   yes   no         no
> --     igb3  yes           no    no    no    yes        no
>
> root@anto:~ # dladm show-aggr -x
> LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS          PORTSTATE
> aggr0  --    1000Mb  full    up     0:14:4f:e7:39:6  --
>        igb0  1000Mb  full    up     0:14:4f:e7:39:6  attached
>        igb1  1000Mb  full    up     0:14:4f:e7:39:7  attached
>        igb2  1000Mb  full    up     0:14:4f:e7:39:8  attached
>        igb3  1000Mb  full    up     0:14:4f:e7:39:9  attached
>
> Note that that is different from last time - two of the links are up
> now, and seen as such from the other side as well.
> Running a snoop on igb1 (that didn't come up) shows LACP packets going
> out from the SmartOS side but nothing from the Juniper side:
>
> ETHER:  ----- Ether Header -----
> ETHER:
> ETHER:  Packet 12 arrived at 8:33:34.14399
> ETHER:  Packet size = 124 bytes
> ETHER:  Destination = 1:80:c2:0:0:2, Standard MAC Group Address
> ETHER:  Source      = 0:14:4f:e7:39:7,
> ETHER:  Ethertype   = 8809 (Unknown)
> ETHER:
>
> "show interface" on the firewall side shows 0 pps input and 1 pps out,
> indicating that it sends LACP PDUs of its own but doesn't see any
> packets from the SmartOS side, i.e. the opposite of what snoop sees.
>
> Flapping an interface from the firewall side made it come up. After boot:
>
> root@anto:~ # dladm show-aggr -x
> LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS          PORTSTATE
> aggr0  --    1000Mb  full    up     0:14:4f:e7:39:6  --
>        igb0  1000Mb  full    up     0:14:4f:e7:39:6  attached
>        igb1  1000Mb  full    up     0:14:4f:e7:39:7  attached
>        igb2  1000Mb  full    up     0:14:4f:e7:39:8  attached
>        igb3  1000Mb  full    up     0:14:4f:e7:39:9  attached
>
> root@anto:~ # dladm show-aggr -L
> LINK   PORT  AGGREGATABLE  SYNC  COLL  DIST  DEFAULTED  EXPIRED
> aggr0  igb0  yes           yes   yes   yes   no         no
> --     igb1  yes           no    no    no    yes        no
> --     igb2  yes           yes   yes   yes   no         no
> --     igb3  yes           no    no    no    yes        no
>
> Disable the port where igb3 is connected on the firewall side...
>
> root@anto:~ # dladm show-aggr -x
> LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS          PORTSTATE
> aggr0  --    1000Mb  full    up     0:14:4f:e7:39:6  --
>        igb0  1000Mb  full    up     0:14:4f:e7:39:6  attached
>        igb1  1000Mb  full    up     0:14:4f:e7:39:7  attached
>        igb2  1000Mb  full    up     0:14:4f:e7:39:8  attached
>        igb3  0Mb     half    down   0:14:4f:e7:39:9  standby
>
> Enable it again:
>
> root@anto:~ # dladm show-aggr -x
> LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS          PORTSTATE
> aggr0  --    1000Mb  full    up     0:14:4f:e7:39:6  --
>        igb0  1000Mb  full    up     0:14:4f:e7:39:6  attached
>        igb1  1000Mb  full    up     0:14:4f:e7:39:7  attached
>        igb2  1000Mb  full    up     0:14:4f:e7:39:8  attached
>        igb3  1000Mb  full    up     0:14:4f:e7:39:9  attached
>
> root@anto:~ # dladm show-aggr -L
> LINK   PORT  AGGREGATABLE  SYNC  COLL  DIST  DEFAULTED  EXPIRED
> aggr0  igb0  yes           yes   yes   yes   no         no
> --     igb1  yes           no    no    no    yes        no
> --     igb2  yes           yes   yes   yes   no         no
> --     igb3  yes           yes   yes   yes   no         no
>
> Traffic flows.
>
> Any ideas for further troubleshooting here? It would seem to be
> something related to the igb link layer, to me, and not LACP specific
> at least.
>
> Rebooting back into 20160804T173241Z, it's solid.
>
> //jb
>
> 2016-08-23 23:17 GMT+02:00 Robert Mustacchi <[email protected]>:
>> On 8/23/16 0:08 , Jakob Borg wrote:
>>> Hi,
>>>
>>> I upgraded a machine from SmartOS release 20160804T173241Z to
>>> 20160818T234814Z today. After rebooting, there was no network
>>> connectivity. After some debugging it turned out that LACP link
>>> aggregation didn't come up.
>>>
>>> The server is an old Sun X4170, using the built in four port igb card,
>>> connected to a Juniper SRX firewall.
>>> In the failed state, all links were marked as physically up but the
>>> LACP didn't negotiate:
>>>
>>> root@anto:~ # uname -a
>>> SunOS anto.nym.se 5.11 joyent_20160818T234814Z i86pc i386 i86pc
>>>
>>> root@anto:~ # dladm show-link
>>> LINK   CLASS  MTU   STATE  BRIDGE  OVER
>>> igb0   phys   1500  up     --      --
>>> igb2   phys   1500  up     --      --
>>> igb1   phys   1500  up     --      --
>>> igb3   phys   1500  up     --      --
>>> aggr0  aggr   1500  up     --      igb0 igb1 igb2 igb3
>>> net0   vnic   1500  ?      --      aggr0
>>> net0   vnic   1500  ?      --      aggr0
>>> eth0   vnic   1500  ?      --      aggr0
>>> eth0   vnic   1500  ?      --      aggr0
>>> eth0   vnic   1500  ?      --      aggr0
>>> net0   vnic   1500  ?      --      aggr0
>>> net0   vnic   1500  ?      --      aggr0
>>> net0   vnic   1500  ?      --      aggr0
>>>
>>> root@anto:~ # dladm show-aggr -L
>>> LINK   PORT  AGGREGATABLE  SYNC  COLL  DIST  DEFAULTED  EXPIRED
>>> aggr0  igb0  yes           no    no    no    yes        no
>>> --     igb1  yes           no    no    no    yes        no
>>> --     igb2  yes           no    no    no    yes        no
>>> --     igb3  yes           no    no    no    yes        no
>>>
>>> On the firewall side, interfaces were up but indicating no LACP traffic:
>>>
>>> jb@hlv-srx240> show lacp interfaces ae0
>>> Aggregated interface: ae0
>>>   LACP state:  Role     Exp  Def  Dist  Col  Syn  Aggr  Timeout  Activity
>>>   ge-0/0/12    Actor    No   Yes  No    No   No   Yes   Fast     Active
>>>   ge-0/0/12    Partner  No   Yes  No    No   No   Yes   Fast     Passive
>>>   ge-0/0/13    Actor    No   Yes  No    No   No   Yes   Fast     Active
>>>   ge-0/0/13    Partner  No   Yes  No    No   No   Yes   Fast     Passive
>>>   ge-0/0/14    Actor    No   Yes  No    No   No   Yes   Fast     Active
>>>   ge-0/0/14    Partner  No   Yes  No    No   No   Yes   Fast     Passive
>>>   ge-0/0/15    Actor    No   Yes  No    No   No   Yes   Fast     Active
>>>   ge-0/0/15    Partner  No   Yes  No    No   No   Yes   Fast     Passive
>>>   LACP protocol:  Receive State  Transmit State  Mux State
>>>   ge-0/0/12       Defaulted      Fast periodic   Detached
>>>   ge-0/0/13       Defaulted      Fast periodic   Detached
>>>   ge-0/0/14       Defaulted      Fast periodic   Detached
>>>   ge-0/0/15       Defaulted      Fast periodic   Detached
>>>
>>> The configuration is fairly straightforward:
>>>
>>> root@anto:~ # egrep aggr\|admin /usbkey/config
>>> aggr0_aggr=0:14:4f:e7:39:6,0:14:4f:e7:39:7,0:14:4f:e7:39:8,0:14:4f:e7:39:9
>>> aggr0_lacp_mode=active
>>> # admin_nic is the nic admin_ip will be connected to for headnode zones.
>>> admin_nic=aggr0
>>> admin_ip=172.16.32.3
>>> admin_ip6=2001:470:deeb:32::3/64
>>> admin_netmask=255.255.255.0
>>> admin_network=...
>>> admin_gateway=172.16.32.11
>>> internal_nic=aggr0
>>>
>>> After rebooting back into 20160804T173241Z, everything is fine again -
>>> LACP comes up immediately.
>>>
>>> Any ideas? The changelog didn't mention anything specifically relevant
>>> (searching for "lacp", "igb") that I could see...
>>
>> Did you also happen to grab dladm show-aggr -x when you were on the
>> system where it wasn't working?
>>
>> What does dladm show-aggr -L and dladm show-aggr -x show now that you're
>> on the system where it's working?
>>
>> Thanks,
>> Robert
>>

-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
