Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
On Thu, 22 Feb 2018, Kennedy, Joseph wrote:

> What is connected to these stacks from a client perspective?

Edge stacks :)  Mostly HPE/Comware stuff of various vintages.

> Are you running PIM on the interfaces? IGMP snooping?

PIM yes, IGMP snooping usually no - the VLANs are mostly routed on this
switch, so it is the IGMP querier. I think I see where you're going with
this: over many Foundry/Brocade platforms we've had issues with multicast.

> Have you checked your IGMP and multicast groups and how many group
> members are present?

Only a couple of hundred groups usually.

> Do you have any link-local groups showing up? Assuming active IGMP
> querier configuration of some kind, does the loss line up with any of
> the IGMP interface timers reaching 0?

There are link-local groups present for sure. We have a core MLX which
sees far more groups (it is on the path to the RP) and shows related CPU
issues, though not enough to be a problem at the moment. We're gradually
rolling out filtering of groups at the routed distribution layer to cut
down on this (although the effectiveness of this seems a bit hit and miss
on some platforms). The UPnP group is particularly pernicious, with TTLs
!= 1.

The ping loss was much less noticeable over this weekend, when there is
less activity on campus.

> Do your OSPF adjacencies look stable throughout or do they transition
> during the events? You said you notice loss through the stack but do you
> note any loss from the stack itself to the core uplinks?

OSPF adjacencies are totally stable. When the packet loss happens, it
affects pings to the loopback address and the OSPF interface addresses,
but seemingly not so much addresses through the stack. So more control
plane than data plane.

On a previous stack failure we flipped the stacking cable to the opposite
ports (there are two in the stack), but it failed again after that. We
upgraded from target path 8.0.10m to 8.0.30q, but it made absolutely no
difference.
I am also looking at the procedure to downgrade to 7.4 to see if the
problem persists, although my gut feeling is that it is hardware-based.
The ping loss problem occurs on another stacked pair that was 'upgraded'
to 8.0.10m at the same time, but that one doesn't exhibit the stack
failure issue.

I'm now monitoring the MIB, so hopefully I will get an SMS alert on
failures, and I have rigged up remote access to the power so we can
re-power the units remotely without a visit. This will buy us some time,
but essentially we're looking at bringing forward a replacement we were
likely to execute this summer.

Good questions Joseph, thanks!

Addendum, Tuesday morning: No failures to this point since Friday for us.
However, the other FCX stack that was upgraded at the same time and
exhibited the ping loss issue has now also experienced the stack break
issue. It had been running 15 days (although I'm not sure why it rebooted
then). Curiously, last week a third FCX stack broke itself apart too, but
that one is running 07400m and has been stable for a long, long time.

I had been minded to think we just had a hardware issue on our most
problematic stack, but with the two others now showing the same symptoms,
I'm starting to worry about whether the stack issue is being provoked by
some traffic. Seems hard to believe, but I've suspected it for other
equipment in the past. Ho hum.

Jethro.

> --JK
>
> -----Original Message-----
> From: foundry-nsp [mailto:foundry-nsp-boun...@puck.nether.net] On Behalf Of
> Jethro R Binks
> Sent: Thursday, February 22, 2018 5:16 AM
> To: foundry-nsp@puck.nether.net
> Subject: Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
>
> The silence was deafening!
>
> So a bit of a development with this. We had three stack failure events
> which required a hard reboot to sort. We made the decision to upgrade
> to 8.0.30q (we also replaced the CX4 cable, just in case it is degraded
> in some way). The upgrade was all fine.
>
> Initially after the reboot, we didn't see the ping loss issues. But
> over the past few hours it has started to creep in again, much the same
> as previously. I've not re-done all the tests like shutting down one
> then the other OSPF interface to see if it makes any difference to the
> problem, but my gut feeling is it will be just the same.
>
> Anyone any thoughts? Could there be some sort of hardware failure in
> one of the units that might cause these symptoms? Maybe I might have
> more diagnostic tools available. What might also be interesting is
> trying to downgrade back to the 7.4 version we were running previously,
> where we didn't see these issues. But that's more service-affecting
> downtime.
>
> Jethro.
>
> On Fri, 16 Feb 2018, Jethro R Binks wrote:
>
> > I thought I was doing the right thing by upgrading a couple of my
> >
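[Editor's note: the "monitoring the MIB" approach above can be sketched in a few lines. This is a minimal illustration, not the actual setup from the thread: the OID below is a placeholder (check the FastIron stacking MIB for the real unit-state table), and it assumes Net-SNMP's snmpwalk and an SNMPv2c community are available.]

```python
# Sketch: poll stack unit state via SNMP and flag a broken stack member.
# PLACEHOLDER OID -- substitute the real FastIron stack unit-state OID.
import subprocess

STACK_STATE_OID = "1.3.6.1.4.1.1991.1.1.3.31.2.1.1.4"  # hypothetical

def parse_unit_states(snmpwalk_output: str) -> dict:
    """Map stack unit index -> reported state, from snmpwalk text output
    of the form '...<oid>.<unit> = INTEGER: <state>'."""
    states = {}
    for line in snmpwalk_output.splitlines():
        if "=" not in line:
            continue
        oid, _, value = line.partition("=")
        try:
            unit = int(oid.strip().rsplit(".", 1)[-1])
            states[unit] = int(value.strip().rsplit(" ", 1)[-1])
        except ValueError:
            continue  # skip lines that don't parse as unit/state
    return states

def broken_units(states: dict, ok_value: int = 1) -> list:
    """Units whose state differs from the assumed-healthy value."""
    return sorted(u for u, s in states.items() if s != ok_value)

def poll(host: str, community: str = "public") -> list:
    """Walk the (placeholder) stack-state table and return failed units."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, host, STACK_STATE_OID],
        capture_output=True, text=True, check=True,
    ).stdout
    return broken_units(parse_unit_states(out))
```

A cron job calling poll() and mailing an SMS gateway on a non-empty result would give the alert-on-failure behaviour described.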
Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
Recent 803 patch releases work a lot better than the 801 or 802 code
trees for us. However, we still have 802 ICX6450s with high uptime that
we are now going to upgrade to 803q:

STACKID 4  system uptime is 1035 days 2 hours 55 minutes 59 seconds
STACKID 1  system uptime is 1037 days 35 minutes 54 seconds
STACKID 2  system uptime is 518 days 46 minutes 27 seconds
STACKID 3  system uptime is 1036 days 23 hours 32 minutes 13 seconds
STACKID 5  system uptime is 323 days 15 hours 13 minutes 20 seconds
STACKID 6  system uptime is 14 hours 46 minutes 27 seconds
The system started at 11:13:42 GMT+01 Wed Apr 22 2015

We see the SSH server on 8020c become unresponsive periodically. We did
not run the 801 code tree too long and upgraded to 802 years ago.

If you cannot reach the host via ping from the CLI, this is a connection
problem, not a forwarding problem. If you can reach the host via CLI
ping, but not from other hosts, this would be a forwarding problem,
because the FastIron doesn't need to route/switch its own host
connections.

If you don't see too many CPU packets with dm raw, this fits the low
CPU%, as too many packets hitting the CPU would increase CPU%. This
doesn't seem to be your problem here.

You can check with dm ipv4 hw-route / dm ipv4 hw-arp whether HW entries
are programmed correctly while the host is unreachable.

Do you have independent management connectivity to the FCX to check the
status while it stops routing?

Best regards,

Franz Georg Köhler

___
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
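[Editor's note: the diagnostic reasoning above (CLI ping vs. ping from another host, and Jethro's loss-to-the-switch vs. loss-through-the-switch observation) boils down to a small decision table. A sketch with purely illustrative names, no FastIron API:]

```python
# Franz's rule: the switch doesn't forward traffic to its own addresses,
# so where the ping fails tells you which plane is at fault.

def classify_loss(cli_ping_ok: bool, transit_ping_ok: bool) -> str:
    """Classify per Franz's rule:
    - ping fails even from the switch CLI  -> connection problem
    - CLI ping works, transit ping fails   -> forwarding problem
    - both work                            -> no fault this interval
    """
    if not cli_ping_ok:
        return "connection problem"
    if not transit_ping_ok:
        return "forwarding problem"
    return "ok"

def classify_symptom(loss_to_switch: bool, loss_through_switch: bool) -> str:
    """Jethro's observation: loss to the switch's own loopback/OSPF
    addresses, but not to hosts behind it, points at the control plane."""
    if loss_to_switch and not loss_through_switch:
        return "control plane"
    if loss_through_switch:
        return "data plane"
    return "ok"
```

Run both checks during a loss event (which is why independent management connectivity matters): if CLI ping succeeds while transit fails, dm ipv4 hw-route / hw-arp output for the destination is the next thing to capture.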
Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
Personally, I never moved away from 7.4.

Best regards.

> On 22 Feb 2018, at 11:16, Jethro R Binks wrote:
>
> The silence was deafening!
>
> So a bit of a development with this. We had three stack failure events
> which required a hard reboot to sort. We made the decision to upgrade to
> 8.0.30q (we also replaced the CX4 cable, just in case it is degraded in
> some way). The upgrade was all fine.
>
> Initially after the reboot, we didn't see the ping loss issues. But over
> the past few hours it has started to creep in again, much the same as
> previously. I've not re-done all the tests like shutting down one then
> the other OSPF interface to see if it makes any difference to the problem,
> but my gut feeling is it will be just the same.
>
> Anyone any thoughts? Could there be some sort of hardware failure in one
> of the units that might cause these symptoms? Maybe I might have more
> diagnostic tools available. What might also be interesting is trying to
> downgrade back to the 7.4 version we were running previously, where we
> didn't see these issues. But that's more service-affecting downtime.
>
> Jethro.
>
>> On Fri, 16 Feb 2018, Jethro R Binks wrote:
>>
>> I thought I was doing the right thing by upgrading a couple of my slightly
>> aging FCXs to target path release 8.0.10m, which tested fine on an
>> unstacked unit with a single OSPF peering.
>>
>> The ones I am running on are stacks of two, each with two 10Gb/s
>> connections to core, one OSPF peering on each.
>>
>> Since the upgrade, both stacks suffer packet loss every 2 minutes (just
>> about exactly) for about 5-10 seconds, demonstrated by pinging either a
>> host through the stack, or an interface on the stack. There are no log
>> messages or changes in OSPF status or spanning tree activity. When it
>> happens, of course a remote session to the box stalls for the same period.
>>
>> Shutting down either one of the OSPF links doesn't make a difference.
>> CPU never changes from 1%. No errors on ints. I've used dm commands to
>> catch packets going to CPU at about the right time and see nothing
>> particularly alarming and certainly no flooding of anything.
>>
>> This only started after the upgrade to 8.0.10m on each of them. I have
>> other FCX stacks on other code versions not exhibiting this issue.
>>
>> Some of the comments in this thread seem to be reflective of my issue:
>>
>> https://www.reddit.com/r/networking/comments/4j47uo/brocade_is_ruining_my_week_i_need_help_to/
>>
>> I'm a little dismayed to get these problems on a Target Path release,
>> which I assumed would be pretty sound. I've been eyeing a potential
>> upgrade to something in the 8.0.30 (recommendations?), with the usual
>> added excitement of bringing a fresh set of bugs.
>>
>> Before I consider reporting it, I wondered if anyone had any useful
>> observations or suggestions.
>>
>> And, as an aside, I wonder how we're all getting along in our new homes
>> for our dissociated Brocade family now. Very sad to see the assets of a
>> once good company scattered to the four winds like this.
>>
>> Jethro.
>>
>> . . . . . . . . . . . . . . . . . . . . . . . . .
>> Jethro R Binks, Network Manager,
>> Information Services Directorate, University Of Strathclyde, Glasgow, UK
>>
>> The University of Strathclyde is a charitable body, registered in
>> Scotland, number SC015263.
Re: [f-nsp] FCX and target path 8.0.10m (and an aside)
The silence was deafening!

So a bit of a development with this. We had three stack failure events
which required a hard reboot to sort. We made the decision to upgrade to
8.0.30q (we also replaced the CX4 cable, just in case it is degraded in
some way). The upgrade was all fine.

Initially after the reboot, we didn't see the ping loss issues. But over
the past few hours it has started to creep in again, much the same as
previously. I've not re-done all the tests like shutting down one then
the other OSPF interface to see if it makes any difference to the problem,
but my gut feeling is it will be just the same.

Anyone any thoughts? Could there be some sort of hardware failure in one
of the units that might cause these symptoms? Maybe I might have more
diagnostic tools available. What might also be interesting is trying to
downgrade back to the 7.4 version we were running previously, where we
didn't see these issues. But that's more service-affecting downtime.

Jethro.

On Fri, 16 Feb 2018, Jethro R Binks wrote:

> I thought I was doing the right thing by upgrading a couple of my slightly
> aging FCXs to target path release 8.0.10m, which tested fine on an
> unstacked unit with a single OSPF peering.
>
> The ones I am running on are stacks of two, each with two 10Gb/s
> connections to core, one OSPF peering on each.
>
> Since the upgrade, both stacks suffer packet loss every 2 minutes (just
> about exactly) for about 5-10 seconds, demonstrated by pinging either a
> host through the stack, or an interface on the stack. There are no log
> messages or changes in OSPF status or spanning tree activity. When it
> happens, of course a remote session to the box stalls for the same period.
>
> Shutting down either one of the OSPF links doesn't make a difference.
> CPU never changes from 1%. No errors on ints. I've used dm commands to
> catch packets going to CPU at about the right time and see nothing
> particularly alarming and certainly no flooding of anything.
>
> This only started after the upgrade to 8.0.10m on each of them. I have
> other FCX stacks on other code versions not exhibiting this issue.
>
> Some of the comments in this thread seem to be reflective of my issue:
>
> https://www.reddit.com/r/networking/comments/4j47uo/brocade_is_ruining_my_week_i_need_help_to/
>
> I'm a little dismayed to get these problems on a Target Path release,
> which I assumed would be pretty sound. I've been eyeing a potential
> upgrade to something in the 8.0.30 (recommendations?), with the usual
> added excitement of bringing a fresh set of bugs.
>
> Before I consider reporting it, I wondered if anyone had any useful
> observations or suggestions.
>
> And, as an aside, I wonder how we're all getting along in our new homes
> for our dissociated Brocade family now. Very sad to see the assets of a
> once good company scattered to the four winds like this.
>
> Jethro.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
[f-nsp] FCX and target path 8.0.10m (and an aside)
I thought I was doing the right thing by upgrading a couple of my slightly
aging FCXs to target path release 8.0.10m, which tested fine on an
unstacked unit with a single OSPF peering.

The ones I am running on are stacks of two, each with two 10Gb/s
connections to core, one OSPF peering on each.

Since the upgrade, both stacks suffer packet loss every 2 minutes (just
about exactly) for about 5-10 seconds, demonstrated by pinging either a
host through the stack, or an interface on the stack. There are no log
messages or changes in OSPF status or spanning tree activity. When it
happens, of course a remote session to the box stalls for the same period.

Shutting down either one of the OSPF links doesn't make a difference.
CPU never changes from 1%. No errors on ints. I've used dm commands to
catch packets going to CPU at about the right time and see nothing
particularly alarming and certainly no flooding of anything.

This only started after the upgrade to 8.0.10m on each of them. I have
other FCX stacks on other code versions not exhibiting this issue.

Some of the comments in this thread seem to be reflective of my issue:

https://www.reddit.com/r/networking/comments/4j47uo/brocade_is_ruining_my_week_i_need_help_to/

I'm a little dismayed to get these problems on a Target Path release,
which I assumed would be pretty sound. I've been eyeing a potential
upgrade to something in the 8.0.30 (recommendations?), with the usual
added excitement of bringing a fresh set of bugs.

Before I consider reporting it, I wondered if anyone had any useful
observations or suggestions.

And, as an aside, I wonder how we're all getting along in our new homes
for our dissociated Brocade family now. Very sad to see the assets of a
once good company scattered to the four winds like this.

Jethro.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
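[Editor's note: the "every 2 minutes, just about exactly, for 5-10 seconds" pattern reported above is easy to confirm from a once-per-second ping log. A minimal sketch on synthetic data, with no FastIron specifics assumed:]

```python
# Detect loss bursts in a (timestamp, reply_received) ping log and
# estimate their recurrence period (~120 s in the symptom described).

def loss_bursts(samples):
    """Return (start, end) of each contiguous run of lost pings.
    'samples' is a time-ordered list of (timestamp, bool) pairs."""
    bursts, start = [], None
    for t, ok in samples:
        if not ok and start is None:
            start = t                  # burst begins
        elif ok and start is not None:
            bursts.append((start, t))  # burst ends at first reply
            start = None
    if start is not None:              # log ended mid-burst
        bursts.append((start, samples[-1][0]))
    return bursts

def burst_period(bursts):
    """Mean gap between burst start times; None if fewer than two bursts."""
    starts = [b[0] for b in bursts]
    if len(starts) < 2:
        return None
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)
```

A period close to 120 s with ~7 s burst widths would match the symptom; a stable period is also a hint that some timer-driven process (IGMP querier intervals, a periodic table rebuild) is worth correlating against.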