Hi Keith,

We've run into the same issues you are describing. I'm not sure what model
of controllers you are running, but we have four 7280s set up in a
single cluster, with each controller connecting over a single 40 Gbps uplink
and running a C-Build of AOS 8.0.6. We also went through a similar
experience with Aruba TAC, checking all of the different
multicast/broadcast settings with only minor improvements. After doing some
digging, I noticed we were seeing large spikes in broadcast traffic on the
controller uplink ports during peak hours. During that time, we would also
see large spikes in interface discards on our switch ports. You can check
this in Voyance: go to My Account > Feeds, click on your crawler (not the IP
address link), click Crawler Details, click on ARP under the Charts
section, and look at the graph for ARP Req Broadcast Pkts. In our case, that
graph showed a very large number of ARP request broadcasts.
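
If you want to look for the same kind of spikes outside of Voyance, here is a rough Python sketch of the idea: turn cumulative counter samples (for example, interface discards or broadcast packets polled from a switch port) into per-second rates and flag the intervals that sit well above the median. The function names, the 5x factor, and the sample values are all made up for illustration; this is not how Voyance or our NMS does it.

```python
from statistics import median


def counter_rates(samples):
    """Convert cumulative (timestamp_seconds, counter) samples to per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            # Skip out-of-order samples and counter resets/wraps.
            continue
        rates.append((t1, (c1 - c0) / dt))
    return rates


def find_spikes(rates, factor=5.0):
    """Return (timestamp, rate) pairs whose rate exceeds factor x the median rate."""
    if not rates:
        return []
    baseline = median(r for _, r in rates)
    return [(t, r) for t, r in rates if baseline > 0 and r > factor * baseline]


# Hypothetical samples: (seconds since start, cumulative discard counter).
samples = [(0, 1000), (60, 1600), (120, 2200), (180, 14200), (240, 14800)]
spikes = find_spikes(counter_rates(samples))
print(spikes)
```

With these invented numbers, only the interval ending at 180 seconds stands out. The useful part is lining those timestamps up against the times your cluster members disconnect.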

What we ended up doing is enabling "Monitor/police non-gratuitous ARP
attacks" on the controllers.  Once we turned it on, the number of ARP Req
Broadcast Pkts we were seeing in Voyance and the spikes in interface
discards were drastically reduced, and as a result we haven't had any more
controller/cluster disconnects. The setting is located under Configuration
> Services > Firewall.  You may need to tweak the attack rate settings if
you start running into client issues.
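
To give a feel for what an attack rate threshold does, here is a minimal Python sketch of per-source rate monitoring over a sliding window. This is only my mental model, not Aruba's implementation; the class name, the parameters, and the numbers are hypothetical and do not correspond to actual controller settings.

```python
from collections import defaultdict, deque


class ArpRateMonitor:
    """Flag sources that send more than max_requests ARP requests per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # source MAC -> recent request timestamps

    def observe(self, source_mac, timestamp):
        """Record one ARP request; return True if the source is over the limit."""
        times = self._seen[source_mac]
        times.append(timestamp)
        # Drop timestamps that have aged out of the sliding window.
        while times and timestamp - times[0] >= self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


# Hypothetical threshold: more than 3 requests within 1 second is flagged.
monitor = ArpRateMonitor(max_requests=3, window_seconds=1.0)
chatty = [monitor.observe("aa:aa:aa:aa:aa:aa", t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]
quiet = [monitor.observe("bb:bb:bb:bb:bb:bb", t) for t in (0.0, 2.0, 4.0)]
print(chatty, quiet)
```

The trade-off is the same one you will be tuning on the controller: set the threshold too low and legitimate but chatty clients get flagged, set it too high and the broadcast storms get through.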

FWIW, with Aruba support if you don't ask for the case to be escalated, it
won't get escalated and can drag on for months with no resolution. If the
case still doesn't seem to be progressing, make sure it gets escalated to
the ERT group.

On Fri, Dec 6, 2019 at 7:52 PM Miller, Keith C <[email protected]> wrote:

> Hello all,
>
>
>
> As many of you know, we’re an Aruba shop and we’re running multiple
> versions of 8.x in our environment. We are also a Nyansa Voyance customer,
> and those of you who are also Nyansa customers will probably remember back in
> October when they changed the default behavior for AP down/reboot events
> from “No Priority” to “Always P2”. Almost immediately, we began receiving
> alerts from Voyance about large numbers of APs going down at the same time.
> After looking at our controllers and other NMS tools, we realized that the
> APs were not actually going down, but the radios on the APs were
> rebootstrapping.
>
>
>
> For those unfamiliar with what rebootstrapping is, it essentially means
> that the radios of the AP rebooted, but the AP itself stayed up. This is
> typically caused by missed heartbeats and/or when an AP reconnects to a
> controller. In a clustered environment, when a controller fails, an AP
> should gracefully move to its S-AAC with little to no impact. However, in
> our case we were seeing APs fail to gracefully fail over after missing
> heartbeats, and this was causing the rebootstraps. This impacts clients and
> our users so obviously we were very concerned with what we had found. After
> opening a case with Aruba TAC, we discovered that the cluster members were
> disconnecting from each other. You can see if this is happening in your
> environment by running the “show lc-cluster heartbeat counters” command on
> one of the MDs in a cluster. You’re looking for the last column that
> indicates the last time of disconnect. For us, this has been occurring in
> multiple environments (8.3, 8.4, and 8.5) at least since we began looking
> into it back in October. We’ve sent many logs, traces, and now packet
> captures to the Aruba TAC team. At the request of TAC, we’ve changed
> heartbeat thresholds and enabled BCMC optimization on VLAN interfaces even
> though we have it enabled at the SSID level. While some of these efforts
> have slowed down the frequency of the disconnects, they are still occurring.
>
>
>
> So I’m looking to get some feedback from those that are running AOS 8.x in
> their environment. Are you seeing this problem in your environment?
>
>
>
> Lastly, if you’re experiencing this issue or you’re just interested in
> finding out more about the health of your environment, you can also verify
> if you have APs that are rebootstrapping with the “show ap debug counters”
> command. If you want to isolate a particular AP and gather more
> information, you can run the “show ap debug system-status ap-name” command.
> Here’s what it looks like when the AP doesn’t gracefully fail over:
>
>
>
> Cluster Failover Information
> ----------------------------
> Date       Time     Reason (Latest 10)
> --------------------------------------
> 2019-11-25 01:10:20 Delete A-AAC:172.27.xx.xx, cluster enabled=1. fail-over to 172.27.xx.xx, sby status=1
>
>
>
> Thanks in advance for any and all feedback.
>
>
>
> Regards,
>
>
>
> Keith C. Miller
>
> Wireless Architect, ITS Comm. Technologies
>
> University of North Carolina Chapel Hill
>
> O: (919)962-6564 M: (803)464-2397 | [email protected]
>
> **********
> Replies to EDUCAUSE Community Group emails are sent to the entire
> community list. If you want to reply only to the person who sent the
> message, copy and paste their email address and forward the email reply.
> Additional participation and subscription information can be found at
> https://www.educause.edu/community
>

