Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing any issues in the fall with large classrooms and delayed connection times (Aruba 8.5.0.13)

2021-09-13 Thread Davis, Jonathan Alan
Hello All, I wanted to give you an update. First, I’ll provide a TLDR:

-We (UNC-CH) seem to be in a stable situation. Clients are connecting and 
staying connected. STM is consuming reasonable levels of resources.
-We are blind. In addition to disabling the MM to MC communication, we had to 
disable all SNMP. We also have a bit of automation that utilizes SNMP, and that 
is also broken.
-There were TWO separate issues, and I’ll break those out below. It’s important 
to understand that the ARP issue is a separate issue from the STM issue. In our 
first couple of days working with Aruba, even TAC did not recognize our 
symptoms as two separate issues, and this is probably the thing that frustrated 
me the most.


Issue number one: ARP
Our users complained they could not connect to the network and access resources 
for the first five to ten minutes of a class.
Apple iOS devices running version 14 and some Lenovo devices ARP their entire 
subnet after joining Wi-Fi. The Aruba controllers have security rules in place 
to prevent ARP flooding and DDoS attacks utilizing ARP. As clients joined the 
network before classes the devices would ARP the subnet, and once a threshold 
was reached, the controller would begin discarding ARP packets for all clients 
on that controller. The result was that devices would connect, get assigned an 
IP via DHCP, and then ARP to get the MAC of their default gateway. That packet 
would be discarded, and until the controller again allowed ARP to pass, clients 
weren’t able to find their gateway. Depending on the client, this usually 
resulted in them again restarting the 802.11 join process.  [Christopher 
Johnson, this is the behavior you are experiencing.]
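
The starvation described above can be illustrated with a rough model. This is only a sketch. It assumes the policer is a single shared packets-per-second budget per controller and that drops are spread evenly across all ARP traffic. The client counts and per-client scan rates are made-up numbers, not measurements from this thread, and the function names are hypothetical.

```python
def offered_arp_pps(scanning_clients, scan_pps_per_client, other_arp_pps=0):
    """Total ARP packets per second offered to one controller (hypothetical inputs)."""
    return scanning_clients * scan_pps_per_client + other_arp_pps


def arp_pass_fraction(rate_limit_pps, offered_pps):
    """Fraction of ARP packets that get through a shared rate limit.

    Simplification: drops are assumed to be spread evenly over all ARP
    packets, so a gateway ARP has the same chance as any scan packet.
    """
    if offered_pps <= rate_limit_pps:
        return 1.0
    return rate_limit_pps / offered_pps


# Hypothetical burst: 50 clients each scanning at 100 ARPs/s, plus 20 ARPs/s
# of ordinary traffic such as gateway lookups.
offered = offered_arp_pps(50, 100, other_arp_pps=20)
for limit in (992, 9792):  # the two rates mentioned in this post
    print(limit, round(arp_pass_fraction(limit, offered), 3))
```

Under these assumed numbers, roughly four out of five ARP packets are discarded at the lower rate and none at the higher one, which is consistent with clients intermittently failing to resolve their gateway.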

You can see if you are being affected by running:
show datapath bwm table
and checking for contract 9 (ARP). You can also check this more specifically by 
running:
show datapath bwm type [type] contract 9
In our case, the full command was:
show datapath bwm type 0 contract 9

When we first addressed this issue, we had over 2 million dropped (policed) 
packets on each controller. Our default configuration was 992pps. After 
consideration, we raised our rate to 9792pps, expecting that multiple clients 
will likely be ARPing the network at the same time and recognizing how large 
the subnet is… and hey, it seemed like a good idea. Since then, we average no 
more than 1-3K drops at any given time, and our users are telling us they can 
connect and access the network on the first try.
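
To tell whether the policed count is still climbing, it helps to compare two readings of the counter rather than look at one absolute number. The sketch below assumes the readings are copied by hand from the contract 9 output and that the counter is cumulative. It does not parse any controller output, and the function name and sample values are hypothetical.

```python
from datetime import datetime


def policed_rate(prev_count, prev_time, cur_count, cur_time):
    """Average policed (dropped) packets per second between two readings."""
    elapsed = (cur_time - prev_time).total_seconds()
    if elapsed <= 0:
        raise ValueError("cur_time must be later than prev_time")
    delta = cur_count - prev_count
    if delta < 0:
        # Assumption: a smaller reading means the counter was reset
        # (for example by a reboot), so count from zero.
        delta = cur_count
    return delta / elapsed


# Hypothetical readings taken five minutes apart.
t0 = datetime(2021, 9, 13, 9, 0, 0)
t1 = datetime(2021, 9, 13, 9, 5, 0)
print(policed_rate(2_000_000, t0, 2_003_000, t1))
```

A rate near zero between class changes and a spike at the top of the hour would point at the join-time ARP bursts rather than at steady background traffic.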

We have seen no other detrimental effects of this change.

NEXT – STM
We disabled our connections between the MM and MCs and restarted all 
controllers by controller cluster groups to ensure APs and clients would stay 
connected. Once everything was restarted, we waited for students to migrate 
from ResNET to our Main Campus cluster.
We began getting the first complaints around 10am. After checking load 
distribution, we found that we had even distribution of APs across our 8 MCs, 
but 90% of our clients were connected to only two of our eight controllers in 
that cluster despite our load balancing configuration. This continued to be an 
issue, and TAC confirmed that we were appropriately configured to load balance 
clients at 10%.
Despite disabling the MM to MC connections, we still had very high utilization 
by STM, and TAC decided controllers were unable to balance client connections 
due to that state.
The next step was to block SNMP on the controller firewalls. As you can all 
imagine, this was a difficult decision for us, but if clients can’t connect to 
Wi-Fi, we don’t need SNMP to tell us it’s down…the users do a great job of 
that! 
Once we disabled SNMP, STM processor usage fell to ~30-70% and clients began 
balancing appropriately across controllers.
So, as I said in my TLDR, we are flying blind, but user reports are coming in 
that the issue is much improved. Now we wait for Aruba to deliver our bug fix, 
and a bit of time for testing to ensure we don’t cause more issues.
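
The imbalance described above (most clients on two of eight controllers) is easy to check once per-controller client counts are in hand. The following is a minimal sketch, assuming the counts are collected by some other means. The controller names and numbers are hypothetical, shaped to resemble the situation in this post, and the 5000-client threshold is the rough figure reported on this list, not a vendor limit.

```python
def top_n_share(clients_by_controller, n):
    """Share of all clients held by the n busiest controllers."""
    total = sum(clients_by_controller.values())
    if total == 0:
        return 0.0
    busiest = sorted(clients_by_controller.values(), reverse=True)[:n]
    return sum(busiest) / total


def over_threshold(clients_by_controller, threshold=5000):
    """Names of controllers carrying more clients than the threshold."""
    return sorted(
        name for name, count in clients_by_controller.items() if count > threshold
    )


# Hypothetical snapshot: two controllers near 8K clients, six nearly idle.
counts = {"mc1": 8000, "mc2": 8000}
counts.update({f"mc{i}": 300 for i in range(3, 9)})

print(round(top_n_share(counts, 2), 2))
print(over_threshold(counts))
```

Running a check like this on a schedule would flag the skew before the busiest controllers cross the point where users start to notice.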

I want to pause here and express my second large frustration with the 
situation. For the affected cluster, we are running eight 7240XM controllers, 
which according to Aruba should support 32K clients each, yet those two 
controllers were incapable of load balancing due to high STM utilization when 
each had only 8K clients.
Like many who have spoken up, we begin seeing issues as soon as client counts 
on a controller exceed 5K clients. I shudder to think what our experience would 
have been if we had half as many controllers in the cluster.

Marketecture != good design

JD
--
Jonathan Davis
Wireless Architect
The University of North Carolina at Chapel Hill
jonath...@unc.edu

From: The EDUCAUSE Wireless Issues Community Group Listserv 
 on behalf of James Andrewartha 

Date: Saturday, September 11, 2021 at 9:49 PM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU 
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall 

2021-09-02 Thread Davis, Jonathan Alan
“That's been my experience for years.  The network works great when there are 
no students around.  My working theory is that students emit RF interference, 
but research ethics won’t let me run the tests, so we'll never know for sure.”

It’s worse than that! They are walking bags of water which absorb the good RF, 
and their devices transmit the bad RF! It’s a conspiracy I tell ya!

We’re going to work with TAC on capturing traffic during a class that is known 
to have issues. After that, we plan to change the rebalancing threshold as well.

Thanks everyone for the feedback!

JD


From: The EDUCAUSE Wireless Issues Community Group Listserv 
 on behalf of Enfield, Chuck 

Date: Thursday, September 2, 2021 at 12:15 PM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU 
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)
I will also add that our problems did not increase linearly with client count 
on a controller.  Below 5K there was no user impact.  Around 5K problems 
started and the severity increased quickly.  I doubt there’s anything magic 
about 5K, and the threshold will be different on every network based on a 
variety of implementation details, but I’d expect that pattern to be common.

From: The EDUCAUSE Wireless Issues Community Group Listserv 
 On Behalf Of Enfield, Chuck
Sent: Thursday, September 2, 2021 11:21 AM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)

Between 5k and 6k clients on a 7240xm is where we started seeing problems. 
Lighter loaded controllers were OK.

From: "Street, Chad A" <cstr...@emory.edu>
Sent: Thursday, September 2, 2021 11:03 AM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)

We are a balanced cluster, notes about load below:


"I’m also noticing that there are much fewer clients on this controller, and 
that ratio doesn’t seem to be improving."

To this point, the action we took that seemed to help the most was adjusting 
our active client load balancing threshold.  We dropped it significantly to 
force clients to balance across controllers.  Once we got below ~5000 active 
clients per controller, we stopped seeing the mass client connection issues.

We still have a controller that hasn't taken significant load, but now that 
we've been running without major issues for the past few days, we're reluctant 
to touch the setting again.

From: The EDUCAUSE Wireless Issues Community Group Listserv 
<WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU> on behalf of Rob Harris 
<robert.har...@culinary.edu>
Sent: Thursday, September 2, 2021 10:59 AM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)


For those of you who have experienced this, what was your user load and how 
were your clusters operating (balancing, active/standby)?

I wonder if there’s a threshold...

Thx!



From: The EDUCAUSE Wireless Issues Community Group Listserv 
<WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU> On Behalf Of Smith, Nayef
Sent: Thursday, September 2, 2021 10:20 AM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)

"I’m also noticing that there are much fewer clients on this controller, and 
that ratio doesn’t seem to be improving."

To this point, the action we took that seemed to help the most was adjusting 
our active client load balancing threshold.  We dropped it significantly to 
force clients to balance across controllers.  Once we got below ~5000 active 
clients per controller, we stopped seeing the mass client connection issues.

We still have a controller that hasn't taken significant load, but now that 
we've been running without major issues for the past few days, we're reluctant 
to touch the setting again.

Nayef Z. Smith | Network Services | Voice: 404-727-6019



From: The EDUCAUSE Wireless Issues Community Group Listserv 
<WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU> on behalf of Davis, Jonathan Alan 
<jonath...@unc.edu>

2021-09-02 Thread Davis, Jonathan Alan
Lee, don’t you bring your bad Cisco-juju to this conversation! :-)

Now that Lee has been properly handled, this is probably a great opportunity to 
say ‘hello’ to the greater list.

Hello!

Last night, we (UNC) restarted the controller used to test the firewall policy. 
Despite Aruba’s advisory, we’ve been led to believe that restarting STM may not 
be enough, and restarting the whole controller may be required to resolve high 
STM CPU utilization.

This morning we are keeping a close eye on that controller. While STM is 
surging well past 100%, it seems to be averaging much closer to 95%.

However…
We also only have about 7,000 users connected across the cluster. It will be 
interesting to see what happens as the day progresses and students wake up and 
migrate from the ResNET cluster to the Campus cluster.
I’m also noticing that there are much fewer clients on this controller, and 
that ratio doesn’t seem to be improving.

I’ll update as we progress through this.


JD

--

Jonathan Davis

Wireless Architect

The University of North Carolina at Chapel Hill

jonath...@unc.edu

+1 336 279 3355 (Mobile)


From: The EDUCAUSE Wireless Issues Community Group Listserv 
 on behalf of Lee H Badman 

Sent: Thursday, September 2, 2021 9:06:33 AM
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU 
Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
any issues in the fall with large classrooms and delayed connection times 
(Aruba 8.5.0.13)

But you tested in your lab, right? I love that one… put new code on a couple of 
APs, or even a few dozen. That’s supposed to somehow indicate what will happen 
at bigger load… and also maybe implies the vendor didn’t do their own “similar 
lab testing”…

“You should have tested before upgrading the whole environment…” how do you 
REALLY do that? And should you really have to? Just pondering the general state 
of things.

> On Sep 2, 2021, at 08:59, Enfield, Chuck  wrote:
>
> That's been my experience for years.  The network works great when there are 
> no students around.  My working theory is that students emit RF interference, 
> but research ethics won’t let me run the tests, so we'll never know for sure.
>
> -Original Message-
> From: The EDUCAUSE Wireless Issues Community Group Listserv 
>  On Behalf Of Patrick McEvilly
> Sent: Thursday, September 2, 2021 8:56 AM
> To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
> Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else seeing 
> any issues in the fall with large classrooms and delayed connection times 
> (Aruba 8.5.0.13)
>
> Speaking from experience, I would be very concerned.  We had no issues until 
> students returned and we went downhill from there.
>
>
> On 9/2/21, 8:50 AM, "The EDUCAUSE Wireless Issues Community Group Listserv 
> on behalf of Rob Harris" <robert.har...@culinary.edu> wrote:
>
>Has anyone seen any details regarding what they consider "Large" 
> environments? We upgraded during the break, but both before and after 
> versions are affected. We didn't notice this happening before, should we be 
> concerned now?
>
>The "dropped" is 0 and the stm cpu usage is in single digits, but client 
> count is really low (they come back this weekend as well), could we be in the 
> clear?
>
>(asked the SE team and opened a tac call, same questions to them)
>
>thx
>
>-Original Message-
>From: The EDUCAUSE Wireless Issues Community Group Listserv 
>  On Behalf Of Jason Healy
>Sent: Thursday, September 2, 2021 8:45 AM
>To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU
>Subject: Re: [WIRELESS-LAN] [External] Re: [WIRELESS-LAN] Anyone else 
> seeing any issues in the fall with large classrooms and delayed connection 
> times (Aruba 8.5.0.13)
>
>FWIW, Aruba just posted an advisory regarding this issue:
>
>Aruba Support Advisory ARUBA-SA-20210901-PLVL04, "Wi-Fi Client 
> Connectivity Failures in Large Client Environments"
>
>Good luck to those of you hit by this. My students start coming back this 
> weekend so I'll be watching this closely!
>
>Jason