Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-11 Thread Arran Cudbard-Bell

> On 10 Mar 2016, at 22:36, Curtis K. Larsen  wrote:
> 
> About a year and a half ago I did pretty exhaustive testing of RADIUS load
> with the Spirent traffic generator and with the assistance of PacketFence
> developers.  (PacketFence is also based on FreeRADIUS).  They suggested we
> tweak the MaxConcurrentAPI setting on our test AD server.  So we did, but
> unfortunately it seemed to make no difference at all in the number of
> authentications per second we could process from the load generator.
> 
> One thing we found though was that if we ran the authentications against a
> flat file on the RADIUS server itself we could process six times more
> authentications.  The bottom line is that whether it is SAMBA, NTLM, AD, or
> network latency itself I can't say - but I do know that if I eliminate all
> of them performance increases dramatically.
> 
> Bottom line:  Use EAP-TLS, and avoid checking LDAP/AD except when absolutely
> necessary.  PEAP is vulnerable to fake AP/MITM attacks anyway.

PEAP and TTLS are both horrifically insecure. I have a presentation on it 
coming up; I'll post the video when it's complete.

The OS X/iOS/Windows supplicants are all vulnerable to bid-down attacks when 
there's no wireless profile for the network.

The server can request EAP-TTLS and they'll happily oblige, meaning you don't 
even need to crack the DES keys in MSCHAPv2.

> 
> If you must check AD all the time - get a lot of servers, load balance them, 
> monitor and graph
> authentications down to the second.  That way you'll be more likely to 
> identify the cause of an
> issue.

It doesn't help that FreeRADIUS's processing model is synchronous.  We're 
looking at fixing that, but after considering all the options it really looks 
like the only model we can adopt is using our own stack. That means adapting 
the current unlang interpreter to provide coroutine-like behaviour, and 
reworking function calls in any module that performs blocking I/O.
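
As a rough illustration of what coroutine-like behaviour buys (a toy Python sketch, not FreeRADIUS or unlang code): each simulated request yields while it waits on a slow backend, so one thread can keep many requests in flight. The names `backend_lookup`, `run_requests` and `peak_concurrency` are made up for this example, and the backend delay is simulated with a sleep rather than real I/O.

```python
import asyncio


async def backend_lookup(state, delay):
    """Stand-in for a slow AD/LDAP call (hypothetical; the delay is simulated)."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    # Yield to the event loop while "waiting" instead of blocking the thread.
    await asyncio.sleep(delay)
    state["in_flight"] -= 1


async def run_requests(n, delay):
    """Start n lookups at once and report the peak number in flight."""
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(backend_lookup(state, delay) for _ in range(n)))
    return state["peak"]


def peak_concurrency(n=50, delay=0.01):
    """Run the simulation on a fresh event loop and return the peak."""
    return asyncio.run(run_requests(n, delay))


if __name__ == "__main__":
    print("peak requests in flight on one thread:", peak_concurrency())
```

With a synchronous model the same thread would handle one lookup at a time, so the peak would be 1 and a slow backend would stall everything queued behind it.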

It's not trivial, not sponsored, and there are only two full-time developers, so 
it's going to take a while.

-Arran


Arran Cudbard-Bell 
FreeRADIUS development team

FD31 3077 42EC 7FCD 32FE 5EE2 56CF 27F9 30A8 CAA2


**
Participation and subscription information for this EDUCAUSE Constituent Group 
discussion list can be found at http://www.educause.edu/groups/.





Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Curtis K. Larsen
About a year and a half ago I did pretty exhaustive testing of RADIUS load with
the Spirent traffic generator and with the assistance of PacketFence developers.
(PacketFence is also based on FreeRADIUS).  They suggested we tweak the
MaxConcurrentAPI setting on our test AD server.  So we did, but unfortunately it
seemed to make no difference at all in the number of authentications per second
we could process from the load generator.

One thing we found though was that if we ran the authentications against a flat
file on the RADIUS server itself we could process six times more
authentications.  Whether it is Samba, NTLM, AD, or network latency itself I
can't say - but I do know that if I eliminate all of them performance increases
dramatically.

Bottom line:  Use EAP-TLS, and avoid checking LDAP/AD except when absolutely
necessary.  PEAP is vulnerable to fake AP/MITM attacks anyway.

If you must check AD all the time - get a lot of servers, load balance them,
monitor and graph authentications down to the second.  That way you'll be more
likely to identify the cause of an issue.
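
One possible way to get per-second numbers out of whatever log you already have is sketched below in Python. It assumes you have already parsed the log into `datetime` objects, one per authentication; the parsing itself depends on your RADIUS server and is not shown. The function names are made up for this example.

```python
from collections import Counter
from datetime import datetime


def auths_per_second(timestamps):
    """Count authentications in each one-second bucket.

    timestamps: iterable of datetime objects, one per authentication
    (assumed to be parsed from your own RADIUS log beforehand).
    """
    return Counter(ts.replace(microsecond=0) for ts in timestamps)


def busiest_second(timestamps):
    """Return (second, count) for the busiest second, or None if no data."""
    counts = auths_per_second(timestamps)
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    sample = [
        datetime(2016, 3, 10, 9, 0, 0, 100000),
        datetime(2016, 3, 10, 9, 0, 0, 900000),
        datetime(2016, 3, 10, 9, 0, 1),
    ]
    print(busiest_second(sample))
```

The resulting counts can be fed to any graphing tool; the point is only that the bucket is one second wide, so short bursts are not averaged away.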

Thanks,

-- 
Curtis K. Larsen
Sr. Network Engineer
University of Utah IT/CIS



On Thu, March 10, 2016 1:44 pm, Jake Snyder wrote:
> If AD is not keeping up with the NTLM requests, giving the DCs more NTLM 
> worker threads can help
> it keep up with higher loads.
>
> Working with TAC we found specifically in the ACS logs that it was waiting 
> for Windows to respond.
>
> As far as number of devices, they weren't showing increases over earlier in 
> the week or previous
> weeks.
>
> Thanks
> Jake Snyder
>
>
> Sent from my iPhone
>
>> On Mar 10, 2016, at 12:21 PM, Matthew Newton  wrote:
>>
>> Hi,
>>
>>> On Thu, Mar 10, 2016 at 10:54:59AM -0800, Jake Snyder wrote:
>>> Thanks for the great info on FreeRadius.  I don't think this is
>>> the case in what I'm seeing, which is specifically that
>>> Windows AD is not keeping up with NTLM.
>>
>> OK, that's interesting. I think the issue that others have seen on
>> this would look like that - and certainly the symptoms sound the
>> same as you described - so I'm wondering how you came to the
>> conclusion that it's AD itself rather than something between AD
>> and ACS.
>>
>> However, I'm not at all familiar with ACS - I guess it sits on a
>> member server and probably calls LsaLogonUser directly - so there
>> is the communication between the member server and the DC, though
>> I guess that /should/ be fairly slick in theory...
>>
>>> These are customers with environments that are relatively stable
>>> and have been performing well for extended periods of time with
>>> similar user counts.  These are also well below the 256 radius
>>> session limit.
>>
>> I'd throw in the consideration of student numbers as well. We
>> always hit our peak number of wireless clients in February/March
>> each year, so this is the time problems often show up. Why this
>> time of year I have no idea! Probably all the new Christmas
>> presents being connected. :)
>>
>>> The MaxConcurrentAPI raises the number of worker threads in AD
>>> so that NTLM on the DC can keep up with the incoming
>>> requests.  Why did the performance of NTLM change recently?  I
>>> have no idea, but it appears it has.
>>
>> I believe MaxConcurrentAPI helped some people[0] who were having
>> problems with the FreeRADIUS/Samba setup as well, so again I'm not
>> entirely sure it's a pointer to AD having necessarily changed.
>>
>> Maybe review all Windows patches applied to the DCs and ACS
>> servers in the last 3 months and see if anything seems relevant?
>> But I'm not sure how easy this is to do.
>>
>> It seems very likely to me that sites are seeing a combination
>> of problems, which could be all of WLC running out of RADIUS IDs,
>> ntlm_auth/Samba as well as MaxConcurrentAPI - so it wouldn't
>> surprise me if different things seem to fix the same symptoms for
>> different sites. It's just that the ACS sites don't have the
>> ntlm_auth component of the problem, so it may have taken a few
>> more months of load before the issue reared its head!
>>
>> Cheers,
>>
>> Matthew
>>
>>
>> [0] see e.g. 
>> https://lists.freeradius.org/pipermail/freeradius-users/2015-March/075969.html
>>
>> --
>> Matthew Newton, Ph.D. 
>>
>> Systems Specialist, Infrastructure Services,
>> I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom
>>
>> For IT help contact helpdesk extn. 2253, 
>>
>
>



Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Jake Snyder
If AD is not keeping up with the NTLM requests, giving the DCs more NTLM worker 
threads can help it keep up with higher loads.

Working with TAC we found specifically in the ACS logs that it was waiting for 
Windows to respond.

As far as number of devices, they weren't showing increases over earlier in the 
week or previous weeks.

Thanks
Jake Snyder


Sent from my iPhone

> On Mar 10, 2016, at 12:21 PM, Matthew Newton  wrote:
> 
> Hi,
> 
>> On Thu, Mar 10, 2016 at 10:54:59AM -0800, Jake Snyder wrote:
>> Thanks for the great info on FreeRadius.  I don't think this is
>> the case in what I'm seeing, which is specifically that
>> Windows AD is not keeping up with NTLM.
> 
> OK, that's interesting. I think the issue that others have seen on
> this would look like that - and certainly the symptoms sound the
> same as you described - so I'm wondering how you came to the
> conclusion that it's AD itself rather than something between AD
> and ACS.
> 
> However, I'm not at all familiar with ACS - I guess it sits on a
> member server and probably calls LsaLogonUser directly - so there
> is the communication between the member server and the DC, though
> I guess that /should/ be fairly slick in theory...
> 
>> These are customers with environments that are relatively stable
>> and have been performing well for extended periods of time with
>> similar user counts.  These are also well below the 256 radius
>> session limit.
> 
> I'd throw in the consideration of student numbers as well. We
> always hit our peak number of wireless clients in February/March
> each year, so this is the time problems often show up. Why this
> time of year I have no idea! Probably all the new Christmas
> presents being connected. :)
> 
>> The MaxConcurrentAPI raises the number of worker threads in AD
>> so that NTLM on the DC can keep up with the incoming
>> requests.  Why did the performance of NTLM change recently?  I
>> have no idea, but it appears it has.
> 
> I believe MaxConcurrentAPI helped some people[0] who were having
> problems with the FreeRADIUS/Samba setup as well, so again I'm not
> entirely sure it's a pointer to AD having necessarily changed.
> 
> Maybe review all Windows patches applied to the DCs and ACS
> servers in the last 3 months and see if anything seems relevant?
> But I'm not sure how easy this is to do.
> 
> It seems very likely to me that sites are seeing a combination
> of problems, which could be all of WLC running out of RADIUS IDs,
> ntlm_auth/Samba as well as MaxConcurrentAPI - so it wouldn't
> surprise me if different things seem to fix the same symptoms for
> different sites. It's just that the ACS sites don't have the
> ntlm_auth component of the problem, so it may have taken a few
> more months of load before the issue reared its head!
> 
> Cheers,
> 
> Matthew
> 
> 
> [0] see e.g. 
> https://lists.freeradius.org/pipermail/freeradius-users/2015-March/075969.html
> 
> -- 
> Matthew Newton, Ph.D. 
> 
> Systems Specialist, Infrastructure Services,
> I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom
> 
> For IT help contact helpdesk extn. 2253, 
> 



Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Matthew Newton
Hi,

On Thu, Mar 10, 2016 at 10:54:59AM -0800, Jake Snyder wrote:
> Thanks for the great info on FreeRadius.  I don't think this is
> the case in what I'm seeing, which is specifically that
> Windows AD is not keeping up with NTLM.

OK, that's interesting. I think the issue that others have seen on
this would look like that - and certainly the symptoms sound the
same as you described - so I'm wondering how you came to the
conclusion that it's AD itself rather than something between AD
and ACS.

However, I'm not at all familiar with ACS - I guess it sits on a
member server and probably calls LsaLogonUser directly - so there
is the communication between the member server and the DC, though
I guess that /should/ be fairly slick in theory...

> These are customers with environments that are relatively stable
> and have been performing well for extended periods of time with
> similar user counts.  These are also well below the 256 radius
> session limit.

I'd throw in the consideration of student numbers as well. We
always hit our peak number of wireless clients in February/March
each year, so this is the time problems often show up. Why this
time of year I have no idea! Probably all the new Christmas
presents being connected. :)

> The MaxConcurrentAPI raises the number of worker threads in AD
> so that NTLM on the DC can keep up with the incoming
> requests.  Why did the performance of NTLM change recently?  I
> have no idea, but it appears it has.

I believe MaxConcurrentAPI helped some people[0] who were having
problems with the FreeRADIUS/Samba setup as well, so again I'm not
entirely sure it's a pointer to AD having necessarily changed.

Maybe review all Windows patches applied to the DCs and ACS
servers in the last 3 months and see if anything seems relevant?
But I'm not sure how easy this is to do.

It seems very likely to me that sites are seeing a combination
of problems, which could be all of WLC running out of RADIUS IDs,
ntlm_auth/Samba as well as MaxConcurrentAPI - so it wouldn't
surprise me if different things seem to fix the same symptoms for
different sites. It's just that the ACS sites don't have the
ntlm_auth component of the problem, so it may have taken a few
more months of load before the issue reared its head!

Cheers,

Matthew


[0] see e.g. 
https://lists.freeradius.org/pipermail/freeradius-users/2015-March/075969.html

-- 
Matthew Newton, Ph.D. 

Systems Specialist, Infrastructure Services,
I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom

For IT help contact helpdesk extn. 2253, 



Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Kitri Waterman
This exact discussion came up in a ClearPass in-depth class yesterday at 
Atmosphere/Airheads since ClearPass (based on FreeRadius) only has so many 
worker threads. Anything over a 2 sec delay between ClearPass and AD was...not 
ideal.

The class was "Adapting to Evolving User, Security and Business Needs with 
Aruba Clearpass" with Troy Arnold and Rajesh Ramireddy.

The videos should be available shortly/next week I believe. Definitely worth 
seeing even if you aren't Aruba based.


Kitri Waterman
University of Washington
ki...@uw.edu
 

On 3/10/16, 10:54 AM, "The EDUCAUSE Wireless Issues Constituent Group Listserv 
on behalf of Jake Snyder"  wrote:

>Matthew,
>Thanks for the great info on FreeRadius.  I don't think this is the case in 
>what I'm seeing, which is specifically that Windows AD is not keeping up 
>with NTLM.
>
>These are customers with environments that are relatively stable and have been 
>performing well for extended periods of time with similar user counts.  These 
>are also well below the 256 radius session limit.
>
>The MaxConcurrentAPI raises the number of worker threads in AD so that NTLM 
>on the DC can keep up with the incoming requests.  Why did the performance of 
>NTLM change recently?  I have no idea, but it appears it has.
>
>Thanks
>Jake Snyder
>
>
>Sent from my iPhone
>
>> On Mar 10, 2016, at 7:50 AM, Matthew Newton  wrote:
>> 
>> On Thu, Mar 10, 2016 at 09:14:02AM -0500, Earl Barfield wrote:
>>>> Just wanted to throw this out to the educause community to see if others
>>>> are seeing this.  Although this is not ultimately a problem with Higher Ed,
>>>> the large scale RADIUS deployments in higher ed resulting in more impact
>>> 
>>> If anything (radius server, users, Active Directory, etc) slows down
>>> the auth process, then you're going to have more auth sessions in
>>> progress simultaneously.
>> 
>> This has been a well-known issue in the FreeRADIUS world for a
>> long time now. Anything that slows down the NTLM communication
>> between the RADIUS server and the AD server will eventually lead
>> to problems. It just seems to crop up more in certain
>> circumstances. With FreeRADIUS, part of the problem seemed to be
>> using Samba's ntlm_auth (which involves an exec) so I did quite a
>> bit of hacking a year ago to use a library call and avoid that,
>> which does seem to help. As does faster hardware for the RADIUS
>> servers.
>> 
>> Cisco haven't helped themselves for a long time by using a single
>> UDP source port (and therefore only 256 radius IDs) per
>> controller. Using a different source port per access point would
>> have been a decent solution IMO, or even just random ephemeral ports,
>> but they've gone for some half-way solution that uses a few more
>> source ports in 8.1-something. Better than before anyway.
>> 
>> The problem exacerbates itself because when the WLC doesn't get a
>> response from a RADIUS server after a while, it will drop that
>> server and move to the next. Then all 250 or so authentications
>> in-flight (and probably half completed) will get chopped off and
>> have to start again on the next server.
>> 
>> Each hour when all the students moved between lectures we'd see 10
>> minutes of WLCs jumping to a different RADIUS server every minute
>> or so. This makes the higher-ed situation fairly unique and not
>> like business environments, where people don't tend to move around
>> in very large groups all at the same time.
>> 
>> I started to collect mailing list posts on a blog post to try and
>> collect information together if anyone's interested in reading
>> lots of different views on it! http://q.asd.me.uk/0
>> 
>> It's one of those things that if you're not looking for it,
>> though, you might not easily notice it, but just have complaints
>> about bad wireless connectivity at certain times of the day. It
>> becomes easy to see in the WLC SNMP RADIUS server not responding
>> traps, however.
>> 
>> Cheers,
>> 
>> Matthew
>> 
>> 
>> -- 
>> Matthew Newton, Ph.D. 
>> 
>> Systems Specialist, Infrastructure Services,
>> I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom
>> 
>> For IT help contact helpdesk extn. 2253, 
>> 
>




Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Jake Snyder
Matthew,
Thanks for the great info on FreeRadius.  I don't think this is the case in 
what I'm seeing, which is specifically that Windows AD is not keeping up 
with NTLM.

These are customers with environments that are relatively stable and have been 
performing well for extended periods of time with similar user counts.  These 
are also well below the 256 radius session limit.

The MaxConcurrentAPI raises the number of worker threads in AD so that NTLM 
on the DC can keep up with the incoming requests.  Why did the performance of 
NTLM change recently?  I have no idea, but it appears it has.
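
To make the relationship between concurrent slots and throughput concrete, here is a back-of-the-envelope Python sketch. It is a simplified model and not a description of how Netlogon actually schedules work: it assumes each NTLM authentication occupies exactly one slot for a fixed time, and the function names are invented for the example.

```python
import math


def max_auth_rate(concurrent_slots, seconds_per_auth):
    """Upper bound on authentications per second under the simplified model:
    each authentication holds one slot for seconds_per_auth."""
    if concurrent_slots < 1 or seconds_per_auth <= 0:
        raise ValueError("need at least one slot and a positive hold time")
    return concurrent_slots / seconds_per_auth


def slots_needed(auths_per_second, seconds_per_auth):
    """Slots required to sustain a given rate (rate times hold time, rounded up)."""
    if auths_per_second < 0 or seconds_per_auth <= 0:
        raise ValueError("rate must be non-negative and hold time positive")
    # round() guards against floating-point noise such as 30 * 0.1.
    return math.ceil(round(auths_per_second * seconds_per_auth, 9))


if __name__ == "__main__":
    print(max_auth_rate(2, 0.5))   # 2 slots, half a second each
    print(slots_needed(40, 0.25))  # 40 auths/s at a quarter second each
```

Under this model, two slots that are each held for half a second cap out at 4 authentications per second, and sustaining 40 per second at a quarter second each needs 10 slots. Real values depend on the environment, so treat the output as a starting point for measurement rather than a recommended setting.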

Thanks
Jake Snyder


Sent from my iPhone

> On Mar 10, 2016, at 7:50 AM, Matthew Newton  wrote:
> 
> On Thu, Mar 10, 2016 at 09:14:02AM -0500, Earl Barfield wrote:
>>> Just wanted to throw this out to the educause community to see if others
>>> are seeing this.  Although this is not ultimately a problem with Higher Ed,
>>> the large scale RADIUS deployments in higher ed resulting in more impact
>> 
>> If anything (radius server, users, Active Directory, etc) slows down
>> the auth process, then you're going to have more auth sessions in
>> progress simultaneously.
> 
> This has been a well-known issue in the FreeRADIUS world for a
> long time now. Anything that slows down the NTLM communication
> between the RADIUS server and the AD server will eventually lead
> to problems. It just seems to crop up more in certain
> circumstances. With FreeRADIUS, part of the problem seemed to be
> using Samba's ntlm_auth (which involves an exec) so I did quite a
> bit of hacking a year ago to use a library call and avoid that,
> which does seem to help. As does faster hardware for the RADIUS
> servers.
> 
> Cisco haven't helped themselves for a long time by using a single
> UDP source port (and therefore only 256 radius IDs) per
> controller. Using a different source port per access point would
> have been a decent solution IMO, or even just random ephemeral ports,
> but they've gone for some half-way solution that uses a few more
> source ports in 8.1-something. Better than before anyway.
> 
> The problem exacerbates itself because when the WLC doesn't get a
> response from a RADIUS server after a while, it will drop that
> server and move to the next. Then all 250 or so authentications
> in-flight (and probably half completed) will get chopped off and
> have to start again on the next server.
> 
> Each hour when all the students moved between lectures we'd see 10
> minutes of WLCs jumping to a different RADIUS server every minute
> or so. This makes the higher-ed situation fairly unique and not
> like business environments, where people don't tend to move around
> in very large groups all at the same time.
> 
> I started to collect mailing list posts on a blog post to try and
> collect information together if anyone's interested in reading
> lots of different views on it! http://q.asd.me.uk/0
> 
> It's one of those things that if you're not looking for it,
> though, you might not easily notice it, but just have complaints
> about bad wireless connectivity at certain times of the day. It
> becomes easy to see in the WLC SNMP RADIUS server not responding
> traps, however.
> 
> Cheers,
> 
> Matthew
> 
> 
> -- 
> Matthew Newton, Ph.D. 
> 
> Systems Specialist, Infrastructure Services,
> I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom
> 
> For IT help contact helpdesk extn. 2253, 
> 



Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-10 Thread Matthew Newton
On Thu, Mar 10, 2016 at 09:14:02AM -0500, Earl Barfield wrote:
> >Just wanted to throw this out to the educause community to see if others
> >are seeing this.  Although this is not ultimately a problem with Higher Ed,
> >the large scale RADIUS deployments in higher ed resulting in more impact
> 
> If anything (radius server, users, Active Directory, etc) slows down
> the auth process, then you're going to have more auth sessions in
> progress simultaneously.

This has been a well-known issue in the FreeRADIUS world for a
long time now. Anything that slows down the NTLM communication
between the RADIUS server and the AD server will eventually lead
to problems. It just seems to crop up more in certain
circumstances. With FreeRADIUS, part of the problem seemed to be
using Samba's ntlm_auth (which involves an exec) so I did quite a
bit of hacking a year ago to use a library call and avoid that,
which does seem to help. As does faster hardware for the RADIUS
servers.

Cisco haven't helped themselves for a long time by using a single
UDP source port (and therefore only 256 radius IDs) per
controller. Using a different source port per access point would
have been a decent solution IMO, or even just random ephemeral ports,
but they've gone for some half-way solution that uses a few more
source ports in 8.1-something. Better than before anyway.
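
For anyone wondering where the 256 comes from: the Identifier field in a RADIUS packet is a single octet, so one source socket can have at most 256 requests outstanding to a given server. A rough Python sketch of when that runs out is below. It assumes outstanding requests are roughly request rate times response time; the function names and the `source_ports` parameter are made up for the example and do not correspond to any WLC setting.

```python
RADIUS_IDS_PER_SOCKET = 256  # Identifier is one octet in the RADIUS header


def outstanding_requests(requests_per_second, response_seconds):
    """Approximate number of requests awaiting a reply at any moment."""
    return requests_per_second * response_seconds


def id_space_exhausted(requests_per_second, response_seconds, source_ports=1):
    """True if the estimated outstanding requests exceed the available IDs.

    source_ports is a hypothetical knob: how many distinct UDP source
    ports the client spreads its requests over.
    """
    if source_ports < 1:
        raise ValueError("need at least one source port")
    available = RADIUS_IDS_PER_SOCKET * source_ports
    return outstanding_requests(requests_per_second, response_seconds) > available


if __name__ == "__main__":
    print(id_space_exhausted(200, 1.0))                  # 200 outstanding
    print(id_space_exhausted(200, 2.0))                  # 400 outstanding
    print(id_space_exhausted(200, 2.0, source_ports=2))  # 400 against 512 IDs
```

The same request rate that fits comfortably at a one-second response time overflows a single source port once the backend slows to two seconds, which is why a slow AD shows up as a controller-side problem.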

The problem exacerbates itself because when the WLC doesn't get a
response from a RADIUS server after a while, it will drop that
server and move to the next. Then all 250 or so authentications
in-flight (and probably half completed) will get chopped off and
have to start again on the next server.

Each hour when all the students moved between lectures we'd see 10
minutes of WLCs jumping to a different RADIUS server every minute
or so. This makes the higher-ed situation fairly unique and not
like business environments, where people don't tend to move around
in very large groups all at the same time.

I started to collect mailing list posts on a blog post to try and
collect information together if anyone's interested in reading
lots of different views on it! http://q.asd.me.uk/0

It's one of those things that if you're not looking for it,
though, you might not easily notice it, but just have complaints
about bad wireless connectivity at certain times of the day. It
becomes easy to see in the WLC SNMP RADIUS server not responding
traps, however.

Cheers,

Matthew


-- 
Matthew Newton, Ph.D. 

Systems Specialist, Infrastructure Services,
I.T. Services, University of Leicester, Leicester LE1 7RH, United Kingdom

For IT help contact helpdesk extn. 2253, 



RE: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-09 Thread Lee H Badman
Is both, to me. Block the worst clients as they are easy to find. But also use 
exclusion with very short timer to slow the effects way down while not 
penalizing good clients with odd auth behavior. :)

Thanks for sharing all this, Jake!

Lee Badman
Network Architect/Wireless TME
Syracuse University
315.443.3003

-Original Message-
From: Jake Snyder [jsnyde...@gmail.com]
Received: Wednesday, 09 Mar 2016, 17:35
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU [WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU]
Subject: Re: [WIRELESS-LAN] Recent Radius Meltdowns

I don't necessarily agree with the doc in all aspects.  My takeaway is that 
some failing clients can put a huge load on the RADIUS environment.  I've seen 
some clients sending 20 requests per second.  I think it's better to identify a 
client doing that through logging and block them individually rather than 
risking the exclusion.

Thanks
Jake Snyder


Sent from my iPhone

On Mar 9, 2016, at 1:53 PM, Lee H Badman <lhbad...@syr.edu> wrote:

I have to disagree with 120 second client exclusion timer- that in itself can 
be devastating. I recommend 5 or 10 seconds.

Lee Badman
Network Architect/Wireless TME
Syracuse University
315.443.3003

-Original Message-
From: Jake Snyder [jsnyde...@gmail.com]
Received: Wednesday, 09 Mar 2016, 16:05
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU [WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU]
Subject: [WIRELESS-LAN] Recent Radius Meltdowns

Just wanted to throw this out to the educause community to see if others are 
seeing this.  Although this is not ultimately a problem with Higher Ed, the 
large-scale RADIUS deployments in higher ed result in more impact.

Several weeks ago we had a higher ed customer whose RADIUS environment started 
periodically melting down.  The customer was running Cisco infrastructure and 
ACS 5.x on the back end.

In terms of changes, there were no recent changes to either the wireless 
network or the RADIUS environment.  The only recent change was patches applied to 
the Windows environment.

Ultimately, the cause was found to be that the AD environment was taking an 
excessive time to respond to NTLM authentications.  There was no ultimate fix 
found, but troubleshooting led us to changing the MaxConcurrentAPI on the 
Windows servers, which helped enough to stop the problem from being a 
daily occurrence.

About a week later, this same customer reported to me, after visiting another 
university campus, that its RADIUS environment was also experiencing these 
issues.

Fast forward a couple of weeks, and I had a public utility customer seeing this same 
issue.  Suddenly flags went up that this is more widespread than just a couple of 
Higher Ed customers.

Now I'm sitting at #ATM16, and after talking with other Higher Ed engineers and a large 
retail customer, it MAY be impacting non-Cisco infrastructure as well.  My 
assumption is anything performing

Below are some of the links that talk about this change to the MaxConcurrentAPI. 
I believe these two customers made changes anywhere from 2 to 20.  I know some 
of these customers are on this educause list.  I'm not advocating a specific value; 
I assume that different environments will need different values.


https://support.microsoft.com/en-us/kb/109626

https://blogs.technet.microsoft.com/ad/2008/09/23/updated-ntlm-and-maxconcurrentapi-concerns/

Hopefully this helps anyone who has started to see these issues in the last few 
weeks.  Also, if you're having this, please reply and let the community know 
infrastructure, radius and possibly AD environment versions.

Also, for the Cisco folks, here's a great doc that you should read.

http://www.cisco.com/c/en/us/support/docs/wireless-mobility/wireless-lan-wlan/118703-technote-wlc-00.html





Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-09 Thread Jake Snyder
I don't necessarily agree with the doc in all aspects.  My takeaway is that 
some failing clients can put a huge load on the RADIUS environment.  I've seen 
some clients sending 20 requests per second.  I think it's better to identify a 
client doing that through logging and block them individually rather than 
risking the exclusion.
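
A minimal Python sketch of the "find them through logging" approach follows. It assumes the RADIUS log has already been reduced to (epoch seconds, client identifier) pairs, for example a MAC address taken from Calling-Station-Id. How you extract those depends on your server, and the function name and the default threshold of 20 are only illustrative.

```python
from collections import Counter


def noisy_clients(events, threshold=20):
    """Return {client: worst per-second request count} for clients that
    reach the threshold within any single second.

    events: iterable of (epoch_seconds, client_id) tuples, assumed to be
    extracted from your own RADIUS log.
    """
    per_second = Counter((int(ts), client) for ts, client in events)
    worst = {}
    for (_second, client), count in per_second.items():
        if count >= threshold:
            worst[client] = max(worst.get(client, 0), count)
    return worst


if __name__ == "__main__":
    # One client hammering the server 25 times in a second, one behaving.
    sample = [(100.0 + i / 25, "aa:bb") for i in range(25)]
    sample += [(100.2, "cc:dd"), (101.5, "cc:dd")]
    print(noisy_clients(sample))
```

The output is a short list of offenders to block or chase individually, which keeps well-behaved clients with odd auth behaviour out of any exclusion.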

Thanks
Jake Snyder


Sent from my iPhone

> On Mar 9, 2016, at 1:53 PM, Lee H Badman  wrote:
> 
> I have to disagree with 120 second client exclusion timer- that in itself can 
> be devastating. I recommend 5 or 10 seconds.
> 
> Lee Badman
> Network Architect/Wireless TME
> Syracuse University
> 315.443.3003
> 
> -Original Message- 
> From: Jake Snyder [jsnyde...@gmail.com]
> Received: Wednesday, 09 Mar 2016, 16:05
> To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU [WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU]
> Subject: [WIRELESS-LAN] Recent Radius Meltdowns
> 
> Just wanted to throw this out to the educause community to see if others are 
> seeing this.  Although this is not ultimately a problem with Higher Ed, the 
> large-scale RADIUS deployments in higher ed result in more impact.
> 
> Several weeks ago we had a higher ed customer whose RADIUS environment 
> started periodically melting down.  The customer was running Cisco 
> infrastructure and ACS 5.x on the back end.
> 
> In terms of changes, there were no recent changes to either the wireless 
> network or the RADIUS environment.  The only recent change was patches applied 
> to the Windows environment.
> 
> Ultimately, the cause was found to be that the AD environment was taking an 
> excessive time to respond to NTLM authentications.  There was no ultimate fix 
> found, but troubleshooting led us to changing the MaxConcurrentAPI on the 
> Windows servers, which helped enough to stop the problem from being 
> a daily occurrence.
> 
> About a week later, this same customer reported to me, after visiting another 
> university campus, that its RADIUS environment was also experiencing these 
> issues.
> 
> Fast forward a couple of weeks, and I had a public utility customer seeing this same 
> issue.  Suddenly flags went up that this is more widespread than just a couple of 
> Higher Ed customers.
> 
> Now I'm sitting at #ATM16, and after talking with other Higher Ed engineers and a 
> large retail customer, it MAY be impacting non-Cisco infrastructure as well.  
> My assumption is anything performing
> 
> Below are some of the links that talk about this change to the 
> MaxConcurrentAPI.  I believe these two customers made changes anywhere from 2 
> to 20.  I know some of these customers are on this educause list.  I'm not 
> advocating a specific value; I assume that different environments will need 
> different values.
> 
> 
> https://support.microsoft.com/en-us/kb/109626
> 
>  
> 
> https://blogs.technet.microsoft.com/ad/2008/09/23/updated-ntlm-and-maxconcurrentapi-concerns/
> 
> 
> 
> Hopefully this helps anyone who has started to see these issues in the last 
> few weeks.  Also, if you're having this, please reply and let the community 
> know infrastructure, radius and possibly AD environment versions.
> 
> 
> Also, for the Cisco folks, here's a great doc that you should read.
> 
> 
> 
> http://www.cisco.com/c/en/us/support/docs/wireless-mobility/wireless-lan-wlan/118703-technote-wlc-00.html
> 
> 
> 
> 
> 
> ** Participation and subscription information for this EDUCAUSE 
> Constituent Group discussion list can be found at 
> http://www.educause.edu/groups/.



RE: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-09 Thread Lee H Badman
I have to disagree with the 120-second client exclusion timer; that in itself 
can be devastating. I recommend 5 or 10 seconds.

Lee Badman
Network Architect/Wireless TME
Syracuse University
315.443.3003

-Original Message-
From: Jake Snyder [jsnyde...@gmail.com]
Received: Wednesday, 09 Mar 2016, 16:05
To: WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU [WIRELESS-LAN@LISTSERV.EDUCAUSE.EDU]
Subject: [WIRELESS-LAN] Recent Radius Meltdowns

Just wanted to throw this out to the EDUCAUSE community to see if others are 
seeing this.  Although this is not ultimately a Higher Ed problem, the 
large-scale RADIUS deployments in higher ed result in more impact.

Several weeks ago we had a higher ed customer whose RADIUS environment started 
periodically melting down.  The customer was running Cisco infrastructure and 
ACS 5.x on the back end.

In terms of changes, there were no recent changes to either the wireless 
network, or RADIUS environment.  The only recent change was patches applied to 
the Windows environment.

Ultimately, the cause was found to be that the AD environment was taking an 
excessive time to respond to NTLM authentications.  No ultimate fix was 
found, but troubleshooting led us to change the MaxConcurrentAPI setting on 
the Windows servers, which helped enough to keep the problem from being a 
daily occurrence.

About a week later, this same customer reported to me that, while visiting 
another university campus, they learned that campus's RADIUS environment was 
also experiencing these issues.

Fast forward a couple of weeks, and I had a public utility customer seeing 
this same issue.  Suddenly flags went up that this is more widespread than 
just a couple of Higher Ed customers.

Now I'm sitting at #ATM16 talking with other Higher Ed engineers and a large 
retail customer, and it MAY be impacting non-Cisco infrastructure as well.  My 
assumption is anything performing

Below are some of the links that talk about this change to MaxConcurrentAPI.  
I believe these two customers set values anywhere from 2 to 20.  I know some 
of these customers are on this EDUCAUSE list.  I'm not advocating a specific 
value; I assume that different environments will need different values.
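
Since the right MaxConcurrentAPI value differs per environment, here is a rough sizing sketch. It assumes the commonly cited estimate from Microsoft's NTLM tuning guidance, based on Netlogon performance counters: (semaphore acquires + semaphore timeouts) x average semaphore hold time / collection window. That formula is recalled from memory and is not stated anywhere in this thread, so verify it against the linked articles before relying on it. The function and parameter names are hypothetical.

```python
import math


def estimate_max_concurrent_api(acquires, timeouts, avg_hold_seconds,
                                window_seconds):
    """Estimate a MaxConcurrentAPI value from Netlogon counter samples.

    Assumed formula (check against Microsoft's guidance):
        (acquires + timeouts) * avg_hold_seconds / window_seconds
    `acquires` and `timeouts` are the counter deltas observed during the
    collection window; the result is rounded up and never less than 1.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    needed = (acquires + timeouts) * avg_hold_seconds / window_seconds
    return max(1, math.ceil(needed))
```

For example, 9,000 acquires and 1,000 timeouts over a 60-second window with a 0.05-second average hold time gives an estimate of 9, which falls inside the 2 to 20 range mentioned above.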


https://support.microsoft.com/en-us/kb/109626

https://blogs.technet.microsoft.com/ad/2008/09/23/updated-ntlm-and-maxconcurrentapi-concerns/

Hopefully this helps anyone who has started to see these issues in the last few 
weeks.  Also, if you're having this, please reply and let the community know 
infrastructure, radius and possibly AD environment versions.

Also, for the Cisco folks, here's a great doc that you should read.

http://www.cisco.com/c/en/us/support/docs/wireless-mobility/wireless-lan-wlan/118703-technote-wlc-00.html





Re: [WIRELESS-LAN] Recent Radius Meltdowns

2016-03-09 Thread Holland, Ryan
Thanks, Jake. We are experiencing this as we speak.
- Patched AD servers on 2/28/16
- Noticed one radius server reporting SAMBA/NTLM slow response times on 3/2 and 
3/3
- Took that server out of service
- A second radius server reporting same issue 3/8 and today, 3/9

Aruba Controllers
Aruba ClearPass
Win2K12 AD servers

Ryan Holland
Senior Network Engineer
The Ohio State University
Office of the Chief Information Officer
Telecommunications Network Center (TNC)
320 W. 8th Ave.
Columbus, OH 43201
614-292-9906 Office
holland@osu.edu 
ocio.osu.edu

On Mar 9, 2016, at 4:05 PM, Jake Snyder <jsnyde...@gmail.com> wrote:

Just wanted to throw this out to the EDUCAUSE community to see if others are 
seeing this.  Although this is not ultimately a Higher Ed problem, the 
large-scale RADIUS deployments in higher ed result in more impact.

Several weeks ago we had a higher ed customer whose RADIUS environment started 
periodically melting down.  The customer was running Cisco infrastructure and 
ACS 5.x on the back end.

In terms of changes, there were no recent changes to either the wireless 
network, or RADIUS environment.  The only recent change was patches applied to 
the Windows environment.

Ultimately, the cause was found to be that the AD environment was taking an 
excessive time to respond to NTLM authentications.  No ultimate fix was 
found, but troubleshooting led us to change the MaxConcurrentAPI setting on 
the Windows servers, which helped enough to keep the problem from being a 
daily occurrence.

About a week later, this same customer reported to me that, while visiting 
another university campus, they learned that campus's RADIUS environment was 
also experiencing these issues.

Fast forward a couple of weeks, and I had a public utility customer seeing 
this same issue.  Suddenly flags went up that this is more widespread than 
just a couple of Higher Ed customers.

Now I'm sitting at #ATM16 talking with other Higher Ed engineers and a large 
retail customer, and it MAY be impacting non-Cisco infrastructure as well.  My 
assumption is anything performing

Below are some of the links that talk about this change to MaxConcurrentAPI.  
I believe these two customers set values anywhere from 2 to 20.  I know some 
of these customers are on this EDUCAUSE list.  I'm not advocating a specific 
value; I assume that different environments will need different values.


https://support.microsoft.com/en-us/kb/109626

https://blogs.technet.microsoft.com/ad/2008/09/23/updated-ntlm-and-maxconcurrentapi-concerns/

Hopefully this helps anyone who has started to see these issues in the last few 
weeks.  Also, if you're having this, please reply and let the community know 
infrastructure, radius and possibly AD environment versions.

Also, for the Cisco folks, here's a great doc that you should read.


http://www.cisco.com/c/en/us/support/docs/wireless-mobility/wireless-lan-wlan/118703-technote-wlc-00.html

