Re: Distributed ppolicy state

2009-10-24 Thread Volker Lendecke
On Fri, Oct 23, 2009 at 02:15:40PM -0700, Howard Chu wrote:
> I'm not sure you're trying to solve the right problem yet. I'm pretty 
> unconvinced that account lockout is a good solution to anything, in 
> general. That's why I added login rate control to the latest ppolicy draft, 
> where the DSA simply starts inserting delays before responding to failed 
> authc attempts. As I see it, rate control can be managed completely within 
> a single DSA and no state ever needs to be replicated outward on any 
> particular schedule. But at the moment I haven't yet thought about how well 
> this will work in all the possible deployment scenarios.
> 
> So once again, what's important here is to analyze what are the types of 
> attacks we expect to see, and how particular defense strategies will 
> behave, and how effectively they will fend off those attacks. Until you've 
> outlined the problems, you don't have any framework for designing the 
> solution.

Just a quick comment: The way we understand NT4 is that the
failed attempts are counted locally and only the lockout is
replicated. This reduces the load a lot.

Volker




Re: Distributed ppolicy state

2009-10-23 Thread Howard Chu

Volker Lendecke wrote:
> On Fri, Oct 23, 2009 at 02:15:40PM -0700, Howard Chu wrote:
>> I'm not sure you're trying to solve the right problem yet. I'm pretty
>> unconvinced that account lockout is a good solution to anything, in
>> general. That's why I added login rate control to the latest ppolicy draft,
>> where the DSA simply starts inserting delays before responding to failed
>> authc attempts. As I see it, rate control can be managed completely within
>> a single DSA and no state ever needs to be replicated outward on any
>> particular schedule. But at the moment I haven't yet thought about how well
>> this will work in all the possible deployment scenarios.
>>
>> So once again, what's important here is to analyze what are the types of
>> attacks we expect to see, and how particular defense strategies will
>> behave, and how effectively they will fend off those attacks. Until you've
>> outlined the problems, you don't have any framework for designing the
>> solution.
>
> Just a quick comment: The way we understand NT4 is that the
> failed attempts are counted locally and only the lockout is
> replicated. This reduces the load a lot.


That's correct. But it also means that in an environment with M DSAs and N 
failures before lockout, an attacker can potentially get NxM attempts before 
being stopped. With a count/lockout-only strategy the attacker can reach NxM 
in a fairly small amount of time, long before any system-wide IDS can react.

In many installations this is unacceptable.

Again, this is why IMO delaying failed login attempts is a better defense - 
limiting the number of attempts an attacker can launch limits the attacker's 
overall effectiveness. Simple lockouts allow an attacker to quickly plow thru 
one account and immediately move on to the next; they don't impede an attacker 
at all. (This is also the same strategy I use in anti-spam filters on SMTP 
servers - delay the server's response to mail coming from a blacklisted 
server, rather than rejecting it immediately, and you have effectively slowed 
the propagation rate of spam on the network.)
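
To make the idea concrete, here is a minimal sketch (hypothetical code, not the 
ppolicy overlay itself; the names and the 8-second cap are just assumptions) of 
a DSA throttling failed binds using nothing but local state:

/* Hypothetical sketch only: not the actual ppolicy overlay code.
 * It tracks failed binds for one account purely locally and sleeps
 * before answering a failed attempt, so an attacker's guess rate is
 * capped without replicating any state.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_DELAY 8   /* cap the penalty at 8 seconds (assumed value) */

struct authc_state {
    char dn[256];
    unsigned failures;        /* recent failed binds, purely local */
};

/* returns 0 on success, -1 on failure (after the delay) */
static int try_bind(struct authc_state *st, int password_ok)
{
    if (password_ok) {
        st->failures = 0;
        return 0;
    }
    st->failures++;
    /* linear back-off; a real policy might use an exponential curve */
    unsigned delay = st->failures < MAX_DELAY ? st->failures : MAX_DELAY;
    sleep(delay);
    return -1;
}

int main(void)
{
    struct authc_state st = { "uid=jdoe,dc=example,dc=com", 0 };
    for (int i = 0; i < 3; i++)
        printf("attempt %d: %s\n", i + 1,
               try_bind(&st, 0) ? "rejected (after delay)" : "ok");
    return 0;
}

The point is that the penalty is computed entirely from state the DSA already 
holds, so nothing has to be replicated for the throttle to take effect.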


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Distributed ppolicy state

2009-10-23 Thread Howard Chu

Brett @Google wrote:
> On Thu, Oct 22, 2009 at 5:44 PM, Howard Chu <h...@symas.com> wrote:
>> In the case of a local, load-balanced cluster of replicas, where the
>> network latency between DSAs is very low, the natural coalescing of
>> updates may not occur as often. Still, it would be better if the
>> updates didn't happen at all. And in such an environment, where the
>> DSAs are so close together that latency is low, distributing reads
>> is still cheaper than distributing writes. So, the correct way to
>> implement this global state is to keep it distributed separately
>> during writes, and collect it during reads.
>
> I'd think that to indicate the topology you would create some
> administrative name, perhaps a simple string "sales west" or "cluster
> one" to indicate a topological region, and you would specify for each
> DSA which administrative name or topology it is logically part of. Then
> this administrative region name + unique identifier of the principal in
> question, could be used as a key to hold a simple locked / unlocked
> boolean value on the replica's parent.


I'm not sure you're trying to solve the right problem yet. I'm pretty 
unconvinced that account lockout is a good solution to anything, in general. 
That's why I added login rate control to the latest ppolicy draft, where the 
DSA simply starts inserting delays before responding to failed authc attempts. 
As I see it, rate control can be managed completely within a single DSA and no 
state ever needs to be replicated outward on any particular schedule. But at 
the moment I haven't yet thought about how well this will work in all the 
possible deployment scenarios.


So once again, what's important here is to analyze what are the types of 
attacks we expect to see, and how particular defense strategies will behave, 
and how effectively they will fend off those attacks. Until you've outlined 
the problems, you don't have any framework for designing the solution.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Distributed ppolicy state

2009-10-22 Thread Brett @Google
On Thu, Oct 22, 2009 at 5:44 PM, Howard Chu  wrote:

> In the case of a local, load-balanced cluster of replicas, where the
> network latency between DSAs is very low, the natural coalescing of updates
> may not occur as often. Still, it would be better if the updates didn't
> happen at all. And in such an environment, where the DSAs are so close
> together that latency is low, distributing reads is still cheaper than
> distributing writes. So, the correct way to implement this global state is
> to keep it distributed separately during writes, and collect it during
> reads.
>

I'd think that to indicate the topology you would create some administrative
name, perhaps a simple string "sales west" or "cluster one" to indicate a
topological region, and you would specify for each DSA which administrative
name or topology it is logically part of. Then this administrative region
name + unique identifier of the principal in question, could be used as a
key to hold a simple locked / unlocked boolean value on the replica's
parent.

Above would give coarse-grained control with little overhead. All DSAs
could keep track of password failures locally, and push the lock value up to
their provider only if retries have been exceeded for a particular principal
and administrative domain on a particular server, thus locking any further use
of that principal under the same administrative region. This administrative
"lock" value would be replicated downward to the other DSAs in the
administrative domain, and used for locking that principal on all DSAs in
that domain.
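
A rough sketch of that coarse-grained scheme (all names, the retry limit, and
the push/replicate hooks are hypothetical; this is not existing slapd or
syncrepl code):

/* Hypothetical illustration of the coarse-grained scheme: failures are
 * counted locally, and only a boolean lock keyed by region + principal
 * is pushed up to the provider once the local retry limit is hit.
 */
#include <stdio.h>

#define RETRY_LIMIT 5

/* stand-in for replicating the lock entry up to the provider */
static void push_lock_to_provider(const char *key)
{
    printf("replicate lock upward: %s = TRUE\n", key);
}

static void record_local_failure(const char *region, const char *principal,
                                 unsigned *local_failures)
{
    if (++(*local_failures) == RETRY_LIMIT) {
        /* key = administrative region + principal, e.g.
         * "sales west/uid=jdoe,dc=example,dc=com" */
        char key[512];
        snprintf(key, sizeof(key), "%s/%s", region, principal);
        push_lock_to_provider(key);
    }
}

int main(void)
{
    unsigned failures = 0;
    for (int i = 0; i < RETRY_LIMIT; i++)
        record_local_failure("sales west", "uid=jdoe,dc=example,dc=com",
                             &failures);
    return 0;
}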

Alternatively, for more fine-grained capture of password failure counts, each
DSA could push a key containing the administrative name + unique identifier of
the principal in question + its replica id, with a simple count of password
failures. The value would be stored locally, but pushed up to the provider
only when the value changes, and as each consumer would have its own
private namespace on the provider, there would be no collisions nor any need
to wait for exclusive access to write to it.

The provider could aggregate these values periodically without the need for
an exclusive lock, and the aggregated value could then be replicated
downwards to the replicas for use in controlling access to accounts.
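
A sketch of the fine-grained variant, showing why the private per-replica
namespace avoids write collisions and why the provider's aggregation pass
needs no exclusive lock (again purely illustrative; names and sizes are
assumptions):

/* Hypothetical sketch: each replica owns a private counter slot (keyed
 * by its replica id), so pushes never collide, and the provider can sum
 * the slots periodically without taking an exclusive lock.
 */
#include <stdio.h>

#define MAX_REPLICAS 8
#define LOCKOUT_THRESHOLD 5

/* per-principal state as the provider sees it */
struct principal_state {
    const char *principal;
    unsigned failures[MAX_REPLICAS];  /* one slot per replica id */
};

/* a consumer pushes only its own slot, so no two writers ever meet */
static void consumer_push(struct principal_state *p, int rid, unsigned count)
{
    p->failures[rid] = count;
}

/* periodic aggregation on the provider: read-only sweep, no lock */
static int provider_is_locked(const struct principal_state *p)
{
    unsigned total = 0;
    for (int rid = 0; rid < MAX_REPLICAS; rid++)
        total += p->failures[rid];
    return total >= LOCKOUT_THRESHOLD;
}

int main(void)
{
    struct principal_state p = { "uid=jdoe,dc=example,dc=com", {0} };
    consumer_push(&p, 0, 2);   /* 2 failures seen on replica 0 */
    consumer_push(&p, 3, 3);   /* 3 failures seen on replica 3 */
    printf("locked: %s\n", provider_is_locked(&p) ? "yes" : "no");
    return 0;
}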

Cheers
Brett


Distributed ppolicy state

2009-10-22 Thread Howard Chu
One of the major concerns I still have with password policy is the issue of 
the overhead involved in maintaining so many policy state variables for 
authentication failure / lockout tracking. It turns what would otherwise be 
pure read operations into writes, which is already troublesome for some cases. 
But in the context of replication, the problem can be multiplied by the number 
of replicas in use. Avoiding this write magnification effect is one of the 
reasons the initial versions of the ppolicy overlay explicitly prevented its 
state updates from being replicated. Replicating these state updates for every 
authentication request simply won't scale.


Unfortunately the braindead account lockout policy really doesn't work well 
without this sort of state information.


The problem is not much different from the scaling issues we have to deal with 
in making code run well on multiprocessor / multicore machines. Having 
developed effective solutions to those problems, we ought to be able to apply 
the same thinking to this as well.


The key to excellent scaling is the so-called "shared-nothing" approach, where 
every processor just uses its own local resources and never has to synchronize 
with ( == wait for) any other processor, but for the most part it's a design 
ideal, not something you can do perfectly in practice. However, we have some 
recent examples in the slapd code where we've been able to use this approach 
to good effect.


In the connection manager, we used to handle monitoring/counter information 
(number of ops, type of ops, etc) in a single counter, which required a lot of 
locking overhead to update. We now use an array of counters per thread, and 
each thread can update its own counters for free, completely eliminating the 
locking overhead. The trick is in recognizing that this type of info is 
written far more often than it is read, so optimizing the update case is far 
more important than optimizing the query case. When someone reads the 
counters that are exposed in back-monitor, we simply iterate across the 
arrays and tally up the totals at that point. Since there's no particular requirement 
that all the counters be read in the same instant in time, all of these 
reads/updates can be performed without locking, so again we get it for free, 
no synchronization overhead at all.
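
For illustration, here is a stripped-down model of that per-thread counter
technique (an assumed sketch, not the actual connection-manager code):

/* Each thread increments only its own slot, so updates need no locking;
 * a reader just sweeps the array and tallies an (approximate) total,
 * which is fine since no exact instantaneous snapshot is required.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static unsigned long ops_completed[NTHREADS];  /* one slot per thread */

struct worker_arg { int idx; };

static void *worker(void *v)
{
    struct worker_arg *a = v;
    for (int i = 0; i < 100000; i++)
        ops_completed[a->idx]++;        /* no lock: slot is private */
    return NULL;
}

/* what a back-monitor style read would do: sum the slots */
static unsigned long tally(void)
{
    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += ops_completed[i];
    return total;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct worker_arg args[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        args[i].idx = i;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("total ops: %lu\n", tally());
    return 0;
}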


So, it should now be obvious where we should go with the replication issue...

Ideally, you want password policy enforcement rules that don't even need 
global state at all. IMO, the best approach is still to keep policy state 
private to each DSA, and this still makes sense for DSAs that are 
topologically remote. E.g., assume you have a pair of servers, each in two 
separate cities. It's unlikely that a login attempt on one server will be in 
any way connected to a simultaneous login attempt on the other server. And in 
the face of a bot attack, the rate of logins will probably be high enough to 
swamp the channel between the two servers, resulting in queueing delays that 
ultimately aggregate several of the updates on the attacked server into just a 
single update on the remote server. (E.g., N separate failure updates on one 
server will coalesce into a single update on the remote server.)
Therefore, most of the time it's pointless for each server to try to 
immediately update the other with login failure info.


In the case of a local, load-balanced cluster of replicas, where the network 
latency between DSAs is very low, the natural coalescing of updates may not 
occur as often. Still, it would be better if the updates didn't happen at all. 
And in such an environment, where the DSAs are so close together that latency 
is low, distributing reads is still cheaper than distributing writes. So, the 
correct way to implement this global state is to keep it distributed 
separately during writes, and collect it during reads.
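
As a toy model of "distribute the writes, collect on read" (an assumed design
sketch, not existing code): each DSA increments only its own private count, and
the scattered counts are gathered only when a policy decision actually has to
be made:

/* A write touches exactly one DSA's own state; nothing is replicated.
 * The read path gathers the scattered state at decision time.
 */
#include <stdio.h>

#define NUM_DSAS 3

/* each element stands for the failure count stored privately by one DSA */
struct dsa_state {
    const char *server_id;
    unsigned failures;
};

static void note_failure(struct dsa_state *self)
{
    self->failures++;
}

static unsigned collect_failures(const struct dsa_state dsas[], int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        total += dsas[i].failures;
    return total;
}

int main(void)
{
    struct dsa_state cluster[NUM_DSAS] = {
        { "dsa1", 0 }, { "dsa2", 0 }, { "dsa3", 0 }
    };
    note_failure(&cluster[0]);
    note_failure(&cluster[2]);
    note_failure(&cluster[2]);
    printf("total failures seen across the cluster: %u\n",
           collect_failures(cluster, NUM_DSAS));
    return 0;
}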


I'm looking for a way to express this in the schema and in the ppolicy draft, 
but I'm not sure how just yet. It strikes me that X.500 probably already has a 
type of distributed/collective attribute but I haven't looked yet.


Also I think we can take this a step further, but haven't thought it through 
all the way yet. If you typically have login failures coming from a single 
client, it should be sufficient to always route that client's requests to the 
same DSA, and have all of its failure tracking done locally/privately on that DSA.
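
A possible sketch of that affinity idea, as a front end that hashes the client
address to pick a DSA (not an existing OpenLDAP feature; everything here is
hypothetical):

/* Hash the client address so the same client always lands on the same
 * DSA, keeping its failure tracking purely local to that DSA.
 */
#include <stdio.h>

#define NUM_DSAS 4

/* FNV-1a hash of the client address string */
static unsigned hash_client(const char *addr)
{
    unsigned h = 2166136261u;
    for (; *addr; addr++) {
        h ^= (unsigned char)*addr;
        h *= 16777619u;
    }
    return h;
}

static int pick_dsa(const char *client_addr)
{
    return hash_client(client_addr) % NUM_DSAS;
}

int main(void)
{
    const char *clients[] = { "192.0.2.10", "192.0.2.10", "198.51.100.7" };
    for (int i = 0; i < 3; i++)
        printf("%s -> DSA %d\n", clients[i], pick_dsa(clients[i]));
    return 0;
}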


At the other end, if you have an attack mounted by a number of separate 
machines, it's not clear that you must necessarily collect the state from 
every DSA on every authentication request. E.g., if you're setting a lockout 
based on the number of login failures, once the failure counter on a single 
DSA reaches the lockout threshold, it doesn't matter any more what the failure 
counter is on any other DSA, so that DSA no longer needs to look for the 
values on any other node.
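
In code form, that short-circuit looks roughly like this (hypothetical sketch;
the threshold and the remote-collection step are stand-ins):

/* If the local failure count already meets the lockout threshold, there
 * is no need to fetch counts from any other DSA; remote state only
 * matters while the local count is still below it.
 */
#include <stdio.h>

#define LOCKOUT_THRESHOLD 5

/* stand-in for the (expensive) step of collecting remote counters */
static unsigned collect_remote_failures(void)
{
    printf("querying other DSAs...\n");
    return 3;   /* pretend the other replicas report 3 failures total */
}

static int is_locked_out(unsigned local_failures)
{
    if (local_failures >= LOCKOUT_THRESHOLD)
        return 1;                       /* decided locally, no lookup */
    return local_failures + collect_remote_failures() >= LOCKOUT_THRESHOLD;
}

int main(void)
{
    printf("local=6: locked=%d\n", is_locked_out(6));  /* no remote query */
    printf("local=2: locked=%d\n", is_locked_out(2));  /* needs the lookup */
    return 0;
}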


If a client comes along and does a search to retrieve the polic