Daniel Margolis writes:
>
> On Mon, May 16, 2016 at 11:42 AM, Vladimir Dubrovin
> <[email protected] <mailto:[email protected]>> wrote:
>
>     There are at least 2 reasons to specify a minimum cache time for
>     negative results in the standard:
>
>     1. One can use this time to rate-limit requests to the policy
>     server and prevent DDoS against the policy server without the risk
>     of affecting clients. It's really important, because the policy
>     server is a point of failure for the mail system.
>
>
> Can't one do this anyway? As long as the policy DNS record is not
> updated, clients should be requesting new policies over HTTPS at
> approximately the rate of new mail from never-before-seen
> senders--which is the same as the case you describe (where there's an
> HTTP-specific cache time in addition to the DNS signaling mechanism).
>
> Is the rate-limiting logic actually different if we add an
> HTTP-specific cache time? I think you're saying that if such a
> cache-time exists, we can limit each client to 1 / {cache time}
> requests per second.
We can limit each client to, e.g., 3 requests per cache time. A longer
minimum cache time means you can apply stronger request limits.
> But in fact you could limit any client to far less than that (if
> you're doing per client rate limiting) even without the cache time;
> each client should only make a request ~every {policy expiration}! (Of
> course, you probably want to tolerate more frequent updates than that
> in case the client clears the cache, etc, but you get the point.)
Again, the client will make a request ~every {policy expiration} only if
it can retrieve the policy successfully. Imagine your policy server is
under DDoS and 100% of requests end in a timeout: valid clients cannot
receive the policy and will retry again and again without any
client-side rate limit. How can you rate-limit the DoS without blocking
valid clients, given that the problem is already in progress and a
policy request from a valid client may also time out? If you know a
valid client can only make 1 request in 5 minutes, you can easily block
even a highly distributed DoS attack without affecting valid clients.
If there are no client-side rate limits and a client once falls under
this server-side rate limit (due to a DDoS in progress, server
misconfiguration, or a problem on the client itself), there is a chance
this client will stay under the rate limit forever, because request
rates from the client may grow with the message queue while requests
keep failing.
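To illustrate, here is a minimal Python sketch of the server-side
limiter that such a guaranteed minimum cache time would enable; the
5-minute window, the 3-request budget, and all names are my assumptions
for illustration, not anything from the draft:

```python
import time

# Hypothetical server-side limiter. If the standard guarantees a minimum
# negative-cache time, a well-behaved client has no reason to exceed
# MAX_REQUESTS fetches per window, so anything above that can be dropped
# without affecting valid clients.
MIN_CACHE_TIME = 300   # assumed 5-minute minimum cache time, in seconds
MAX_REQUESTS = 3       # e.g. 3 requests per cache window, as above

_windows = {}          # client address -> (window_start, request_count)

def allow_request(client_addr, now=None):
    """Return True if this policy request should be served."""
    now = time.time() if now is None else now
    start, count = _windows.get(client_addr, (now, 0))
    if now - start >= MIN_CACHE_TIME:
        start, count = now, 0          # window expired: reset the counter
    if count >= MAX_REQUESTS:
        return False                   # over budget: drop the request
    _windows[client_addr] = (start, count + 1)
    return True
```

The point is that the drop decision needs no per-client state beyond one
timestamp and a counter, and it cannot hurt a compliant client.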
>
>
>
>     2. I'm not sure it can be applied in the case of a policy server
>     with an MTA as a client, but it's a general approach to avoid
>     situations like this:
>
> Given
>     - You have a single server and can handle 1000 requests per second.
>     - Your current LA (load average) is 30% at 300 rps.
>     - If LA reaches 100%, all requests begin to time out (sadly, it's
>     quite usual for many HTTP applications)
>
>     Now imagine you had some problems (a blackout, network
>     connectivity problems, etc.) for, e.g., 3 hours.
>
>     If negative responses are not cached by the client, the client
>     makes a request for every delivery attempt for every message in
>     the queue.
>
>
>
>     Let's say the client retries a failed attempt every 15 minutes. It
>     means after 3 hours you will have >3000 requests per second
>     instead of the average 300 rps, 3 times more than your server can
>     handle. And rps keeps growing, because all requests time out, so
>     recovering is extremely hard.
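For concreteness, the arithmetic behind that estimate can be checked
directly (the figures are the scenario's assumptions: 300 rps of normal
traffic, a 3-hour outage, a 15-minute retry interval, and one policy
request per delivery attempt):

```python
# Worked numbers for the scenario above; all values are the example's
# assumptions, not measurements.
normal_rps = 300             # average policy-request rate before the outage
outage_seconds = 3 * 3600    # the 3-hour outage
retry_interval = 15 * 60     # each queued message retries every 15 minutes

queued = normal_rps * outage_seconds   # messages accumulated in the queue
retry_rps = queued / retry_interval    # policy requests per second afterwards

print(retry_rps)  # 3600.0 -- more than 3x the 1000 rps the server can handle
```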
>
>
> Hmm. I am a little bit jetlagged and have to think this through a bit
> more. ;)
>
> So, there are two possible scenarios here where the recipient domain
> has a STS TXT record but the HTTPS endpoint is down, I think:
>
> #1: sender does not have a cached policy for the recipient
>
> #2: sender does have a cached policy for the recipient but it's out of
> date and does not validate
>
> Only in scenario #2 is a sender expected to fail message delivery.
> (Section 3.3 specifies that a policy that has not successfully been
> authenticated MUST NOT be used to reject mail; we aren't so clear
> about what to do if the TXT record exists but the policy cannot be
> fetched, but I don't think it's reasonable to expect senders to fail
> messages in the case of #1.)
Delivery may initially fail and the queue may grow because of the same
networking problem, not because of STS.
Even if everything works as supposed and there are no growing queues,
the situation may be quite dangerous. Imagine: it is the year 2020, STS
is supported by most peers, and you begin to implement STS in a large
mail system. You publish the policy and publish the TXT record. During
the first fractions of a second after the record is published, instead
of the supposed one request per peer per policy expiration time, you
have nearly one request per message, because messages come from
different peers, and you may find your server flooded with requests and
completely unresponsive before any successful policy reply. The same
situation may return after networking errors. Or imagine an
administrator setting an invalid certificate for some time while the
policy expiration is still short: it will again lead to a policy
request for every message and to a DoS against the server, even after
the certificate problem is corrected.
Of course, this mostly affects implementors with short-lived STS
policies, but every STS implementor is likely to start with a short
policy.
I believe #2 must be considered a tmpfail (in the sense of tmpfail in
SPF/DKIM/DMARC), because it can be the result of temporary networking
problems and should lead to a delayed, not a bounced, message.
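In a sender's delivery logic, the distinction between the two scenarios
might look like this (a Python sketch of my proposal; the names and the
cache representation are illustrative, not from the draft):

```python
from enum import Enum

class Action(Enum):
    DELIVER = "deliver"   # proceed (enforcing a cached policy if one is valid)
    DEFER = "defer"       # tmpfail: keep the message queued and retry later
    # note: a failed policy fetch alone never produces a bounce here

def on_fetch_failure(has_cached_policy, cache_expired):
    """Decide what to do when the HTTPS policy fetch fails.

    Scenario #1 (no cached policy): deliver without STS enforcement.
    Scenario #2 (cached policy expired and the refresh failed): treat it
    as a temporary failure, like tmpfail in SPF/DKIM/DMARC, and defer.
    """
    if not has_cached_policy:
        return Action.DELIVER          # scenario #1
    if cache_expired:
        return Action.DEFER            # scenario #2: delay, don't bounce
    return Action.DELIVER              # cache still valid: keep using it
```

Deferring maps naturally onto the SMTP queue the sender already has:
the message simply waits for the next retry, by which time the policy
fetch may succeed.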
>
> But in this case, as soon as the sender successfully updates the
> policy cache, all the later queued messages will have access to that
> cache.
> So I would think that only some implementations would have the
> behavior you describe (of fetching the policy, on update, once per
> every message in the sending queue, versus re-checking the cache);
> this is not a guaranteed (or particularly ideal) design.
>
> What do you think?
You're probably right: there is no need to explicitly require negative
response caching. There is a need to limit the rate of policy requests
on the client side, to avoid a situation where temporary problems with
policy retrieval, mistakes, or the initial publishing of the policy
lead to explosive growth of policy requests. The implementation
(negative response caching or something else) may be left to the
implementor. Request rates in the case of unsuccessful policy retrieval
must not be 1000000 times higher than request rates in the case of
successful policy retrieval, so it may be something like "at least 5
minutes between policy requests for the same domain regardless of
policy request results".
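A client-side throttle like that could be as simple as the following
sketch (the 5-minute figure is just the example above, and all names
are mine):

```python
import time

MIN_FETCH_INTERVAL = 300   # "at least 5 minutes", as in the example above
_last_attempt = {}         # domain -> timestamp of the last allowed fetch

def may_fetch_policy(domain, now=None):
    """Allow at most one policy request per domain per MIN_FETCH_INTERVAL,
    regardless of whether the previous request succeeded or failed."""
    now = time.time() if now is None else now
    last = _last_attempt.get(domain)
    if last is not None and now - last < MIN_FETCH_INTERVAL:
        return False       # too soon: reuse the cached result or defer
    _last_attempt[domain] = now
    return True
```

Because the timestamp is recorded whether or not the fetch succeeds,
the request rate toward an unresponsive policy server is bounded the
same way as toward a healthy one.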
--
Vladimir Dubrovin
@Mail.Ru
_______________________________________________
Uta mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/uta