> If you nevertheless observe 500 being returned in practice, this might be the
> actual thing to focus on.
Even with fewer than 100 requests and 4 workers, I've experienced it multiple
times: simply because the number of cache keys was exceeded, nginx returned 500
internal server errors for new uncached requests for hours on end (on that
particular instance, I have about 300 expired keys per 5 minutes).
When it happens again, I'll obviously investigate further, since it's not
supposed to happen.
> an attacker can easily request the same resource several times, moving it to
> the "normal" category
Correct, an attacker can almost always find a way if they want to; I've just
yet to see one "smart" enough to request the same thing multiple times.
Even if it's not an attacker but a misconfigured application (one that isn't
directly managed by whoever manages the nginx server), the same applies. If an
application passes identifiers through in the URI (imagine gclid or fbclid
hashes), these IDs are generally unique per visitor, so the query strings
differ and we only see each request once or twice in 99% of the cases where
this happens. As a result, we do not fill the disk, because of min_uses, but we
do fill the memory, because the keys aren't cleared out before reaching the
inactive timeout.
So at least in use cases like that, we'd often be able to mitigate somewhat
misconfigured applications; it's quite common within the CDN industry to see
this issue anyway. While those running the CDN obviously have to reach out to
the customer and ask them to fix their application, it would be awesome to have
a more proactive approach available that would reduce the urgency of a fix.
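
As a toy model of that more proactive approach (a short cutoff for keys that
never reach min_uses, as in the proposal quoted further down), the following
Python sketch may help. It is not how nginx works today: KeyTable,
unused_timeout and the other names are hypothetical and exist only for
illustration.

```python
# Toy model: keys that have not reached min_uses are dropped after a short
# timeout instead of waiting for "inactive". Hypothetical, not nginx code.
from dataclasses import dataclass


@dataclass
class Entry:
    uses: int
    last_seen: float


class KeyTable:
    def __init__(self, min_uses: int, inactive: float, unused_timeout: float):
        self.min_uses = min_uses
        self.inactive = inactive              # normal lifetime, in seconds
        self.unused_timeout = unused_timeout  # hypothetical extra knob
        self.entries = {}                     # cache key -> Entry

    def request(self, key: str, now: float) -> None:
        """Record one request for key at time now."""
        entry = self.entries.setdefault(key, Entry(uses=0, last_seen=now))
        entry.uses += 1
        entry.last_seen = now

    def expire(self, now: float) -> None:
        """Drop entries whose idle time exceeds the limit for their category."""
        def limit(e: Entry) -> float:
            if e.uses >= self.min_uses:
                return self.inactive
            return self.unused_timeout

        self.entries = {k: e for k, e in self.entries.items()
                        if now - e.last_seen < limit(e)}


if __name__ == "__main__":
    table = KeyTable(min_uses=2, inactive=24 * 3600, unused_timeout=60)
    table.request("/a.css", now=0)
    table.request("/a.css", now=1)
    table.request("/a.css?fbclid=x", now=1)
    table.expire(now=120)
    print(sorted(table.entries))
```

In this model the key requested twice survives, while the one-off fbclid key is
gone after two minutes rather than a day.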
What I hear is that you don't see the point of such a feature, and that's fine.
I guess the alternative is to use Lua to hook into nginx for the cache
metadata/shm (this probably needs a custom nginx module as well, since the shm
isn't exposed in Lua); one should then be able to wipe out the useless keys
that way.
Best Regards,
Lucas Rolff
On 18/05/2021, 03.27, "nginx on behalf of Maxim Dounin"
wrote:
Hello!
On Mon, May 17, 2021 at 07:33:43PM +, Lucas Rolff wrote:
> Hi Maxim!
>
> > - The attack you are considering is not about "poisoning". At
> > most, it can be used to make the cache less efficient.
>
> Poisoning is probably the wrong word indeed, and since nginx
> doesn't really handle reaching the limit of keys_zone, it simply
> starts to return a 500 internal server error. So I don't think
> it's making the cache less efficient (Other than you won't be
> able to cache that much), you're ending up breaking nginx
> because when the keys_zone limit has been reached, nginx simply
> starts returning 500 internal server error for items that are
> not already in proxy_cache - if it would do an LRU/LFU on the
> keys - then yes, you could probably end up with a cache less
> efficient.
While 500 is possible in some cases, especially in configurations
with many worker processes and high request concurrency, even in
the worst case it's expected to happen at most for half of the
requests, usually much less than that. Further, cache manager
monitors the number of cache items in the keys_zone, cleaning
things in advance, making 500 almost impossible in practice.
If you nevertheless observe 500 being returned in practice, this
might be the actual thing to focus on.
[...]
> Unless nginx very recently implemented that reaching keys_zone
> limit, will start purging old cache - then no, it would still
> break the nginx for non-cached requests (returning 500 internal
> server error). If nginx has started to purge old things if the
> limit is reached, then sure the attacker would still be able to
> wipe out the cache.
Clearing old cache items when it is not possible to allocate a
cache node dates back to initial cache support in nginx 0.7.44[1].
And cache manager monitoring of the keys_zone and clearing it in
advance dates back to nginx 1.9.13 released about five years
ago[2]. Not sure any of these counts as "very recently".
> But let's say we have an "inactive" set to 24+ hours (Which is
> often used for static files) - an attack where someone would
> append random query strings - those keys would first be removed
> after 24 hours (or higher, depending on the limit) - with a
> separate flag, one could set this counter to something like 60
> seconds (So delete the key from memory if the key hasn't
> reached its min_uses within 60 seconds) - this way, you're
> still rotating those keys out *a lot* faster.
While this may be preferable for some use cases (and sounds close
to the "Segmented LRU" cache policy[3]), this certainly doesn't
protect from