[ https://issues.apache.org/jira/browse/HBASE-28453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Mattingly reassigned HBASE-28453:
-------------------------------------

    Assignee: Ray Mattingly

> Support a middle ground between the Average and Fixed interval rate limiters
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-28453
>                 URL: https://issues.apache.org/jira/browse/HBASE-28453
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Ray Mattingly
>            Assignee: Ray Mattingly
>            Priority: Major
>         Attachments: Screenshot 2024-03-21 at 2.08.51 PM.png,
>                      Screenshot 2024-03-21 at 2.30.01 PM.png
>
>
> h3. Background
> HBase quotas support two rate limiters: a "fixed" and an "average" interval
> rate limiter.
> h4. FixedIntervalRateLimiter
> The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second,
> and it refills a resource allotment on the recurring interval. So you may get
> 10 resources every second, and if you exhaust all 10 resources in the first
> millisecond of an interval then you will need to wait 999ms to acquire even 1
> more resource.
> h4. AverageIntervalRateLimiter
> The average interval rate limiter, HBase's default, allows for more flexibly
> timed refilling of the resource allotment. Extending our previous example,
> say you have a 10 reads/sec quota and you have exhausted all 10 resources
> within 1ms of the last full refill. If you request 1 more read then, rather
> than returning a 999ms wait interval indicating the next full refill time,
> the rate limiter will recognize that you only need to wait 99ms before 1 read
> can be available. After 100ms has passed in aggregate since the last full
> refill, it will support the refilling of 1/10th the limit to facilitate the
> request for 1/10th the resources.
> h3. The Problems with Current RateLimiters
> The problem with the fixed interval rate limiter is that it is too strict
> from a latency perspective. It results in quota limits to which we cannot
> fully subscribe with any consistency.
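As a rough illustration of the two wait-interval calculations described in the Background section, here is a minimal Java sketch. The class and method names (WaitIntervalSketch, fixedWaitMs, averageWaitMs) are hypothetical and this is not the HBase implementation; it assumes the whole allotment was consumed right after the last full refill and that resources accrue linearly for the average case.

```java
// Illustrative sketch only; names are hypothetical, not HBase's RateLimiter API.
final class WaitIntervalSketch {

    /** Fixed interval: nothing comes back until the next full refill. */
    static long fixedWaitMs(long intervalMs, long elapsedSinceRefillMs) {
        return Math.max(0L, intervalMs - elapsedSinceRefillMs);
    }

    /**
     * Average interval: resources accrue at limit/interval, so the caller
     * only waits until `requested` resources have accrued since the last
     * full refill (assumes the allotment was fully exhausted).
     */
    static long averageWaitMs(long intervalMs, long limit, long requested,
                              long elapsedSinceRefillMs) {
        // Ceiling division: first millisecond at which enough has accrued.
        long neededMs = (requested * intervalMs + limit - 1) / limit;
        return Math.max(0L, neededMs - elapsedSinceRefillMs);
    }

    public static void main(String[] args) {
        // 10 resources/sec, all consumed 1ms after the last full refill.
        System.out.println(fixedWaitMs(1000, 1));          // prints 999
        System.out.println(averageWaitMs(1000, 10, 1, 1)); // prints 99
    }
}
```

With the 10 resources/sec example from the description, the fixed calculation yields the 999ms wait and the average calculation yields the 99ms wait.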
> The problem with the average interval rate limiter is that, in practice, it
> is far too optimistic. For example, a real rate limiter might limit to
> 100MB/sec of read IO per machine. Any multigets that come in will require
> only a tiny fraction of this limit; for example, a 64KB block is only 0.06%
> of the total. As a result, the vast majority of wait intervals end up being
> tiny, like <5ms. This can actually cause the inverse of your intention, where
> setting up a throttle causes a DDoS of your RPC layer via continuous
> throttling and ~immediate retrying. I've discussed this problem in
> https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait
> interval as the solution there; after some more thinking, I believe this new
> rate limiter would be a less hacky solution to this deficit, so I'd like to
> close that Jira in favor of this one.
> See the attached chart where I put in place a 10k req/sec/machine throttle
> for this user at 10:43 to try to curb this high traffic, and it resulted in a
> huge spike of req/sec due to the throttle/retry loop created by the
> AverageIntervalRateLimiter.
> h3. PartialIntervalRateLimiter as a Solution
> I've implemented a RateLimiter which allows for partial chunks of the overall
> interval to be refilled; by default these chunks are 10% (or 100ms of a 1s
> interval). I've deployed this to a test cluster at my day job and have seen
> this really help our ability to fully subscribe to a quota limit without
> executing superfluous retries. See the other attached chart which shows a
> cluster undergoing a rolling restart from using FixedIntervalRateLimiter to
> my new PartialIntervalRateLimiter and how it is then able to fully subscribe
> to its allotted 25MB/sec/machine read IO quota.
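To show why chunked refills lead to longer, less retry-prone wait intervals, here is a minimal Java sketch comparing an average-style wait with a chunked wait. It is one possible reading of "partial chunks of the overall interval" and is not the actual PartialIntervalRateLimiter code; the class name, method names, and the chunksPerInterval parameter are all hypothetical, and the sketch assumes the allotment was fully exhausted at the last full refill.

```java
// Illustrative sketch only; not the PartialIntervalRateLimiter from the patch.
final class PartialIntervalSketch {

    /** Average-style wait: resources accrue continuously at limit/interval. */
    static long averageWaitMs(long intervalMs, long limit, long requested,
                              long elapsedSinceRefillMs) {
        long neededMs = (requested * intervalMs + limit - 1) / limit; // ceil
        return Math.max(0L, neededMs - elapsedSinceRefillMs);
    }

    /**
     * Chunked wait (assumption): resources come back only at chunk
     * boundaries, e.g. 10 chunks of 100ms each for a 1s interval.
     */
    static long partialWaitMs(long intervalMs, long limit, int chunksPerInterval,
                              long requested, long elapsedSinceRefillMs) {
        long chunkMs = intervalMs / chunksPerInterval;
        long perChunk = Math.max(1L, limit / chunksPerInterval);
        // Number of whole chunks that must elapse to cover the request (ceil).
        long chunksNeeded = (requested + perChunk - 1) / perChunk;
        return Math.max(0L, chunksNeeded * chunkMs - elapsedSinceRefillMs);
    }

    public static void main(String[] args) {
        long limit = 100L * 1024 * 1024; // 100MB/sec of read IO
        long block = 64L * 1024;         // one 64KB block
        System.out.println(averageWaitMs(1000, limit, block, 0));     // prints 1
        System.out.println(partialWaitMs(1000, limit, 10, block, 0)); // prints 100
    }
}
```

Under these assumptions, a 64KB request against an exhausted 100MB/sec limit is told to wait about 1ms by the average-style calculation but 100ms by the chunked one, which is the difference between a tight throttle/retry loop and a single, meaningful backoff.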