Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-09 Thread Jun Rao
Hi, Rajini, Thanks for the updated KIP. A few more comments. 30. Should we just account for the time in network threads in this KIP too? The issue with doing this later is that existing quotas may be too small and everyone will have to adjust them before upgrading, which is inconvenient. If we

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-09 Thread Rajini Sivaram
I have updated the KIP to use "request.percentage" quotas where the percentage is out of a total of (num.io.threads * 100). I have added the other options considered so far under "Rejected Alternatives". To address Todd's concern about per-thread quotas: Even though the quotas are out of
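
As a rough illustration of the "request.percentage" model described in this message, the sketch below converts a per-user percentage into a share of total I/O thread capacity. The helper name and the example thread count are assumptions made for illustration; they are not taken from the KIP or from Kafka code.

```python
def capacity_share(request_percentage: float, num_io_threads: int) -> float:
    """Fraction of total I/O thread capacity granted by a quota.

    Assumes the quota is expressed out of (num_io_threads * 100), as the
    message above describes. Hypothetical helper, not a Kafka API.
    """
    total_budget = num_io_threads * 100  # e.g. 8 threads -> 800%
    return request_percentage / total_budget


# Assumed example: 8 I/O threads, user quota of 200%.
# The user may consume a quarter of the broker's total I/O thread time.
print(capacity_share(200, 8))
```

Under this reading, a quota of 100 always means "one thread's worth of time", whatever the pool size.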

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-08 Thread Todd Palino
Rajini - I understand what you’re saying, but the point I’m making is that I don’t believe we need to take it into account directly. The CPU utilization of the network threads is directly proportional to the number of bytes being sent. The more bytes, the more CPU that is required for SSL (or

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-08 Thread Jun Rao
Hi, Todd, Thanks for the feedback. I just want to clarify your second point. If the limit percentage is per thread and the thread counts are changed, the absolute processing limit for existing users hasn't changed and there is no need to adjust them. On the other hand, if the limit percentage

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-08 Thread Rajini Sivaram
Hi Todd, Thank you for the review. For SSL, the case that is not covered is Scenario 6 in the KIP that Ismael pointed out. For clusters with only SSL or PLAINTEXT, byte rate quotas work well, but for clusters with both SSL and PLAINTEXT, network thread utilization also needs to be taken into

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Todd Palino
I’ve been following this one on and off, and overall it sounds good to me. - The SSL question is a good one. However, that type of overhead should be proportional to the bytes rate, so I think that a bytes rate quota would still be a suitable way to address it. - I think it’s better to make the

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Becket Qin
I see. Good point about SSL. I just asked Todd to take a look. Thanks, Jiangjie (Becket) Qin On Tue, Mar 7, 2017 at 2:17 PM, Jun Rao wrote: > Hi, Jiangjie, > > Yes, I agree that byte rate already protects the network threads > indirectly. I am not sure if byte rate fully

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Jun Rao
Hi, Jiangjie, Yes, I agree that byte rate already protects the network threads indirectly. I am not sure if byte rate fully captures the CPU overhead in network due to SSL. So, at the high level, we can use request time limit to protect CPU and use byte rate to protect storage and network. Also,

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Becket Qin
Hi Rajini/Jun, The percentage-based reasoning sounds good. One thing I am wondering is that if we assume the network threads are just doing the network IO, can we say bytes rate quota is already sort of a network threads quota? If we take network threads into consideration here, would that be

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Rajini Sivaram
Jun, Thank you for the explanation, I hadn't realized you meant percentage of the total thread pool. If everyone is OK with Jun's suggestion, I will update the KIP. Thanks, Rajini On Tue, Mar 7, 2017 at 5:08 PM, Jun Rao wrote: > Hi, Rajini, > > Let's take your example.

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-07 Thread Jun Rao
Hi, Rajini, Let's take your example. Let's say a user sets the limit to 50%. I am not sure if it's better to apply the same percentage separately to network and io thread pool. For example, for produce requests, most of the time will be spent in the io threads whereas for fetch requests, most of

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-03 Thread Rajini Sivaram
Jun, Agree about the two scenarios. But still not sure about a single quota covering both network threads and I/O threads with per-thread quota. If there are 10 I/O threads and 5 network threads and I want to assign half the quota to userA, the quota would be 750%. I imagine, internally, we
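
The 750% figure in this message follows from pooling both thread types into one per-thread budget. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def pooled_quota(fraction: float, num_io_threads: int, num_network_threads: int) -> float:
    """Quota (in per-thread percent) for a given fraction of a combined pool.

    Assumes a single quota covering I/O and network threads, each thread
    contributing 100%. Hypothetical helper for illustration only.
    """
    total_budget = (num_io_threads + num_network_threads) * 100
    return fraction * total_budget


# 10 I/O threads + 5 network threads -> 1500% total; half of that is 750%.
print(pooled_quota(0.5, 10, 5))
```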

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-02 Thread Jun Rao
Hi, Rajini, Consider modeling as n * 100% unit. For 2), the question is what's causing the I/O threads to be saturated. It's unlikely that all users' utilization has increased at the same time. A more likely case is that a few isolated users' utilization has increased. If so, after increasing the

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-02 Thread Rajini Sivaram
Jun, If we use request.percentage as the percentage used in a single I/O thread, the total percentage being allocated will be num.io.threads * 100 for I/O threads and num.network.threads * 100 for network threads. A single quota covering the two as a percentage wouldn't quite work if you want to

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-02 Thread Jun Rao
Another way to express an absolute limit is to use request.percentage, but treat it as the percentage used in a single request handling thread. For now, the request handling threads can be just the io threads. In the future, they can cover the network threads as well. This is similar to how top

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-01 Thread Colin McCabe
That makes sense. I didn't see that this field already existed in some of the replies-- good clarification. best, On Wed, Mar 1, 2017, at 05:41, Rajini Sivaram wrote: > Colin, > > Thank you for the feedback. Since we are reusing the existing > throttle_time_ms field for produce/fetch

Re: [DISCUSS] KIP-124: Request rate quotas

2017-03-01 Thread Rajini Sivaram
Colin, Thank you for the feedback. Since we are reusing the existing throttle_time_ms field for produce/fetch responses, changing this to microseconds would be a breaking change. Since we don't currently plan to throttle at sub-millisecond intervals, perhaps it makes sense to keep the value

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-28 Thread Colin McCabe
I noticed that the throttle_time_ms added to all the message responses is in milliseconds. Does it make sense to express this in microseconds in case we start doing more fine-grained CPU throttling later on? An int32 should still be more than enough if using microseconds. best, Colin On Fri,
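
To put the int32 remark in concrete terms, the arithmetic below shows the largest throttle time a signed 32-bit field can carry in each unit. The variable names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer

# Interpreted as microseconds: about 2147 seconds, roughly 35.8 minutes.
max_seconds_if_micros = INT32_MAX / 1_000_000
max_minutes_if_micros = max_seconds_if_micros / 60

# Interpreted as milliseconds: about 24.9 days.
max_days_if_millis = INT32_MAX / 1_000 / 86_400

print(max_seconds_if_micros, max_minutes_if_micros, max_days_if_millis)
```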

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-24 Thread Jun Rao
Hi, Jay, 2. Regarding request.unit vs request.percentage. I started with request.percentage too. The reasoning for request.unit is the following. Suppose that the capacity has been reached on a broker and the admin needs to add a new user. A simple way to increase the capacity is to increase the

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-24 Thread Rajini Sivaram
Thanks, Jay. *(1)* The rename from *request.time*.percent to *io.thread*.units for the quota configuration was based on the change from percent to thread-units, since we will need different quota configuration for I/O threads and network threads if we use units. If we agree that *(2)* percent (or

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-24 Thread Jay Kreps
A couple of quick points: 1. Even though the implementation of this quota is only using io thread time, i think we should call it something like "request-time". This will give us flexibility to improve the implementation to cover network threads in the future and will avoid exposing internal

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-24 Thread Rajini Sivaram
I have updated the KIP based on the discussions so far. Regards, Rajini On Thu, Feb 23, 2017 at 11:29 PM, Rajini Sivaram wrote: > Thank you all for the feedback. > > Ismael #1. It makes sense not to throttle inter-broker requests like > LeaderAndIsr etc. The simplest

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Rajini Sivaram
Thank you all for the feedback. Ismael #1. It makes sense not to throttle inter-broker requests like LeaderAndIsr etc. The simplest way to ensure that clients cannot use these requests to bypass quotas for DoS attacks is to ensure that ACLs prevent clients from using these requests and

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread radai
@jun: i wasnt concerned about tying up a request processing thread, but IIUC the code does still read the entire request out, which might add-up to a non-negligible amount of memory. On Thu, Feb 23, 2017 at 11:55 AM, Dong Lin wrote: > Hey Rajini, > > The current KIP says

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Dong Lin
Hey Rajini, The current KIP says that the maximum delay will be reduced to window size if it is larger than the window size. I have a concern with this: 1) This essentially means that the user is allowed to exceed their quota over a long period of time. Can you provide an upper bound on this
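
The concern raised here can be sketched as follows. This is only an illustration of capping a delay at the window size; how the broker computes the uncapped delay is not shown in this message, so the input values are assumed.

```python
def capped_delay_ms(computed_delay_ms: float, window_ms: float) -> float:
    """Cap the throttle delay at the quota window size.

    Hypothetical helper: mirrors the rule quoted above (a delay larger than
    the window is reduced to the window size), not actual broker code.
    """
    return min(computed_delay_ms, window_ms)


# Assumed numbers: the quota violation calls for a 5000 ms delay,
# but the window is 1000 ms, so 4000 ms of delay is never applied.
computed, window = 5000, 1000
applied = capped_delay_ms(computed, window)
print(applied, computed - applied)
```

The unapplied remainder is the room a client has to exceed its quota, which is what an upper bound would need to account for.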

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Dong Lin
Hey Jun, Yeah you are right. I thought it wasn't because at LinkedIn it will be too much pressure on inGraph to expose those per-clientId metrics so we ended up printing them periodically to local log. Never mind if it is not a general problem. Hey Rajini, - I agree with Jay that we probably

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Jun Rao
Hi, Ismael, For #3, typically, an admin won't configure more io threads than CPU cores, but it's possible for an admin to start with fewer io threads than cores and grow that later on. Hi, Dong, I think the throttleTime sensor on the broker tells the admin whether a user/clientId is throttled or

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Ismael Juma
Hi Jay, Regarding 1, I definitely like the simplicity of keeping a single throttle time field in the response. The downside is that the client metrics will be more coarse grained. Regarding 3, we have `leader.imbalance.per.broker.percentage` and `log.cleaner.min.cleanable.ratio`. Ismael On

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Jay Kreps
A few minor comments: 1. Isn't it the case that the throttling time response field should have the total time your request was throttled irrespective of the quotas that caused that? Limiting it to byte rate quota doesn't make sense, but I also don't think we want to end up adding

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread radai
i dont think time/cpu% are easy to reason about. most user-facing quota systems i know (especially the commercial ones) focus on things users understand better - iops and bytes. as for quotas and "overhead" requests like heartbeats - on the one hand subjecting them to the quota may cause clients

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Ismael Juma
Thanks for the KIP, Rajini. This is a welcome improvement and the KIP page covers it well. A few comments: 1. Can you expand a bit on the motivation for throttling requests that fail authorization for ClusterAction? Under what scenarios would this help? 2. I think we should rename

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-23 Thread Rajini Sivaram
Guozhang/Dong, Thank you for the feedback. Guozhang : I have updated the section on co-existence of byte rate and request time quotas. Dong: I hadn't added much detail to the metrics and sensors since they are going to be very similar to the existing metrics and sensors. To avoid confusion, I

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Dong Lin
Hey Rajini, I think it makes a lot of sense to use io_thread_units as metric to quota user's traffic here. LGTM overall. I have some questions regarding sensors. - Can you be more specific in the KIP what sensors will be added? For example, it will be useful to specify the name and attributes of

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Guozhang Wang
Made a pass over the doc, overall LGTM except a minor comment on the throttling implementation: Stated as "Request processing time throttling will be applied on top if necessary." I thought that it meant the request processing time throttling is applied first, but continue reading I found it

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Jun Rao
Hi, Rajini, Thanks for the updated KIP. The latest proposal looks good to me. Jun On Wed, Feb 22, 2017 at 2:19 PM, Rajini Sivaram wrote: > Jun/Roger, > > Thank you for the feedback. > > 1. I have updated the KIP to use absolute units instead of percentage. The >

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Rajini Sivaram
Jun/Roger, Thank you for the feedback. 1. I have updated the KIP to use absolute units instead of percentage. The property is called *io_thread_units* to align with the thread count property *num.io.threads*. When we implement network thread utilization quotas, we can add another property

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Roger Hoover
Great to see this KIP and the excellent discussion. To me, Jun's suggestion makes sense. If my application is allocated 1 request handler unit, then it's as if I have a Kafka broker with a single request handler thread dedicated to me. That's the most I can use, at least. That allocation

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Jun Rao
Hi, Rajini, Thanks for the updated KIP. A few more comments. 1. A concern of request_time_percent is that it's not an absolute value. Let's say you give a user a 10% limit. If the admin doubles the number of request handler threads, that user now actually has twice the absolute capacity. This
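
The point about request_time_percent not being absolute can be shown with a small calculation. The helper name and thread counts below are assumptions for illustration.

```python
def thread_equivalents(percent_of_pool: float, num_handler_threads: int) -> float:
    """Absolute capacity, in whole-thread equivalents, behind a pool percentage.

    Assumes the limit is a percentage of the entire request handler pool.
    Hypothetical helper, not part of Kafka.
    """
    return percent_of_pool * num_handler_threads / 100


before = thread_equivalents(10, 8)   # 10% of 8 threads
after = thread_equivalents(10, 16)   # same 10% after doubling the pool
print(before, after, after / before)
```

The configured 10% never changes, yet the user's absolute capacity doubles along with the thread count.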

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-22 Thread Rajini Sivaram
Jun, Thank you for the review. I have reverted to the original KIP that throttles based on request handler utilization. At the moment, it uses percentage, but I am happy to change to a fraction (out of 1 instead of 100) if required. I have added the examples from this discussion to the KIP. Also

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-21 Thread Jun Rao
Hi, Rajini, Thanks for the proposal. The benefit of using the request processing time over the request rate is exactly what people have said. I will just expand that a bit. Consider the following case. The producer sends a produce request with a 10MB message but compressed to 100KB with gzip.

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-21 Thread Rajini Sivaram
Thank you all for the feedback. Jay: I have removed exemption for consumer heartbeat etc. Agree that protecting the cluster is more important than protecting individual apps. Have retained the exemption for StopReplica/LeaderAndIsr etc, these are throttled only if authorization fails (so can't

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Becket Qin
Hey Jay, Yeah, I agree that enforcing the CPU time is a little tricky. I am thinking that maybe we can use the existing request statistics. They are already very detailed so we can probably see the approximate CPU time from it, e.g. something like (total_time - request/response_queue_time -

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Guozhang Wang
This is a great proposal, glad to see it happening. I am inclined to the CPU throttling, or more specifically processing time ratio instead of the request rate throttling as well. Becket has very well summed my rationales above, and one thing to add here is that the former has a good support for

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Jay Kreps
Hey Becket/Rajini, When I thought about it more deeply I came around to the "percent of processing time" metric too. It seems a lot closer to the thing we actually care about and need to protect. I also think this would be a very useful metric even in the absence of throttling just to debug whose

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Becket Qin
If the purpose of the KIP is only to protect the cluster from being overwhelmed by crazy clients and is not intended to address resource allocation problem among the clients, I am wondering if using request handling time quota (CPU time quota) is a better option. Here are the reasons: 1. request

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Jay Kreps
I think this proposal makes a lot of sense (especially now that it is oriented around request rate) and fills the biggest remaining gap in the multi-tenancy story. I think for intra-cluster communication (StopReplica, etc) we could avoid throttling entirely. You can secure or otherwise lock-down

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Rajini Sivaram
I have updated the KIP to use request rates instead of request processing time, and I have removed all requests that require ClusterAction permission (LeaderAndIsr and UpdateMetadata as well, in addition to stop/shutdown). But I have left Metadata request in. Quota windows which limit the maximum delay

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Dong Lin
Hey Rajini, Thanks for the explanation. I have some follow-up questions regarding the types of requests that will be covered by this quota. Since this KIP focuses only on throttling the traffic between client and broker and client never sends LeaderAndIsrRequest to broker, should we exclude

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-20 Thread Rajini Sivaram
Dong, Onur & Becket, Thank you all for the very useful feedback. The choice of request handling time as opposed to request rate was based on the observation in KAFKA-4195 that request rates may be less intuitive to configure than percentage

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-19 Thread Becket Qin
Thanks for the KIP, Rajini, If I understand correctly the proposal was essentially trying to quota the CPU usage (that is probably why time slice is used instead of request rate) while the existing quota we have is for network bandwidth. Given we are trying to throttle both CPU and Network, that

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-18 Thread Dong Lin
I realized the main concern with this proposal is how users can interpret this CPU-percentage based quota. Since this quota is exposed to users, we need to explain to them how this quota is going to impact their application performance and convince them that the quota is not too low for their

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-17 Thread Onur Karaman
Overall a big fan of the KIP. I'd have to agree with Dong. I'm not sure about the decision of using the percentage over the window as opposed to request rate. It's pretty hard to reason about. I just spoke to one of our SREs and he agrees. Also I may have missed it, but I couldn't find

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-17 Thread Dong Lin
To correct the typo above: It seems to me that determination of request rate is not any more difficult than determination of *byte* rate as both metrics are commonly used to measure performance and provide guarantee to user. On Fri, Feb 17, 2017 at 9:40 AM, Dong Lin wrote:

Re: [DISCUSS] KIP-124: Request rate quotas

2017-02-17 Thread Dong Lin
Hey Rajini, Thanks for the KIP. I have some questions: - I am wondering why throttling based on request rate is listed as a rejected alternative. Can you provide more specific reason why it is difficult for administrators to decide request rates to allocate? It seems to me that determination of

[DISCUSS] KIP-124: Request rate quotas

2017-02-17 Thread Rajini Sivaram
Hi all, I have just created KIP-124 to introduce request rate quotas to Kafka: https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas The proposal is for a simple percentage request handling time quota that can be allocated to *<client-id>*, *<user>* or *<user, client-id>*. There are a