Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Stanislav/Colin,

A couple of people ran into issues with auto.leader.rebalance.enable=true in https://issues.apache.org/jira/browse/KAFKA-4084, and I think KIP-491 can help solve that issue. We have implemented KIP-491 internally, together with another feature called "latest offset" for quickly bringing up a failed empty node, and found it quite useful. Could you take a look at the comments in the ticket, re-evaluate, and provide your feedback?

Thanks,
George

On Tuesday, September 17, 2019, 07:56:52 AM PDT, Stanislav Kozlovski wrote:

Hey Harsha,

> If we want to go with making this an option and providing a tool which abstracts moving the broker to the end of the preferred leader list, it needs to do it for all the partitions that broker is leader for. As said in the above comment, for a broker that is leader for 1000 partitions we have to do this for all the partitions. Instead, having a blacklist will help simplify this process and we can provide monitoring/alerts on such a list.

Sorry, I thought that part of the reasoning for not using reassignment was to optimize the process.

> Do you mind shedding some light on what issue you are proposing a KIP for?

The issue I was talking about is the one I quoted in my previous reply. I understand that you want to have a way of running a "shallow" replica of sorts - one that is lacking the historical data but has (and continues to replicate) the latest data. That is the goal of setting the last offsets for all partitions in replication-offset-checkpoint, right?

Thanks,
Stanislav

On Mon, Sep 16, 2019 at 3:39 PM Satish Duggana wrote:

> Hi George,
> Thanks for explaining the use case for a topic-level preferred leader blacklist. As I mentioned earlier, I am fine with a broker-level config for now.
>
> ~Satish.
>
> On Sat, Sep 7, 2019 at 12:29 AM George Li wrote:
> >
> > Hi,
> >
> > Just want to ping and bubble up the discussion of KIP-491.
> >
> > At a large scale, with thousands of brokers across many Kafka clusters, frequent hardware failures are common. Although reassignment to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and it is hard to scale.
> >
> > I am wondering whether others using Kafka at a big scale are running into the same problem.
> >
> >
> > Satish,
> >
> > Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), when this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
> >
> > I will update KIP-491 later for this use case of a topic-level config for the preferred leader blacklist.
> >
> >
> > Thanks,
> > George
> >
> > On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <sql_consult...@yahoo.com> wrote:
> >
> > Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> >
> > Let me explain in detail this particular use case, to compare apples to apples.
> >
> > Let's say a healthy broker hosts 3000 partitions, of which 1000 are preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, the ReplicaFetcher would pull historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
> >
> > We need this broker to not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
> >
> >
> > * The
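[Editor's note: the replication-offset-checkpoint pre-seeding described above can be sketched as below. This is a minimal illustration, not Kafka's internal code; it assumes the standard checkpoint file layout (a version line of "0", a line with the entry count, then one "topic partition offset" line per partition), and the offsets map is hypothetical.]

```python
# Sketch: pre-seed a broker's replication-offset-checkpoint so the restarted
# empty broker starts fetching from each partition's latest offset instead of
# replicating historical data from offset 0.

def write_offset_checkpoint(path, latest_offsets):
    """latest_offsets: dict mapping (topic, partition) -> last offset.

    Writes the assumed checkpoint layout: version "0", entry count, then
    one "topic partition offset" line per entry.
    """
    lines = ["0", str(len(latest_offsets))]
    for (topic, partition), offset in sorted(latest_offsets.items()):
        lines.append(f"{topic} {partition} {offset}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    # Hypothetical last offsets, gathered beforehand from the live leaders.
    offsets = {("payments", 0): 123456, ("payments", 1): 98765}
    write_offset_checkpoint("/tmp/replication-offset-checkpoint", offsets)
    print(open("/tmp/replication-offset-checkpoint").read())
```

The file would be written into each log directory while the broker is stopped, before starting Kafka.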
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hey Harsha,

> If we want to go with making this an option and providing a tool which abstracts moving the broker to the end of the preferred leader list, it needs to do it for all the partitions that broker is leader for. As said in the above comment, for a broker that is leader for 1000 partitions we have to do this for all the partitions. Instead, having a blacklist will help simplify this process and we can provide monitoring/alerts on such a list.

Sorry, I thought that part of the reasoning for not using reassignment was to optimize the process.

> Do you mind shedding some light on what issue you are proposing a KIP for?

The issue I was talking about is the one I quoted in my previous reply. I understand that you want to have a way of running a "shallow" replica of sorts - one that is lacking the historical data but has (and continues to replicate) the latest data. That is the goal of setting the last offsets for all partitions in replication-offset-checkpoint, right?

Thanks,
Stanislav

On Mon, Sep 16, 2019 at 3:39 PM Satish Duggana wrote:

> Hi George,
> Thanks for explaining the use case for a topic-level preferred leader blacklist. As I mentioned earlier, I am fine with a broker-level config for now.
>
> ~Satish.
>
> On Sat, Sep 7, 2019 at 12:29 AM George Li wrote:
> >
> > Hi,
> >
> > Just want to ping and bubble up the discussion of KIP-491.
> >
> > At a large scale, with thousands of brokers across many Kafka clusters, frequent hardware failures are common. Although reassignment to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and it is hard to scale.
> >
> > I am wondering whether others using Kafka at a big scale are running into the same problem.
> >
> >
> > Satish,
> >
> > Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), when this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
> >
> > I will update KIP-491 later for this use case of a topic-level config for the preferred leader blacklist.
> >
> >
> > Thanks,
> > George
> >
> > On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <sql_consult...@yahoo.com> wrote:
> >
> > Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> >
> > Let me explain in detail this particular use case, to compare apples to apples.
> >
> > Let's say a healthy broker hosts 3000 partitions, of which 1000 are preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, the ReplicaFetcher would pull historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
> >
> > We need this broker to not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
> >
> >
> > * The traditional way, using reassignments to move this broker to the end of the assignment for the 1000 partitions where it's the preferred leader, is an O(N) operation. From my experience, we can't submit all 1000 at the same time, otherwise we cause higher latencies, even though the reassignment in this case can complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore the preferred leaders of those 1000 partitions for this broker: another O(N) operation. Then run preferred leader
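[Editor's note: the "traditional" reassignment step above, moving a broker to the end of each replica list, can be sketched as follows. The JSON shape matches what kafka-reassign-partitions.sh accepts, but the current-assignment data here is made up for illustration; a real tool would fetch it from the cluster.]

```python
import json

# Sketch: build a kafka-reassign-partitions.sh JSON plan that demotes a
# broker by moving it to the end of every replica list where it currently
# sits first (i.e. where it is the preferred leader).

def demote_broker(assignments, broker_id):
    """assignments: dict mapping (topic, partition) -> ordered replica list."""
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if replicas and replicas[0] == broker_id:
            reordered = [r for r in replicas if r != broker_id] + [broker_id]
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": reordered})
    return {"version": 1, "partitions": partitions}

if __name__ == "__main__":
    # Hypothetical current assignment; broker 101 is preferred leader of t1-0.
    current = {("t1", 0): [101, 102, 103], ("t1", 1): [102, 101, 103]}
    print(json.dumps(demote_broker(current, 101), indent=2))
```

Note this covers only one of the three O(N) steps: the inverse plan (restoring the old ordering) and the preferred leader election still have to be generated and run separately, which is the overhead the blacklist proposal avoids.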
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi George,

Thanks for explaining the use case for a topic-level preferred leader blacklist. As I mentioned earlier, I am fine with a broker-level config for now.

~Satish.

On Sat, Sep 7, 2019 at 12:29 AM George Li wrote:

> Hi,
>
> Just want to ping and bubble up the discussion of KIP-491.
>
> At a large scale, with thousands of brokers across many Kafka clusters, frequent hardware failures are common. Although reassignment to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and it is hard to scale.
>
> I am wondering whether others using Kafka at a big scale are running into the same problem.
>
>
> Satish,
>
> Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), when this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
>
> I will update KIP-491 later for this use case of a topic-level config for the preferred leader blacklist.
>
>
> Thanks,
> George
>
> On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:
>
> Hi Colin,
>
> > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
>
> Let me explain in detail this particular use case, to compare apples to apples.
>
> Let's say a healthy broker hosts 3000 partitions, of which 1000 are preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
>
> We need this broker to not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
>
>
> * The traditional way, using reassignments to move this broker to the end of the assignment for the 1000 partitions where it's the preferred leader, is an O(N) operation. From my experience, we can't submit all 1000 at the same time, otherwise we cause higher latencies, even though the reassignment in this case can complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore the preferred leaders of those 1000 partitions for this broker: another O(N) operation. Then run preferred leader election: O(N) again. So 3 x O(N) operations in total. The point is that since the new empty broker is expected to be the same as the old one in terms of hosted partitions/leaders, it seems unnecessary to do reassignments (reordering of replicas) during the broker's catch-up time.
>
>
> * The new preferred leader "blacklist" feature: just put in a dynamic config indicating that this broker should be considered for leadership (preferred leader election, broker failover, or unclean leader election) at the lowest priority. No need to run any reassignments. After a few hours/days, when this broker is ready, remove the dynamic config and run preferred leader election; this broker will again serve traffic for the 1000 partitions where it was originally the preferred leader. So 1 x O(N) operation in total.
>
> If auto.leader.rebalance.enable is enabled, the preferred leader "blacklist" can be put in place before Kafka is started, to prevent this broker from serving traffic. In the traditional way of running reassignments, once the broker is up, with auto.leader.rebalance.enable, if
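[Editor's note: the deprioritization semantics described above can be modeled with a small sketch of leader selection. This is not the controller's actual code, only an illustration of the proposed behavior: pick the first in-sync replica that is not blacklisted, and fall back to a blacklisted replica only when there is no other choice.]

```python
def select_leader(replicas, isr, blacklist):
    """Pick a leader from `replicas` (in preference order), skipping
    blacklisted brokers. A blacklisted but in-sync replica is chosen
    only as a last resort, which is still better than no leader at all."""
    for r in replicas:
        if r in isr and r not in blacklist:
            return r
    for r in replicas:
        if r in isr:
            return r
    return None  # no in-sync replica available; partition goes offline

if __name__ == "__main__":
    # Broker 101 is the preferred leader but blacklisted, so 102 is chosen;
    # no reassignment (replica reordering) was needed.
    print(select_leader([101, 102, 103], isr={101, 102, 103}, blacklist={101}))
```

Removing the broker from the blacklist and running preferred leader election would then restore 101 as leader, again without touching the replica ordering.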
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Stanislav,

Thanks for the comments. The proposal we are making is not about optimizing Big-O, but about providing a simpler way of stopping a broker from becoming leader.

If we want to go with making this an option and providing a tool which abstracts moving the broker to the end of the preferred leader list, it needs to do it for all the partitions that broker is leader for. As said in the above comment, for a broker that is leader for 1000 partitions we have to do this for all the partitions. Instead, having a blacklist will help simplify this process and we can provide monitoring/alerts on such a list.

"This sounds like a bit of a hack. If that is the concern, why not propose a KIP that addresses the specific issue?"

Do you mind shedding some light on what issue you are proposing a KIP for? Replication is a challenge when we are bringing up a new node. If you have a retention period of 3 days, there is honestly no way to do it via online replication without taking a hit on latency SLAs. Is your ask to find a way to fix the replication itself when we are bringing up a new broker from no data?

"Having a blacklist you control still seems like a workaround given that Kafka itself knows when the topic retention would allow you to switch that replica to a leader"

Not sure how having a single ZK path with a list of brokers makes anything complicated.

Thanks,
Harsha

On Mon, Sep 09, 2019 at 3:55 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> I agree with Colin that the same result should be achievable through proper abstraction in a tool. Even if that might be "4xO(N)" operations, that is still not a lot - it is still classified as O(N).
>
> > Let's say a healthy broker hosting 3000 partitions, of which 1000 are the preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashed. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to have the current last offsets of all partitions set in the replication-offset-checkpoint (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause high cluster latency and other instabilities), so when Kafka is brought up, it is quickly catching up as a follower in the ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader. We need to make this broker not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
>
> This sounds like a bit of a hack. If that is the concern, why not propose a KIP that addresses the specific issue? Having a blacklist you control still seems like a workaround, given that Kafka itself knows when the topic retention would allow you to switch that replica to a leader.
>
> I really hope we can come up with a solution that avoids complicating the controller and state machine logic further. Could you please list out the main drawbacks of abstracting this away in the reassignments tool (or a new tool)?
>
> On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe <cmccabe@apache.org> wrote:
>
> > On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> >
> > > Hi Colin,
> > > Can you give us more details on why you don't want this to be part of Kafka core? You are proposing KIP-500, which will take away ZooKeeper, and writing this interim tool to change the ZooKeeper metadata doesn't make sense to me.
> >
> > Hi Harsha,
> >
> > The reassignment API described in KIP-455, which will be part of Kafka 2.4, doesn't rely on ZooKeeper. This API will stay the same after KIP-500 is implemented.
> >
> > > As George pointed out, there are several benefits to having it in the system itself, instead of asking users to hack a bunch of JSON files to deal with an outage scenario.
> >
> > In both cases, the user just has to run a shell command, right? In both cases, the user has to remember to undo the command later when they want the broker to be treated normally again. And in both cases, the user should probably be running an external rebalancing tool to avoid having to run these commands manually. :)
> >
> > best,
> > Colin
> >
> > > Thanks,
> > > Harsha
> > >
> > > On Fri, Sep 6, 2019 at 4:36 PM George Li <sql_consulting@yahoo.com.invalid> wrote:
> > >
> > > Hi Colin, Thanks for the feedback. The "separate set of
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
I agree with Colin that the same result should be achievable through proper abstraction in a tool. Even if that might be "4xO(N)" operations, that is still not a lot - it is still classified as O(N).

> Let's say a healthy broker hosting 3000 partitions, of which 1000 are the preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashed. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to have the current last offsets of all partitions set in the replication-offset-checkpoint (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause high cluster latency and other instabilities), so when Kafka is brought up, it is quickly catching up as a follower in the ISR. Note, we have auto.leader.rebalance.enable disabled, so it's not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
>
> We need to make this broker not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.

This sounds like a bit of a hack. If that is the concern, why not propose a KIP that addresses the specific issue? Having a blacklist you control still seems like a workaround, given that Kafka itself knows when the topic retention would allow you to switch that replica to a leader.

I really hope we can come up with a solution that avoids complicating the controller and state machine logic further. Could you please list out the main drawbacks of abstracting this away in the reassignments tool (or a new tool)?

On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe wrote:

> On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> > Hi Colin,
> > Can you give us more details on why you don't want this to be part of Kafka core? You are proposing KIP-500, which will take away ZooKeeper, and writing this interim tool to change the ZooKeeper metadata doesn't make sense to me.
>
> Hi Harsha,
>
> The reassignment API described in KIP-455, which will be part of Kafka 2.4, doesn't rely on ZooKeeper. This API will stay the same after KIP-500 is implemented.
>
> > As George pointed out, there are several benefits to having it in the system itself, instead of asking users to hack a bunch of JSON files to deal with an outage scenario.
>
> In both cases, the user just has to run a shell command, right? In both cases, the user has to remember to undo the command later when they want the broker to be treated normally again. And in both cases, the user should probably be running an external rebalancing tool to avoid having to run these commands manually. :)
>
> best,
> Colin
>
> > Thanks,
> > Harsha
> >
> > On Fri, Sep 6, 2019 at 4:36 PM George Li wrote:
> >
> > > Hi Colin,
> > >
> > > Thanks for the feedback. The "separate set of metadata about blacklists" in KIP-491 is just a list of broker ids, usually only one or two in the cluster. That should be easier than keeping JSON files. E.g., what if we first blacklist broker_id_1, then another broker, broker_id_2, has issues, and we need to write out another JSON file to restore later (and in which order)? With a blacklist, we can just add broker_id_2 to the existing list, and remove whichever broker_id returns to a good state, without worrying about how to restore (i.e., the ordering of putting brokers into the blacklist).
> > >
> > > For the topic-level config, the blacklist will be tied to the topic/partition (e.g. Configs: topic.preferred.leader.blacklist=0:101,102;1:103 where 0 & 1 are the partition numbers and 101, 102, 103 are the blacklisted broker ids), and it is easier to update/remove, with no need for external JSON files.
> > >
> > > Thanks,
> > > George
> > >
> > > On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API. Then it could write out a JSON file containing the old priorities, which could be restored when (or if) we needed to do so. This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > > > Hi,
> > > >
> > > > Just want to ping and bubble up the discussion of KIP-491.
> > > >
> > > > At a large scale, with thousands of brokers across many Kafka clusters, frequent hardware failures are common. Although reassignment to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the proposed preferred leader
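[Editor's note: the topic-level config format proposed in the thread, topic.preferred.leader.blacklist=0:101,102;1:103, could be parsed along these lines. A hedged sketch: the config key and its value format are only the KIP's proposal, not an existing Kafka topic config.]

```python
def parse_leader_blacklist(value):
    """Parse the proposed value format '0:101,102;1:103' into a dict
    mapping partition number -> set of blacklisted broker ids."""
    result = {}
    for entry in value.split(";"):
        if not entry:
            continue
        partition, brokers = entry.split(":")
        result[int(partition)] = {int(b) for b in brokers.split(",")}
    return result

if __name__ == "__main__":
    # Partition 0 deprioritizes brokers 101 and 102; partition 1 deprioritizes 103.
    print(parse_leader_blacklist("0:101,102;1:103"))
```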
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> Hi Colin,
> Can you give us more details on why you don't want this to be part of Kafka core? You are proposing KIP-500, which will take away ZooKeeper, and writing this interim tool to change the ZooKeeper metadata doesn't make sense to me.

Hi Harsha,

The reassignment API described in KIP-455, which will be part of Kafka 2.4, doesn't rely on ZooKeeper. This API will stay the same after KIP-500 is implemented.

> As George pointed out, there are several benefits to having it in the system itself, instead of asking users to hack a bunch of JSON files to deal with an outage scenario.

In both cases, the user just has to run a shell command, right? In both cases, the user has to remember to undo the command later when they want the broker to be treated normally again. And in both cases, the user should probably be running an external rebalancing tool to avoid having to run these commands manually. :)

best,
Colin

> Thanks,
> Harsha
>
> On Fri, Sep 6, 2019 at 4:36 PM George Li wrote:
>
> > Hi Colin,
> >
> > Thanks for the feedback. The "separate set of metadata about blacklists" in KIP-491 is just a list of broker ids, usually only one or two in the cluster. That should be easier than keeping JSON files. E.g., what if we first blacklist broker_id_1, then another broker, broker_id_2, has issues, and we need to write out another JSON file to restore later (and in which order)? With a blacklist, we can just add broker_id_2 to the existing list, and remove whichever broker_id returns to a good state, without worrying about how to restore (i.e., the ordering of putting brokers into the blacklist).
> >
> > For the topic-level config, the blacklist will be tied to the topic/partition (e.g. Configs: topic.preferred.leader.blacklist=0:101,102;1:103 where 0 & 1 are the partition numbers and 101, 102, 103 are the blacklisted broker ids), and it is easier to update/remove, with no need for external JSON files.
> >
> > Thanks,
> > George
> >
> > On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <cmcc...@apache.org> wrote:
> >
> > One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API. Then it could write out a JSON file containing the old priorities, which could be restored when (or if) we needed to do so. This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.
> >
> > best,
> > Colin
> >
> > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > > Hi,
> > >
> > > Just want to ping and bubble up the discussion of KIP-491.
> > >
> > > At a large scale, with thousands of brokers across many Kafka clusters, frequent hardware failures are common. Although reassignment to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and it is hard to scale.
> > >
> > > I am wondering whether others using Kafka at a big scale are running into the same problem.
> > >
> > > Satish,
> > >
> > > Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), when this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
> > >
> > > I will update KIP-491 later for this use case of a topic-level config for the preferred leader blacklist.
> > >
> > > Thanks,
> > > George
> > >
> > > On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:
> > >
> > > Hi Colin,
> > >
> > > > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> > >
> > > Let me explain in detail this particular use case, to compare apples to apples.
> > >
> > > Let's say a healthy broker hosts 3000 partitions, of which 1000 are preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We
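[Editor's note: Colin's tool-based alternative, snapshotting the old replica priorities to JSON before demoting a broker and restoring them later, could look roughly like this sketch. The file layout mirrors the reassignment JSON; the in-memory assignment source and file path are assumptions, since a real tool would read the current assignment from the cluster via the KIP-455 API.]

```python
import json

# Sketch of the save/restore flow for the proposed command-line tool:
# snapshot the current replica ordering before demoting a broker, so the
# original priorities can be restored afterwards.

def save_priorities(assignments, path):
    """assignments: dict mapping (topic, partition) -> ordered replica list."""
    data = {"version": 1,
            "partitions": [{"topic": t, "partition": p, "replicas": r}
                           for (t, p), r in sorted(assignments.items())]}
    with open(path, "w") as f:
        json.dump(data, f)

def load_priorities(path):
    """Read a snapshot back into the same dict shape, ready to be submitted
    as a reassignment that restores the original preferred leaders."""
    with open(path) as f:
        data = json.load(f)
    return {(e["topic"], e["partition"]): e["replicas"]
            for e in data["partitions"]}

if __name__ == "__main__":
    current = {("t1", 0): [101, 102, 103]}
    save_priorities(current, "/tmp/old-priorities.json")
    print(load_priorities("/tmp/old-priorities.json") == current)
```

George's counterpoint in the thread is that with several overlapping demotions, these snapshot files must be restored in the right order, whereas a blacklist is just membership in a set.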
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Colin, Can you give us more details on why you don't want this to be part of the Kafka core. You are proposing KIP-500 which will take away zookeeper and writing this interim tools to change the zookeeper metadata doesn't make sense to me. As George pointed out there are several benefits having it in the system itself instead of asking users to hack bunch of json files to deal with outage scenario. Thanks, Harsha On Fri, Sep 6, 2019 at 4:36 PM George Li wrote: > Hi Colin, > > Thanks for the feedback. The "separate set of metadata about blacklists" > in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple in > the cluster. Should be easier than keeping json files? e.g. what if we > first blacklist broker_id_1, then another broker_id_2 has issues, and we > need to write out another json file to restore later (and in which order)? > Using blacklist, we can just add the broker_id_2 to the existing one. and > remove whatever broker_id returning to good state without worrying how(the > ordering of putting the broker to blacklist) to restore. > > For topic level config, the blacklist will be tied to > topic/partition(e.g. Configs: > topic.preferred.leader.blacklist=0:101,102;1:103where 0 & 1 is the > partition#, 101,102,103 are the blacklist broker_ids), and easier to > update/remove, no need for external json files? > > > Thanks, > George > > On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe < > cmcc...@apache.org> wrote: > > One possibility would be writing a new command-line tool that would > deprioritize a given replica using the new KIP-455 API. Then it could > write out a JSON files containing the old priorities, which could be > restored when (or if) we needed to do so. This seems like it might be > simpler and easier to maintain than a separate set of metadata about > blacklists. > > best, > Colin > > > On Fri, Sep 6, 2019, at 11:58, George Li wrote: > > Hi, > > > > Just want to ping and bubble up the discussion of KIP-491. 
> > On a large scale of Kafka clusters with thousands of brokers in many clusters, frequent hardware failures are common. Although running reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and is hard to scale.
> >
> > I am wondering whether others running Kafka at a big scale have run into the same problem.
> >
> > Satish,
> >
> > Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), once this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
> >
> > I will update KIP-491 later with this use case of a topic-level config for the preferred leader blacklist.
> >
> > Thanks,
> > George
> >
> > On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:
> >
> > Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> >
> > Let me explain in detail this particular use case, to compare apples to apples.
> >
> > Let's say a healthy broker hosts 3000 partitions, of which 1000 have it as the preferred leader (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it can cause heavy ReplicaFetcher pulls of historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note that we have auto.leader.rebalance.enable disabled, so the broker is not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
> >
> > We need to make this broker not serve traffic for a few hours or days
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Colin,

Thanks for the feedback. The "separate set of metadata about blacklists" in KIP-491 is just a list of broker ids, usually only one or two (or a handful) per cluster. That should be easier than keeping JSON files. For example, what if we first blacklist broker_id_1 and then another broker, broker_id_2, has issues? We would need to write out another JSON file to restore later (and track in which order to restore). With a blacklist, we can just add broker_id_2 to the existing list, and remove whichever broker_id has returned to a good state, without worrying about the ordering in which brokers were blacklisted.

For a topic-level config, the blacklist would be tied to the topic/partition (e.g. Configs: topic.preferred.leader.blacklist=0:101,102;1:103, where 0 and 1 are the partition numbers and 101, 102, 103 are the blacklisted broker_ids), and it is easier to update/remove, with no need for external JSON files.

Thanks,
George

On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe wrote:

One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API. Then it could write out a JSON file containing the old priorities, which could be restored when (or if) we needed to do so. This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.

best,
Colin

On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> Hi,
>
> Just want to ping and bubble up the discussion of KIP-491.
>
> On a large scale of Kafka clusters with thousands of brokers in many clusters, frequent hardware failures are common. Although running reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and is hard to scale.
>
> I am wondering whether others running Kafka at a big scale have run into the same problem.
> Satish,
>
> Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), once this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
>
> I will update KIP-491 later with this use case of a topic-level config for the preferred leader blacklist.
>
> Thanks,
> George
>
> On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:
>
> Hi Colin,
>
> > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
>
> Let me explain in detail this particular use case, to compare apples to apples.
>
> Let's say a healthy broker hosts 3000 partitions, of which 1000 have it as the preferred leader (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it can cause heavy ReplicaFetcher pulls of historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note that we have auto.leader.rebalance.enable disabled, so the broker is not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
>
> We need to make this broker not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
>
> * The traditional way, using reassignments, moves this broker to the end of the assignment for the 1000 partitions where it is the preferred leader; this is an O(N) operation, and from my experience we can't submit all 1000 at the same time, otherwise it causes higher latencies, even though the reassignments in this case complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API. Then it could write out a JSON file containing the old priorities, which could be restored when (or if) we needed to do so. This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.

best,
Colin

On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> Hi,
>
> Just want to ping and bubble up the discussion of KIP-491.
>
> On a large scale of Kafka clusters with thousands of brokers in many clusters, frequent hardware failures are common. Although running reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and is hard to scale.
>
> I am wondering whether others running Kafka at a big scale have run into the same problem.
>
> Satish,
>
> Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), once this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.
>
> I will update KIP-491 later with this use case of a topic-level config for the preferred leader blacklist.
>
> Thanks,
> George
>
> On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:
>
> Hi Colin,
>
> > In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
>
> Let me explain in detail this particular use case, to compare apples to apples.
>
> Let's say a healthy broker hosts 3000 partitions, of which 1000 have it as the preferred leader (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it can cause heavy ReplicaFetcher pulls of historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note that we have auto.leader.rebalance.enable disabled, so the broker is not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.
>
> We need to make this broker not serve traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.
>
> * The traditional way, using reassignments, moves this broker to the end of the assignment for the 1000 partitions where it is the preferred leader; this is an O(N) operation, and from my experience we can't submit all 1000 at the same time, otherwise it causes higher latencies, even though the reassignments in this case complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore the preferred leaders for those 1000 partitions: another O(N) operation. Then we run a preferred leader election, O(N) again. So in total 3 x O(N) operations. The point is that since the new empty broker is expected to be the same as the old one in terms of hosting partitions/leaders, it seems unnecessary to run reassignments (reordering replicas) while the broker is catching up.
>
> * With the new preferred leader "blacklist" feature, we just need to set a dynamic config to indicate that this broker should be given the lowest priority for leadership (in preferred leader election, broker failover, or unclean leader election). No reassignments are needed. After a few hours/days, when this broker is ready, we remove the dynamic config and run a preferred leader election, and this broker will serve traffic for the 1000 partitions it was originally the preferred leader of. So in total 1 x O(N) operation.
>
> If
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi,

Just want to ping and bubble up the discussion of KIP-491.

On a large scale of Kafka clusters with thousands of brokers in many clusters, frequent hardware failures are common. Although running reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work compared to the preferred leader blacklist proposed in KIP-491, and is hard to scale.

I am wondering whether others running Kafka at a big scale have run into the same problem.

Satish,

Regarding your previous question about whether there is a use case for a topic-level preferred leader "blacklist", I thought of one use case: to improve rebalance/reassignment. Large partitions usually cause performance/stability issues, so the plan is to have the new replica start from the leader's latest offset (this way the replica is almost instantly in the ISR and the reassignment completes), and to put this partition's new replica into the preferred leader "blacklist" in the topic-level config for that partition. After some time (the retention time), once this new replica has caught up and is ready to serve traffic, update/remove the topic config for this partition's preferred leader blacklist.

I will update KIP-491 later with this use case of a topic-level config for the preferred leader blacklist.

Thanks,
George

On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li wrote:

Hi Colin,

> In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

Let me explain in detail this particular use case, to compare apples to apples.

Let's say a healthy broker hosts 3000 partitions, of which 1000 have it as the preferred leader (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it can cause heavy ReplicaFetcher pulls of historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note that we have auto.leader.rebalance.enable disabled, so the broker is not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.

We need to keep this broker from serving traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.

* The traditional way, using reassignments, moves this broker to the end of the assignment for the 1000 partitions where it is the preferred leader; this is an O(N) operation, and from my experience we can't submit all 1000 at the same time, otherwise it causes higher latencies, even though the reassignments in this case complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore the preferred leaders for those 1000 partitions: another O(N) operation. Then we run a preferred leader election, O(N) again. So in total 3 x O(N) operations. The point is that since the new empty broker is expected to be the same as the old one in terms of hosting partitions/leaders, it seems unnecessary to run reassignments (reordering replicas) while the broker is catching up.

* With the new preferred leader "blacklist" feature, we just need to set a dynamic config to indicate that this broker should be given the lowest priority for leadership (in preferred leader election, broker failover, or unclean leader election). No reassignments are needed. After a few hours/days, when this broker is ready, we remove the dynamic config and run a preferred leader election, and this broker will serve traffic for the 1000 partitions it was originally the preferred leader of. So in total 1 x O(N) operation.

If auto.leader.rebalance.enable is enabled, the preferred leader "blacklist" can be put in place before Kafka is started, to prevent this broker from serving traffic. With the traditional way of running reassignments, once the broker is up with auto.leader.rebalance.enable, if leadership starts going to this new empty broker, we might have to run a preferred leader election after the reassignments to remove its leaderships. E.g. a (1,2,3) => (2,3,1) reassignment only changes the ordering; 1 remains the current leader, and a preferred leader election is needed to change the leader to 2 after the reassignment. So that is potentially one more O(N) operation.

I hope the above
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Colin,

> In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

Let me explain in detail this particular use case, to compare apples to apples.

Let's say a healthy broker hosts 3000 partitions, of which 1000 have it as the preferred leader (leader count is 1000). There is a hardware failure (disk/memory, etc.), and the Kafka process crashes. We swap this host with another host but keep the same broker.id. When this new broker comes up, it has no historical data, and we manage to set the current last offsets of all partitions in the replication-offset-checkpoint file (if we don't set them, it can cause heavy ReplicaFetcher pulls of historical data from other brokers, causing high cluster latency and other instabilities), so when Kafka is brought up, it quickly catches up as a follower in the ISR. Note that we have auto.leader.rebalance.enable disabled, so the broker is not serving any traffic as a leader (leader count = 0), even though there are 1000 partitions for which this broker is the preferred leader.

We need to keep this broker from serving traffic for a few hours or days, depending on the SLA of the topic retention requirement, until it has enough historical data.

* The traditional way, using reassignments, moves this broker to the end of the assignment for the 1000 partitions where it is the preferred leader; this is an O(N) operation, and from my experience we can't submit all 1000 at the same time, otherwise it causes higher latencies, even though the reassignments in this case complete almost instantly. After a few hours/days, when this broker is ready to serve traffic, we have to run reassignments again to restore the preferred leaders for those 1000 partitions: another O(N) operation. Then we run a preferred leader election, O(N) again. So in total 3 x O(N) operations. The point is that since the new empty broker is expected to be the same as the old one in terms of hosting partitions/leaders, it seems unnecessary to run reassignments (reordering replicas) while the broker is catching up.

* With the new preferred leader "blacklist" feature, we just need to set a dynamic config to indicate that this broker should be given the lowest priority for leadership (in preferred leader election, broker failover, or unclean leader election). No reassignments are needed. After a few hours/days, when this broker is ready, we remove the dynamic config and run a preferred leader election, and this broker will serve traffic for the 1000 partitions it was originally the preferred leader of. So in total 1 x O(N) operation.

If auto.leader.rebalance.enable is enabled, the preferred leader "blacklist" can be put in place before Kafka is started, to prevent this broker from serving traffic. With the traditional way of running reassignments, once the broker is up with auto.leader.rebalance.enable, if leadership starts going to this new empty broker, we might have to run a preferred leader election after the reassignments to remove its leaderships. E.g. a (1,2,3) => (2,3,1) reassignment only changes the ordering; 1 remains the current leader, and a preferred leader election is needed to change the leader to 2 after the reassignment. So that is potentially one more O(N) operation.

I hope the above example shows how easy it is to "blacklist" a broker from serving leadership. For someone managing production Kafka clusters, it's important to react fast to certain alerts and mitigate/resolve issues. Given the other use cases I listed in KIP-491, I think this feature can make Kafka easier to manage/operate.
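[Editor's note: the leader-selection rule discussed in this thread can be sketched roughly as follows. This is a hypothetical Python illustration of the proposed behavior, not actual Kafka controller code; the function name `pick_leader` and its signature are invented for this sketch. It also reflects the single-replica case mentioned later in the thread: a blacklisted broker still leads when it is the only in-sync replica.]

```python
def pick_leader(assignment, isr, deprioritized):
    """Pick a leader for one partition.

    assignment:    replica ids in preferred order, e.g. [1, 2, 3]
    isr:           set of in-sync replica ids
    deprioritized: set of broker ids on the preferred leader "blacklist"

    In-sync replicas NOT on the blacklist win first, in assignment order.
    A blacklisted broker is only chosen when it is the sole in-sync
    replica (e.g. a single-replica topic), so the partition never goes
    offline merely because its only replica is blacklisted.
    """
    in_sync = [b for b in assignment if b in isr]
    preferred = [b for b in in_sync if b not in deprioritized]
    candidates = preferred or in_sync  # fall back to blacklisted brokers
    return candidates[0] if candidates else None


# Broker 1 is the preferred leader but blacklisted: leadership goes to 2.
assert pick_leader([1, 2, 3], {1, 2, 3}, {1}) == 2
# Single-replica partition: blacklisted broker 1 still serves as leader.
assert pick_leader([1], {1}, {1}) == 1
# Blacklist removed: broker 1 is the preferred leader again.
assert pick_leader([1, 2, 3], {1, 2, 3}, set()) == 1
```

Note how removing a broker id from `deprioritized` and re-running the election restores the original leadership without any reassignment, which is the 1 x O(N) workflow described above.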
> In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having to deal with manual rebalancing. We expect more and more people who have a complex or large cluster will start using tools like this.
>
> However, if you choose to do manual rebalancing, it shouldn't be that bad. You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas). Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.

We do have our own rebalancing tool, which has its own criteria such as rack diversity, disk usage, spreading partitions/leaders across all brokers in the cluster per topic, leadership Bytes/BytesIn served per broker, etc. We can run reassignments. The point is whether that's really necessary, and whether there is a more effective, easier, safer way to do it. take another use case
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> Hi Colin,
>
> Thanks for your feedback. Comments below:
>
> > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker. So the operation is still O(N) in that sense -- you have to do something per partition.
>
> For a failed broker swapped with an empty broker: when it comes up, it will not have any leaderships, and we would like it to remain without leaderships for a couple of hours or days. So no preferred leader election, which incurs an O(N) operation, is needed in this case. Putting the broker on the preferred leader blacklist would safeguard it from serving traffic during that time; otherwise, if another broker fails (where this broker is 1st or 2nd in the assignment), or someone runs a preferred leader election, this new "empty" broker can still get leaderships.
>
> Also, running a reassignment to change the ordering of preferred leaders does not actually switch the leader automatically, e.g. (1,2,3) => (2,3,1), unless a preferred leader election is run to switch the current leader from 1 to 2. So the operation is at least 2 x O(N), and then after the broker is back to normal, another 2 x O(N) to roll back.

Hi George,

Hmm. I guess I'm still on the fence about this feature.

In your example, I think we're comparing apples and oranges. You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]." But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically." If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

> > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented. Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> >
> > I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two. So for me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP. If you are a new user (or just an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why a particular broker wasn't getting any leaderships, even though it appeared like it should. More mechanisms mean more complexity for users and developers most of the time.
>
> I would like to stress the point that running a reassignment to change the ordering of the replicas (putting a broker at the end of the partition assignment) is unnecessary, because after some time the broker is caught up and can start serving traffic, and then we need to run reassignments again to "roll back" to the previous state. As I mentioned in KIP-491, this is just tedious work.

In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having to deal with manual rebalancing. We expect more and more people who have a complex or large cluster will start using tools like this.

However, if you choose to do manual rebalancing, it shouldn't be that bad. You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas). Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.

> I agree this might introduce some complexities for users/developers. But if this feature is good, and well documented, it is good for the Kafka product/community. Just like KIP-460 enabling unclean leader election to override the topic-level/broker-level config `unclean.leader.election.enable`.
>
> > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. Right now, we don't have any way of implementing that without forking the broker. I would support a new PlacementPolicy class that would close this gap. But I don't think this KIP is flexible enough to fill this role. For example, it can't prevent users from creating new single-replica topics that get put on the "bad" replica. Perhaps we should reopen the discussion about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
>
> Creating a topic with a single replica is beyond what KIP-491 is trying to achieve. The user needs to take responsibility for doing that. I do see some Samza clients notoriously creating single-replica topics, and that got flagged by
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Colin,

Thanks for your feedback. Comments below:

> Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker. So the operation is still O(N) in that sense -- you have to do something per partition.

For a failed broker swapped with an empty broker: when it comes up, it will not have any leaderships, and we would like it to remain without leaderships for a couple of hours or days. So no preferred leader election, which incurs an O(N) operation, is needed in this case. Putting the broker on the preferred leader blacklist would safeguard it from serving traffic during that time; otherwise, if another broker fails (where this broker is 1st or 2nd in the assignment), or someone runs a preferred leader election, this new "empty" broker can still get leaderships.

Also, running a reassignment to change the ordering of preferred leaders does not actually switch the leader automatically, e.g. (1,2,3) => (2,3,1), unless a preferred leader election is run to switch the current leader from 1 to 2. So the operation is at least 2 x O(N), and then after the broker is back to normal, another 2 x O(N) to roll back.

> In general, reassignment will get a lot easier and quicker once KIP-455 is implemented. Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
>
> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two. So for me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP. If you are a new user (or just an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why a particular broker wasn't getting any leaderships, even though it appeared like it should. More mechanisms mean more complexity for users and developers most of the time.

I would like to stress the point that running a reassignment to change the ordering of the replicas (putting a broker at the end of the partition assignment) is unnecessary, because after some time the broker is caught up and can start serving traffic, and then we need to run reassignments again to "roll back" to the previous state. As I mentioned in KIP-491, this is just tedious work.

I agree this might introduce some complexities for users/developers. But if this feature is good, and well documented, it is good for the Kafka product/community. Just like KIP-460 enabling unclean leader election to override the topic-level/broker-level config `unclean.leader.election.enable`.

> I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. Right now, we don't have any way of implementing that without forking the broker. I would support a new PlacementPolicy class that would close this gap. But I don't think this KIP is flexible enough to fill this role. For example, it can't prevent users from creating new single-replica topics that get put on the "bad" replica. Perhaps we should reopen the discussion about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces

Creating a topic with a single replica is beyond what KIP-491 is trying to achieve; the user needs to take responsibility for doing that. I do see some Samza clients notoriously creating single-replica topics, and that got flagged by alerts, because a single broker being down or in maintenance will cause offline partitions. With the KIP-491 preferred leader "blacklist", a single-replica partition will still have its leader on the blacklisted broker, because there is no alternative replica to be chosen as leader.

Even with a new PlacementPolicy for topic creation/partition expansion, it would still need the blacklist info (e.g. a ZK path node, or a broker-level/topic-level config) to "blacklist" a broker from being the preferred leader. Wouldn't that be the same as what KIP-491 is introducing?

Thanks,
George

On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe wrote:

On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> Hi Colin,
>
> Thanks for looking into this KIP. Sorry for the late response; been busy.
>
> If a cluster has MANY topic partitions, moving this "blacklist" broker to the end of the replica list is still a rather "big" operation, involving submitting reassignments. The KIP-491 way of blacklisting is much simpler/easier and can be undone easily without changing the replica assignment ordering.

Hi George,

Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker. So the operation is still O(N) in that sense -- you
On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> Hi Colin,
> Thanks for looking into this KIP. Sorry for the late response, been busy.
>
> If a cluster has MANY topic partitions, moving this "blacklist" broker to the end of the replica list is still a rather "big" operation, involving submitting reassignments. The KIP-491 way of blacklisting is much simpler/easier and can be undone easily without changing the replica assignment ordering.

Hi George,

Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker. So the operation is still O(N) in that sense -- you have to do something per partition.

In general, reassignment will get a lot easier and quicker once KIP-455 is implemented. Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.

I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two. So for me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP. If you are a new user (or just an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why a particular broker wasn't getting any leaderships, even though it appeared like it should. More mechanisms mean more complexity for users and developers most of the time.

> Major use case for me: a failed broker got swapped with new hardware, and starts up empty (with the latest offset of all partitions). The SLA of retention is 1 day, so until this broker has been in-sync for 1 day, we would like to blacklist this broker from serving traffic. After 1 day, the blacklist is removed and preferred leader election is run. This way, there is no need to run reassignments before/after. This is the "temporary" use-case.

What if we just add an option to the reassignment tool to generate a plan to move all the leaders off of a specific broker? The tool could also run a leader election as well. That would be a simple way of doing this without adding new mechanisms or broker-side configurations, etc.

> There are use-cases where this preferred leader "blacklist" can be somewhat permanent, as I explained in the case of AWS data center instances vs. on-premises data center bare metal machines (heterogeneous hardware): the AWS broker_ids will be blacklisted. So new topics created, or existing topics expanded, would not make them serve traffic even if they could be the preferred leader.
>
> Please let me know if there are more questions.
>
> Thanks,
> George

I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. Right now, we don't have any way of implementing that without forking the broker. I would support a new PlacementPolicy class that would close this gap. But I don't think this KIP is flexible enough to fill this role. For example, it can't prevent users from creating new single-replica topics that get put on the "bad" replica. Perhaps we should reopen the discussion about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces

regards,
Colin
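For comparison, Colin's tool-based alternative could look roughly like the sketch below. This is a hypothetical helper, not an existing kafka-reassign-partitions option: it generates a reassignment JSON that demotes one broker to the end of every replica list it appears in, after which a preferred leader election moves leadership away. Note that it emits one entry per affected partition, which is the per-partition O(N) cost Colin mentions, and undoing it later requires generating and submitting a second plan.

```python
# Hypothetical sketch of the tool option Colin suggests; this is not an
# existing kafka-reassign-partitions.sh flag. It builds a reassignment
# plan (in the tool's JSON format) that moves one broker to the end of
# every replica list it appears in, without moving any data.

import json

def demote_broker_plan(assignments, broker_id):
    """assignments: {(topic, partition): [replica ids in preference order]}"""
    partitions = []
    for (topic, partition), replicas in sorted(assignments.items()):
        if broker_id in replicas and replicas[-1] != broker_id:
            # Keep the relative order of the other replicas; push the
            # demoted broker to the lowest-priority slot.
            reordered = [b for b in replicas if b != broker_id] + [broker_id]
            partitions.append({"topic": topic,
                               "partition": partition,
                               "replicas": reordered})
    # One plan entry per affected partition -- the O(N) cost.
    return {"version": 1, "partitions": partitions}

plan = demote_broker_plan({("t1", 0): [1005, 1006, 1007],
                           ("t1", 1): [1006, 1005, 1007]}, 1005)
print(json.dumps(plan, indent=2))
```

Rolling back to the original preference order would mean capturing the pre-demotion assignments and submitting them as a second plan, which is exactly the round trip George describes as tedious.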
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi George,

Thanks for addressing the comments. I do not have any more questions.

On Wed, Aug 7, 2019 at 11:08 AM George Li wrote:
>
> Hi Colin, Satish, Stanislav,
>
> Did I answer all your comments/concerns for KIP-491? Please let me know if you have more questions regarding this feature. I would like to start coding soon. I hope this feature can get into the open source trunk so that every time we upgrade Kafka in our environment, we don't need to cherry-pick this.
>
> BTW, I have added the below in KIP-491 for the auto.leader.rebalance.enable behavior with the new preferred leader "blacklist":
>
> "When auto.leader.rebalance.enable is enabled, the broker(s) in the preferred leader "blacklist" should be excluded from being elected leaders."
>
> Thanks,
> George
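The "excluded from being elected leaders" behavior George describes can be sketched roughly as below. This is illustrative Python, not Kafka controller code; the function name and broker ids are made up, and the real election logic is considerably more involved.

```python
# Illustrative sketch only -- not Kafka source. It shows the selection
# rule KIP-491 proposes: brokers on the deprioritized list are
# considered last, while the stored replica assignment (and its
# ordering) is left completely untouched.

def select_leader(assigned_replicas, isr, live_brokers, deprioritized):
    # Eligible leaders: assigned replicas that are in-sync and alive,
    # kept in assignment order (the normal preference order).
    candidates = [b for b in assigned_replicas
                  if b in isr and b in live_brokers]
    for broker in candidates:
        if broker not in deprioritized:
            return broker
    # Fall back to a deprioritized broker rather than going leaderless.
    return candidates[0] if candidates else None

# A single-replica partition keeps its blacklisted leader (no alternative):
assert select_leader([1005], {1005}, {1005}, {1005}) == 1005
# Otherwise leadership skips the blacklisted broker:
assert select_leader([1005, 1006, 1007], {1005, 1006, 1007},
                     {1005, 1006, 1007}, {1005}) == 1006
```

Undoing the blacklist is then just removing the broker id from the list and re-running preferred leader election; no reassignment or rollback plan is involved.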
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
We still want to give the "blacklisted" broker the leadership if nobody else is available. Therefore, isn't putting a broker on the blacklist pretty much the same as moving it to the last entry in the replicas list and then triggering a preferred leader election?

If we want this to be undone after a certain amount of time, or under certain conditions, that seems like something that would be more effectively done by an external system, rather than putting all these policies into Kafka.

best,
Colin
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Satish,

Thanks for the reviews and feedback.

> > The following is the requirements this KIP is trying to accomplish:
>
> This can be moved to the "Proposed changes" section.

Updated KIP-491.

> > The logic to determine the priority/order of which broker should be preferred leader should be modified. The broker in the preferred leader blacklist should be moved to the end (lowest priority) when determining leadership.
>
> I believe there is no change required in the ordering of the preferred replica list. Brokers in the preferred leader blacklist are skipped until other brokers in the list are unavailable.

Yes. The partition assignment remains the same, both replicas and ordering. The blacklist logic can be optimized during implementation.

> > The blacklist can be at the broker level. However, there might be use cases where a specific topic should blacklist particular brokers, which would be at the topic-level config. For the use cases of this KIP, it seems that a broker-level blacklist would suffice. A topic-level preferred leader blacklist might be future enhancement work.
>
> I agree that the broker level preferred leader blacklist would be sufficient. Do you have any use cases which require a topic level preferred blacklist?

I don't have any concrete use cases for a topic-level preferred leader blacklist. One scenario I can think of is when a broker has high CPU usage: identify the big topics (high MsgIn, high BytesIn, etc.), then try to move the leaders away from this broker. Before doing an actual reassignment to change its preferred leaders, put this preferred_leader_blacklist in the topic-level config, run a preferred leader election, and see whether CPU decreases for this broker. If yes, then do the reassignments to change the preferred leaders to be "permanent" (the topic may have many partitions, like 256, quite a few of which have this broker as preferred leader). So this topic-level config is an easy way of doing a trial and checking the result.

> You can add the below workaround as an item in the rejected alternatives section:
> "Reassigning all the topic/partitions which the intended broker is a replica for."

Updated KIP-491.

Thanks,
George

On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana wrote:

Thanks for the KIP. I have put my comments below. This is a nice improvement to avoid cumbersome maintenance.

>> The following is the requirements this KIP is trying to accomplish: The ability to add and remove the preferred leader deprioritized list/blacklist. e.g. new ZK path/node or new dynamic config.

This can be moved to the "Proposed changes" section.

>> The logic to determine the priority/order of which broker should be preferred leader should be modified. The broker in the preferred leader blacklist should be moved to the end (lowest priority) when determining leadership.

I believe there is no change required in the ordering of the preferred replica list. Brokers in the preferred leader blacklist are skipped until other brokers in the list are unavailable.

>> The blacklist can be at the broker level. However, there might be use cases where a specific topic should blacklist particular brokers, which would be at the topic-level config. For the use cases of this KIP, it seems that a broker-level blacklist would suffice. A topic-level preferred leader blacklist might be future enhancement work.

I agree that the broker level preferred leader blacklist would be sufficient. Do you have any use cases which require a topic level preferred blacklist?

You can add the below workaround as an item in the rejected alternatives section:
"Reassigning all the topic/partitions which the intended broker is a replica for."

Thanks,
Satish.

On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski wrote:
>
> Hey George,
>
> Thanks for the KIP, it's an interesting idea.
>
> I was wondering whether we could achieve the same thing via the kafka-reassign-partitions tool. As you had also said in the JIRA, it is true that this is currently very tedious with the tool. My thoughts are that we could improve the tool and give it the notion of a "blacklisted preferred leader".
> This would have some benefits like:
> - more fine-grained control over the blacklist. We may not want to blacklist all the preferred leaders, as that would make the blacklisted broker a follower of last resort, which is not very useful. In the cases of an underpowered AWS machine or a controller, you might overshoot and make the broker very underutilized if you completely make it leaderless.
> - it is not permanent. If we are to have a blacklist leaders config, rebalancing tools would also need to know about it and manipulate/respect it to achieve a fair balance.
> It seems like both problems are tied to balancing partitions, it's just that KIP-491's use case wants to balance them against other factors in a more nuanced way. It makes
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi Stanislav, Thanks for taking time to do the review and feedbacks. The Preferred Leader "Blacklist" feature is meant to be temporary in most use cases listed (I will explain a case which might need to be "permanent" below). It's a quick/easy way for the on-call engineer to take away leaderships of a problem broker and mitigate kafka cluster production issues. The reassignment/rebalance is expensive especially involving moving a replica to a different broker. Even same replicas but changing the preferred leader ordering, it will require running reassignments (batching, staggering running in Production), and when the issue is resolved (e.g. empty broker caught-up with retention time, the broker having hardware issues with poor performance is replaced, controller switched, etc.), need to run reassignments again (either rollback previous reassignments or run rebalance to generate a new plan). As you see, this reassignment approach is more tedious. If there is a Preferred Leader blacklist of a broker, it can be simply added and removed to take effect. Below are some answers to your questions. > - more fine-grained control over the blacklist. we may not want to > blacklist all the preferred leaders, as that would make the blacklisted > broker a follower of last resort which is not very useful. In the cases of > an underpowered AWS machine or a controller, you might overshoot and make > the broker very underutilized if you completely make it leaderless. The current proposed changes in KIP-491 is to have the Preferred Leader Blacklist at the broker level, as it seems that it can satisfy most use-cases listed. A fine-grained control feature can be added if there is a need to have preferred leader blacklist at the Topic Level (e.g. have a new topic config at the topic level). > - is not permanent. If we are to have a blacklist leaders config, > rebalancing tools would also need to know about it and manipulate/respect > it to achieve a fair balance. 
> It seems like both problems are tied to balancing partitions, it's just > that KIP-491's use case wants to balance them against other factors in a > more nuanced way. It makes sense to have both be done from the same place Most of the use case, the preferred leader blacklist is temporary. One case I could think of that will be somewhat permanent is the Cross Data Center less powerful AWS instances case. For some critical data which needs protection against data loss because of the whole DC failure. We have 1 on-premise data center, and 2 AWS data centers. The topic/partition replicas are spread to these 3 DCs. The Preferred Leader Blacklist will be somewhat permanent in this case. Even we run reassignments to move all preferred leaders to the On-Premises brokers for existing topics, there is always new topics created and existing topics partitions getting expanded for capacity growth. The new partitions' preferred leaders are not guaranteed to be the on-premises brokers. The topic management (new/expand) code needs some info about blacklist leaders, which is missing now. With the Preferred Leader Blacklist in-place, we can make sure the AWS DC instances broker will not be serving traffic normally, unless the on-prem brokers is down. It's a better safe guard for better performance. > To make note of the motivation section: > > Avoid bouncing broker in order to lose its leadership > The recommended way to make a broker lose its leadership is to run a > reassignment on its partitions Understood. This new preferred leader blacklist feature is trying to improve and make it easier/cleaner/quicker to do it. > > The cross-data center cluster has AWS cloud instances which have less > computing power > We recommend running Kafka on homogeneous machines. 
> It would be cool if the
> system supported more flexibility in that regard but that is more nuanced
> and a preferred leader blacklist may not be the best first approach to the
> issue

We are aware of the recommendation against heterogeneous hardware in a Kafka cluster, but in this case it's more cost-efficient to use AWS than to spin up a new low-latency on-premise DC nearby.

> Adding a new config which can fundamentally change the way replication is
> done is complex, both for the system (the replication code is complex
> enough) and the user. Users would have another potential config that could
> backfire on them - e.g. if left forgotten.

Actually, this newly proposed dynamic config (e.g. preferred_leader_blacklist) should not affect the replication code at all. It just provides extra information when leadership is determined (moving the brokers in the blacklist to the lowest priority), whether during preferred leader election or when a failed broker's leaderships move to other live brokers. Just like any other config, users need to understand exactly what it does and add/remove it according to the issue/situation at hand.
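To make the "lowest priority" point above concrete, here is a rough Python sketch of the intended selection order. The function and parameter names are mine for illustration, not the actual controller code: a blacklisted broker is moved to the end of the candidate order but remains eligible as a last resort.

```python
# Illustrative sketch only -- not the actual Kafka controller code.
# Brokers in the deprioritized list drop to the end of the candidate
# order, but can still be elected leader when no one else is available.

def select_leader(replicas, isr, live_brokers, deprioritized):
    normal = [b for b in replicas if b not in deprioritized]
    last_resort = [b for b in replicas if b in deprioritized]
    for b in normal + last_resort:
        if b in isr and b in live_brokers:
            return b
    return None  # no eligible replica -> partition has no leader

# replicas (1, 2, 3) with broker 1 blacklisted: broker 2 is elected
# even though broker 1 is in ISR ...
assert select_leader([1, 2, 3], {1, 2, 3}, {1, 2, 3}, {1}) == 2
# ... but if brokers 2 and 3 are down, broker 1 still takes leadership.
assert select_leader([1, 2, 3], {1}, {1}, {1}) == 1
```

Note that with an empty blacklist this degenerates to the existing behavior (first in-sync live replica in assignment order), which is why the replica assignment itself never needs to change.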
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Thanks for the KIP. I have put my comments below. This is a nice improvement to avoid cumbersome maintenance.

>> The following is the requirements this KIP is trying to accomplish: The ability to add and remove the preferred leader deprioritized list/blacklist. e.g. new ZK path/node or new dynamic config.

This can be moved to the "Proposed changes" section.

>> The logic to determine the priority/order of which broker should be preferred leader should be modified. The broker in the preferred leader blacklist should be moved to the end (lowest priority) when determining leadership.

I believe there is no change required in the ordering of the preferred replica list. Brokers in the preferred leader blacklist are skipped until the other brokers in the list are unavailable.

>> The blacklist can be at the broker level. However, there might be use cases where a specific topic should blacklist particular brokers, which would be at the Topic level Config. For this use cases of this KIP, it seems that broker level blacklist would suffice. Topic level preferred leader blacklist might be future enhancement work.

I agree that the broker level preferred leader blacklist would be sufficient. Do you have any use cases which require a topic level preferred leader blacklist? You can add the below workaround as an item in the "Rejected alternatives" section: "Reassigning all the topic/partitions which the intended broker is a replica for."

Thanks, Satish.

On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski wrote: > > Hey George, > > Thanks for the KIP, it's an interesting idea. > > I was wondering whether we could achieve the same thing via the > kafka-reassign-partitions tool. As you had also said in the JIRA, it is > true that this is currently very tedious with the tool. My thoughts are > that we could improve the tool and give it the notion of a "blacklisted > preferred leader". > This would have some benefits like: > - more fine-grained control over the blacklist.
we may not want to > blacklist all the preferred leaders, as that would make the blacklisted > broker a follower of last resort which is not very useful. In the cases of > an underpowered AWS machine or a controller, you might overshoot and make > the broker very underutilized if you completely make it leaderless. > - is not permanent. If we are to have a blacklist leaders config, > rebalancing tools would also need to know about it and manipulate/respect > it to achieve a fair balance. > It seems like both problems are tied to balancing partitions, it's just > that KIP-491's use case wants to balance them against other factors in a > more nuanced way. It makes sense to have both be done from the same place > > To make note of the motivation section: > > Avoid bouncing broker in order to lose its leadership > The recommended way to make a broker lose its leadership is to run a > reassignment on its partitions > > The cross-data center cluster has AWS cloud instances which have less > computing power > We recommend running Kafka on homogeneous machines. It would be cool if the > system supported more flexibility in that regard but that is more nuanced > and a preferred leader blacklist may not be the best first approach to the > issue > > Adding a new config which can fundamentally change the way replication is > done is complex, both for the system (the replication code is complex > enough) and the user. Users would have another potential config that could > backfire on them - e.g if left forgotten. > > Could you think of any downsides to implementing this functionality (or a > variation of it) in the kafka-reassign-partitions.sh tool? > One downside I can see is that we would not have it handle new partitions > created after the "blacklist operation". 
As a first iteration I think that > may be acceptable > > Thanks, > Stanislav > > On Fri, Jul 19, 2019 at 3:20 AM George Li > wrote: > > > Hi, > > > > Pinging the list for the feedbacks of this KIP-491 ( > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982 > > ) > > > > > > Thanks, > > George > > > > On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li < > > sql_consult...@yahoo.com.INVALID> wrote: > > > > Hi, > > > > I have created KIP-491 ( > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982) > > for putting a broker to the preferred leader blacklist or deprioritized > > list so when determining leadership, it's moved to the lowest priority for > > some of the listed use-cases. > > > > Please provide your comments/feedbacks. > > > > Thanks, > > George > > > > > > > > - Forwarded Message - From: Jose Armando Garcia Sancio (JIRA) < > > j...@apache.org>To: "sql_consult...@yahoo.com" > > Sent: > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented] > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list) > > > > [ > >
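[Editor's note on the "new ZK path/node or new dynamic config" requirement discussed above: if the blacklist were exposed as a dynamic broker config, operating it could look roughly like the following. The config name preferred.leader.blacklist is hypothetical — the KIP has not finalized a name — but the kafka-configs.sh flags shown are the standard ones for dynamic broker configs.]

```shell
# Hypothetical sketch: deprioritize brokers 1 and 5 for leadership
# via a cluster-wide dynamic config (config name is illustrative).
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default \
  --alter --add-config preferred.leader.blacklist=1,5

# Once the issue is mitigated, remove the blacklist again.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default \
  --alter --delete-config preferred.leader.blacklist
```

This mirrors the add/remove lifecycle George describes: no reassignment plan to generate, batch, or roll back — just a config change in each direction.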
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hey George,

Thanks for the KIP, it's an interesting idea.

I was wondering whether we could achieve the same thing via the kafka-reassign-partitions tool. As you had also said in the JIRA, it is true that this is currently very tedious with the tool. My thoughts are that we could improve the tool and give it the notion of a "blacklisted preferred leader". This would have some benefits like:
- more fine-grained control over the blacklist. We may not want to blacklist all the preferred leaders, as that would make the blacklisted broker a follower of last resort, which is not very useful. In the cases of an underpowered AWS machine or a controller, you might overshoot and make the broker very underutilized if you completely make it leaderless.
- it is not permanent. If we are to have a blacklisted leaders config, rebalancing tools would also need to know about it and manipulate/respect it to achieve a fair balance.

It seems like both problems are tied to balancing partitions, it's just that KIP-491's use case wants to balance them against other factors in a more nuanced way. It makes sense to have both be done from the same place.

To make note of the motivation section:

> Avoid bouncing broker in order to lose its leadership

The recommended way to make a broker lose its leadership is to run a reassignment on its partitions.

> The cross-data center cluster has AWS cloud instances which have less computing power

We recommend running Kafka on homogeneous machines. It would be cool if the system supported more flexibility in that regard, but that is more nuanced and a preferred leader blacklist may not be the best first approach to the issue.

Adding a new config which can fundamentally change the way replication is done is complex, both for the system (the replication code is complex enough) and the user. Users would have another potential config that could backfire on them - e.g. if left forgotten.
Could you think of any downsides to implementing this functionality (or a variation of it) in the kafka-reassign-partitions.sh tool? One downside I can see is that we would not have it handle new partitions created after the "blacklist operation". As a first iteration I think that may be acceptable Thanks, Stanislav On Fri, Jul 19, 2019 at 3:20 AM George Li wrote: > Hi, > > Pinging the list for the feedbacks of this KIP-491 ( > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982 > ) > > > Thanks, > George > > On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li < > sql_consult...@yahoo.com.INVALID> wrote: > > Hi, > > I have created KIP-491 ( > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982) > for putting a broker to the preferred leader blacklist or deprioritized > list so when determining leadership, it's moved to the lowest priority for > some of the listed use-cases. > > Please provide your comments/feedbacks. > > Thanks, > George > > > > - Forwarded Message - From: Jose Armando Garcia Sancio (JIRA) < > j...@apache.org>To: "sql_consult...@yahoo.com" Sent: > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented] > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list) > > [ > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881511#comment-16881511 > ] > > Jose Armando Garcia Sancio commented on KAFKA-8638: > --- > > Thanks for feedback and clear use cases [~sql_consulting]. 
> > > Preferred Leader Blacklist (deprioritized list) > > --- > > > >Key: KAFKA-8638 > >URL: https://issues.apache.org/jira/browse/KAFKA-8638 > >Project: Kafka > > Issue Type: Improvement > > Components: config, controller, core > >Affects Versions: 1.1.1, 2.3.0, 2.2.1 > >Reporter: GEORGE LI > >Assignee: GEORGE LI > >Priority: Major > > > > Currently, the kafka preferred leader election will pick the broker_id > in the topic/partition replica assignments in a priority order when the > broker is in ISR. The preferred leader is the broker id in the first > position of replica. There are use-cases that, even the first broker in the > replica assignment is in ISR, there is a need for it to be moved to the end > of ordering (lowest priority) when deciding leadership during preferred > leader election. > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the > preferred leader. When preferred leadership is run, it will pick 1 as the > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in > ISR, we would like it to be moved to the end of ordering (lowest priority) > when deciding leadership during preferred leader election. Below is a list > of use cases: > > * (If broker_id 1 is a swapped
Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi, Pinging the list for the feedbacks of this KIP-491 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982) Thanks, George On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li wrote: Hi, I have created KIP-491 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982) for putting a broker to the preferred leader blacklist or deprioritized list so when determining leadership, it's moved to the lowest priority for some of the listed use-cases. Please provide your comments/feedbacks. Thanks, George - Forwarded Message - From: Jose Armando Garcia Sancio (JIRA) To: "sql_consult...@yahoo.com" Sent: Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented] (KAFKA-8638) Preferred Leader Blacklist (deprioritized list) [ https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881511#comment-16881511 ] Jose Armando Garcia Sancio commented on KAFKA-8638: --- Thanks for feedback and clear use cases [~sql_consulting]. > Preferred Leader Blacklist (deprioritized list) > --- > > Key: KAFKA-8638 > URL: https://issues.apache.org/jira/browse/KAFKA-8638 > Project: Kafka > Issue Type: Improvement > Components: config, controller, core > Affects Versions: 1.1.1, 2.3.0, 2.2.1 > Reporter: GEORGE LI > Assignee: GEORGE LI > Priority: Major > > Currently, the kafka preferred leader election will pick the broker_id in the > topic/partition replica assignments in a priority order when the broker is in > ISR. The preferred leader is the broker id in the first position of replica. > There are use-cases that, even the first broker in the replica assignment is > in ISR, there is a need for it to be moved to the end of ordering (lowest > priority) when deciding leadership during preferred leader election. > Let’s use topic/partition replica (1,2,3) as an example. 1 is the preferred > leader. 
When preferred leadership is run, it will pick 1 as the leader if > it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not in ISR, > then pick 3 as the leader. There are use cases that, even 1 is in ISR, we > would like it to be moved to the end of ordering (lowest priority) when > deciding leadership during preferred leader election. Below is a list of use > cases: > * (If broker_id 1 is a swapped failed host and brought up with last segments > or latest offset without historical data (There is another effort on this), > it's better for it to not serve leadership till it's caught-up. > * The cross-data center cluster has AWS instances which have less computing > power than the on-prem bare metal machines. We could put the AWS broker_ids > in Preferred Leader Blacklist, so on-prem brokers can be elected leaders, > without changing the reassignments ordering of the replicas. > * If the broker_id 1 is constantly losing leadership after some time: > "Flapping". we would want to exclude 1 to be a leader unless all other > brokers of this topic/partition are offline. The “Flapping” effect was seen > in the past when 2 or more brokers were bad, when they lost leadership > constantly/quickly, the sets of partition replicas they belong to will see > leadership constantly changing. The ultimate solution is to swap these bad > hosts. But for quick mitigation, we can also put the bad hosts in the > Preferred Leader Blacklist to move the priority of its being elected as > leaders to the lowest. > * If the controller is busy serving an extra load of metadata requests and > other tasks. we would like to put the controller's leaders to other brokers > to lower its CPU load. currently bouncing to lose leadership would not work > for Controller, because after the bounce, the controller fails over to > another broker. 
> * Avoid bouncing broker in order to lose its leadership: it would be good if > we have a way to specify which broker should be excluded from serving > traffic/leadership (without changing the replica assignment ordering by > reassignments, even though that's quick), and run preferred leader election. > A bouncing broker will cause temporary URP, and sometimes other issues. Also > a bouncing of broker (e.g. broker_id 1) can temporarily lose all its > leadership, but if another broker (e.g. broker_id 2) fails or gets bounced, > some of its leaderships will likely failover to broker_id 1 on a replica with > 3 brokers. If broker_id 1 is in the blacklist, then in such a scenario even > broker_id 2 offline, the 3rd broker can take leadership. > The current work-around of the above is to change the topic/partition's > replica reassignments to move the broker_id 1 from the first position to the > last position and run preferred leader
[DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)
Hi, I have created KIP-491 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982) for putting a broker to the preferred leader blacklist or deprioritized list so when determining leadership, it's moved to the lowest priority for some of the listed use-cases. Please provide your comments/feedbacks. Thanks, George - Forwarded Message - From: Jose Armando Garcia Sancio (JIRA) To: "sql_consult...@yahoo.com" Sent: Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented] (KAFKA-8638) Preferred Leader Blacklist (deprioritized list) [ https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881511#comment-16881511 ] Jose Armando Garcia Sancio commented on KAFKA-8638: --- Thanks for feedback and clear use cases [~sql_consulting]. > Preferred Leader Blacklist (deprioritized list) > --- > > Key: KAFKA-8638 > URL: https://issues.apache.org/jira/browse/KAFKA-8638 > Project: Kafka > Issue Type: Improvement > Components: config, controller, core > Affects Versions: 1.1.1, 2.3.0, 2.2.1 > Reporter: GEORGE LI > Assignee: GEORGE LI > Priority: Major > > Currently, the kafka preferred leader election will pick the broker_id in the > topic/partition replica assignments in a priority order when the broker is in > ISR. The preferred leader is the broker id in the first position of replica. > There are use-cases that, even the first broker in the replica assignment is > in ISR, there is a need for it to be moved to the end of ordering (lowest > priority) when deciding leadership during preferred leader election. > Let’s use topic/partition replica (1,2,3) as an example. 1 is the preferred > leader. When preferred leadership is run, it will pick 1 as the leader if > it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not in ISR, > then pick 3 as the leader. 
There are use cases that, even 1 is in ISR, we > would like it to be moved to the end of ordering (lowest priority) when > deciding leadership during preferred leader election. Below is a list of use > cases: > * (If broker_id 1 is a swapped failed host and brought up with last segments > or latest offset without historical data (There is another effort on this), > it's better for it to not serve leadership till it's caught-up. > * The cross-data center cluster has AWS instances which have less computing > power than the on-prem bare metal machines. We could put the AWS broker_ids > in Preferred Leader Blacklist, so on-prem brokers can be elected leaders, > without changing the reassignments ordering of the replicas. > * If the broker_id 1 is constantly losing leadership after some time: > "Flapping". we would want to exclude 1 to be a leader unless all other > brokers of this topic/partition are offline. The “Flapping” effect was seen > in the past when 2 or more brokers were bad, when they lost leadership > constantly/quickly, the sets of partition replicas they belong to will see > leadership constantly changing. The ultimate solution is to swap these bad > hosts. But for quick mitigation, we can also put the bad hosts in the > Preferred Leader Blacklist to move the priority of its being elected as > leaders to the lowest. > * If the controller is busy serving an extra load of metadata requests and > other tasks. we would like to put the controller's leaders to other brokers > to lower its CPU load. currently bouncing to lose leadership would not work > for Controller, because after the bounce, the controller fails over to > another broker. > * Avoid bouncing broker in order to lose its leadership: it would be good if > we have a way to specify which broker should be excluded from serving > traffic/leadership (without changing the replica assignment ordering by > reassignments, even though that's quick), and run preferred leader election. 
> A bouncing broker will cause temporary URP, and sometimes other issues. Also > a bouncing of broker (e.g. broker_id 1) can temporarily lose all its > leadership, but if another broker (e.g. broker_id 2) fails or gets bounced, > some of its leaderships will likely failover to broker_id 1 on a replica with > 3 brokers. If broker_id 1 is in the blacklist, then in such a scenario even > broker_id 2 offline, the 3rd broker can take leadership. > The current work-around of the above is to change the topic/partition's > replica reassignments to move the broker_id 1 from the first position to the > last position and run preferred leader election. e.g. (1, 2, 3) => (2, 3, 1). > This changes the replica reassignments, and we need to keep track of the > original one and restore if things change (e.g. controller fails over to > another broker, the swapped empty