Re: Kraft controller readiness checks

2024-04-22 Thread Francesco Burato
I’ll join Dima in thanking you, Luke. This does indeed seem to be a good way
of enforcing safe restarts.

Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com | c. +44 747 9029370

Re: Kraft controller readiness checks

2024-04-21 Thread Dima Brodsky
Thanks Luke, this helps for our use case. It does not cover the buildout of a
new cluster where there are no brokers yet, but that should be remedied by
KIP-919, which looks to be resolved in 3.7.0.
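For anyone else following along: my understanding is that with KIP-919 in
3.7.0 you can point the quorum tooling directly at a controller, with no
brokers in the cluster yet, along these lines (untested sketch; the host name
is made up and the exact flag name is worth double checking against the
3.7.0 docs):

# Untested sketch, assuming Kafka 3.7.0+ with KIP-919 in the CLI tools:
# query quorum status directly from a controller listener, no brokers needed.
bin/kafka-metadata-quorum.sh \
  --bootstrap-controller controller-0.kafka-controllers.kafka.svc:9093 \
  describe --status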

ttyl
Dima

Re: Kraft controller readiness checks

2024-04-21 Thread Luke Chen
Hi Frank,

About your question:
> Unless this is already available but not well publicised in the
documentation, ideally there should be protocol working on the controller
ports that answers to operational questions like “are metadata partitions
in sync?”, “has the current controller converged with other members of the
quorum?”.

I'm sorry, but the KRaft controller uses the Raft protocol, so there is no
"in-sync replica" definition like in the data replication protocol. What we
did for our check is described here
<https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md#the-new-quorum-check>.
In short, we use `controller.quorum.fetch.timeout.ms` and
`replicaLastCaughtUpTimestamp` to determine if it's safe to roll this
controller pod.
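To make the idea a bit more concrete, a rough sketch of that kind of check
from the command line could look like the one below. This is not the actual
Strimzi code, the column layout of kafka-metadata-quorum.sh may differ
between versions, and the bootstrap address is just an example:

# Rough, untested sketch of the quorum check idea: a follower controller is
# considered caught up if its LastCaughtUpTimestamp is within
# controller.quorum.fetch.timeout.ms of the current time.
FETCH_TIMEOUT_MS=2000            # value of controller.quorum.fetch.timeout.ms
NOW_MS=$(($(date +%s) * 1000))

# Exits non-zero (not safe to roll) if any follower looks behind.
bin/kafka-metadata-quorum.sh --bootstrap-server broker-0:9092 \
  describe --replication \
  | awk -v now="$NOW_MS" -v timeout="$FETCH_TIMEOUT_MS" '
      $NF == "Follower" && (now - $(NF-1)) > timeout { behind++ }
      END { exit behind > 0 }'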

Hope this helps.

Thank you.
Luke

Re: Kraft controller readiness checks

2024-04-19 Thread Francesco Burato
Hi Luke,

Thanks for the answers. I understand what you are describing in terms of the
rationale for using just the availability of the controller port to determine
the readiness of the controller, but that is not fully satisfying from an
operational perspective, at least given the lack of sufficient documentation
on the matter. Based on my understanding of KRaft, which I admit is limited,
the controllers host the cluster metadata partitions on disk and make them
available to the brokers. So, presumably, one of the purposes of the
controllers is to ensure that the metadata partitions are properly
replicated. Hence, what happens, even in a non-K8s environment, if all
controllers go down? What sort of outage does the wider cluster experience in
that circumstance?

A complete outage of the controllers is of course an extreme scenario, but a
more likely one is that a controller’s disk goes offline and needs to be
replaced. In this scenario, the controller will have to reconstruct the
cluster metadata from scratch from the other controllers in the quorum, but
it presumably cannot participate in the quorum until the metadata partitions
are fully replicated. Based on this assumption, the mere availability of the
controller port does not necessarily mean that I can safely shut down another
controller, because replication may not have completed yet.

As I mentioned earlier, I don’t know the details of KRaft in sufficient
detail to evaluate whether my assumptions are warranted, but the official
documentation does not seem to go into much detail on how to safely operate a
cluster in KRaft mode, while it provides very good information on how to
safely operate a ZK-based cluster by highlighting that URPs and leader
elections must be kept under control during restarts.

Unless this is already available but not well publicised in the
documentation, ideally there should be a protocol exposed on the controller
ports that answers operational questions like “are the metadata partitions in
sync?” and “has the current controller converged with the other members of
the quorum?”.

It goes without saying that if any of these topics are properly covered
anywhere in the docs, I am more than happy to be RTFMed to the right place.

As for the other points you raise: we have a very particular set-up for our
Kafka clusters that makes the circumstance you highlight not a problem. In
particular, our consumers and producers are all internal to a namespace and
can connect to non-ready brokers. Given that the URP script checks the global
URP state rather than just the URP state of the individual broker, as long as
even one broker is marked as ready, the entire cluster is safe. With the
ordered rotation imposed by the statefulset parallel rolling restart,
together with the URP readiness check and the PDB, we are guaranteed not to
cause any read or write errors. Rotations are rather long, but we don’t
really care about speed.
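For what it’s worth, the PDB side of it is nothing exotic: with
min.insync.replicas=2 it boils down to allowing at most one broker pod to be
voluntarily disrupted at a time, roughly like this (the name and selector are
illustrative, not our actual manifests):

# Illustrative only: with min.insync.replicas=2, maxUnavailable = 2 - 1 = 1,
# so voluntary evictions never remove more brokers than the ISR can lose.
kubectl create poddisruptionbudget kafka-brokers \
  --selector=app=kafka-broker \
  --max-unavailable=1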

Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | bur...@adobe.com | c. +44 747 9029370

Re: Kraft controller readiness checks

2024-04-18 Thread Luke Chen
Hello Frank,

That's a good question.
I think we all know there is no "correct" answer for this question. But I
can share with you what our team did for it.

Readiness: controller is listening on the controller.listener.names

The rationale behind it is:
1. The last step of the controller node startup is to wait for all the
SocketServer ports to be open and the Acceptors to be started, and the
controller port is one of them.
2. This controller listener is used to talk to other controllers (voters)
to form the raft quorum, so if it is not open and listening, the controller
is basically not working at all.
3. The controller listener is also used for brokers (observers) to get the
updated raft quorum info and fetch metadata.
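In probe terms this is just a TCP connect check against the controller
listener, e.g. something like this (9093 is an assumed controller port, use
whatever your controller.listener.names maps to):

# Minimal sketch of the readiness command: succeed only if the controller
# listener accepts a TCP connection.
timeout 2 bash -c '< /dev/tcp/127.0.0.1/9093'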

Compared with the Zookeeper cluster, which the KRaft quorum is trying to
replace, the liveness/readiness probe recommended in the Kubernetes tutorial
<https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#testing-for-liveness>
is also doing a "ruok" check for the pod. And the handler for this "ruok"
command on the Zookeeper server side returns "imok" directly, which means
it's just doing a connection check. So we think this check makes sense.

Here's our design proposal for the Liveness and Readiness probes in a KRaft
Kafka cluster, FYI.
But again, I still think there's no "correct" answer for it. If you have
any better ideas, please let us know.

However, I have some suggestions for your readiness probe for brokers:

> our brokers are configured to use a script which marks the containers as
unready if under-replicated partitions exist. With this readiness check and
a pod disruption budget of the minimum in sync replica - 1

I understand it works well, but it has some drawbacks, and the biggest
issue I can think of is: it's possible to cause unavailability in some
partitions.
For example: 3 brokers in the cluster (0, 1, 2), and 10 topic partitions are
hosted on broker 0.
a. Broker 0 is shutting down, and all partitions on broker 0 become
followers.
b. Broker 0 is starting up, and all the followers are trying to catch up with
the leaders.
c. 9 out of 10 partitions have caught up and rejoined the ISR. At this point,
the pod is still unready because 1 partition is still under-replicated.
d. Some of the partitions on broker 0 become leaders, for example because
auto leader rebalancing is triggered.
e. The leader partitions on broker 0 are now unavailable because the pod is
not in a ready state and cannot serve incoming requests.

In our team, we use the brokerState metric value = RUNNING state for the
readiness probe. In KRaft mode, the broker enters the RUNNING state after it
has caught up with the controller's metadata and starts to serve requests
from clients. We think that makes more sense.
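As a very rough illustration of that kind of probe (untested; it assumes
remote JMX is enabled on port 9999, that JmxTool is available in your
distribution, and that RUNNING maps to the numeric value 3 in your Kafka
version, all of which are worth verifying):

#!/usr/bin/env bash
# Untested sketch: report ready only while BrokerState == RUNNING.
set -euo pipefail

STATE=$(bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://127.0.0.1:9999/jmxrmi \
  --object-name kafka.server:type=KafkaServer,name=BrokerState \
  --one-time true 2>/dev/null | tail -n 1 | tr -d '"' | awk -F, '{print $NF}')

# RUNNING is assumed to be 3 here; check the BrokerState values for your version.
[ "${STATE%%.*}" = "3" ]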
Again, for more details, you can check the design proposal for the Liveness
and Readiness probes in a KRaft Kafka cluster.

Finally, I saw you don't have operators for your Kafka clusters.
I don't know how you manage all these Kafka clusters manually, but there must
be some cumbersome operations, like rolling pods.
Let's say you want to roll the pods one by one: which pod will you roll
first?
And which pod goes last?
Will you do any checks before rolling?
How much time does each roll take?
...

I'm just listing some of the problems that might come up. So I would
recommend deploying an operator to help manage the Kafka clusters.
This is our design proposal
<https://github.com/strimzi/proposals/blob/main/060-kafka-roller-kraft.md>
for the Kafka roller in the operator for KRaft, FYI.

And now, I'm totally biased, but Strimzi provides a fully open-source
operator to manage Kafka clusters on Kubernetes.
Welcome to try it (hopefully it will help you manage your Kafka clusters),
join the community to ask questions, join discussions, or contribute to it.

Thank you.
Luke

Kraft controller readiness checks

2024-04-18 Thread Francesco Burato
Hello,

I have a question regarding the deployment of Kafka using KRaft controllers
in a Kubernetes environment. Our current Kafka cluster is deployed on K8S
clusters as statefulsets without operators, and our brokers are configured to
use a script which marks the containers as unready if under-replicated
partitions exist. With this readiness check and a pod disruption budget of
the minimum in-sync replicas - 1, we are able to perform rollout restarts of
our brokers automatically without ever producing consumer or producer errors.
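For context, the readiness script is essentially along the lines of the
sketch below (simplified, not our exact script; the path and listener address
are illustrative):

#!/usr/bin/env bash
# Simplified sketch: the container is ready only while the cluster reports no
# under-replicated partitions at all (a global check, not a per-broker one).
set -euo pipefail

URP=$(/opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions 2>/dev/null)

# Exit 0 (ready) only when the listing is empty.
[ -z "$URP" ]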

We have started the process of transitioning to KRaft and, based on the
recommended deployment strategy, we are going to define dedicated nodes as
controllers instead of using combined servers. However, defining nodes as
controllers does not seem to allow using the same strategy for the readiness
check, as kafka-topics.sh does not appear to be executable against
controller-only nodes.

The question is: what is a reliable readiness check for KRaft controllers
that ensures rollout restarts can be performed safely?

Thanks,

Frank

--
Francesco Burato | Software Development Engineer | Adobe | 
bur...@adobe.com