[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2019-05-27 Thread Jiangjie Qin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848809#comment-16848809
 ] 

Jiangjie Qin commented on KAFKA-6029:
-

[~hachikuji] Yes, I think the issue should have been resolved. There are 
actually two scenarios mentioned in this ticket.

The original one is more of an issue that the controller messages are processed 
at different time in different brokers. This should have been addressed by 
[KIP-291|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-291%3A+Separating+controller+connections+and+requests+from+the+data+plane]].

Another issue was that shutting down brokers are incorrectly added back to ISR 
by the leader, which is addressed by 
[KIP-320|[https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation]]
 as you mentioned.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2019-05-20 Thread Jason Gustafson (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844388#comment-16844388
 ] 

Jason Gustafson commented on KAFKA-6029:


I think we can actually resolve this as a unintended benefit of 
[KIP-320|https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation].
 When the controller shrinks the ISR, it bumps the epoch. The bumped epoch 
prevents the shutting down follower from being added back to the ISR. The 
controller may still send a LeaderAndIsr request to the shutting down broker 
with the updated epoch, but the shutting down broker will not restart the 
fetcher.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2019-05-15 Thread Colin P. McCabe (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840876#comment-16840876
 ] 

Colin P. McCabe commented on KAFKA-6029:


This sounds like a good improvement.  I can take the JIRA, if nobody else is 
working on it.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2019-03-13 Thread Jun Rao (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791910#comment-16791910
 ] 

Jun Rao commented on KAFKA-6029:


[~hzxa21], do you still plan to work on this? Thanks.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2019-02-17 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770518#comment-16770518
 ] 

Matthias J. Sax commented on KAFKA-6029:


Moving all major/minor/trivial tickets that are not merged yet out of 2.2 
release.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Major
> Fix For: 2.2.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2017-10-10 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199794#comment-16199794
 ] 

Jun Rao commented on KAFKA-6029:


[~becket_qin], yes, that's the rough idea.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
> Fix For: 1.1.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2017-10-10 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199706#comment-16199706
 ] 

Jiangjie Qin commented on KAFKA-6029:
-

[~junrao] Good point. That seems more likely to happen. Just to check if I 
understand correctly. Are you suggesting the following solution?
1. Let each broker have an epoch which changes on restart.
2. During controlled shtudown, the controller will send LeaderAndIsrRequest 
with the new ISR + shutting down broker with epoch.
3. Add the broker epoch to the FetchRequest so the each follower will send 
FetchRequest with their broker epoch.
4. If the leader sees a fetch request from a broker that matches the shutting 
down broker and epoch it will not add it back to the ISR.
5. After the broker restarts, the leaders will see a new broker epoch and add 
the restarted broker back to ISR.



> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
> Fix For: 1.1.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2017-10-10 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199314#comment-16199314
 ] 

Jun Rao commented on KAFKA-6029:


One of the issues that can lead to this is that the follower and the leader may 
receive LeaderAndIsrRequests at different times. So, if the leader receives a 
LeaderAndIsrRequest with reduced ISR due to controlled shutdown of the 
follower, but the follower continues to fetch (since it hasn't received the 
LeaderAndIsrRequest yet), the follower will by added back to ISR. Then, when 
the follower shuts down, we have to wait for replica.lag.time.max.ms for the 
follower to be dropped out ISR. 

Onur and I discussed this a bit. One way to improve this is for the 
LeaderAndIsrRequest to indicate that a replica is about to go down such that 
the leader doesn't add it back to ISR. That indication could be piggy-backed on 
a broker epoch, which is needed in 
https://issues.apache.org/jira/browse/KAFKA-1120.

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
> Fix For: 1.1.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2017-10-09 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197573#comment-16197573
 ] 

Ismael Juma commented on KAFKA-6029:


cc [~junrao] [~onurkaraman]

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
> Fix For: 1.1.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)