[jira] [Commented] (KAFKA-16310) ListOffsets doesn't report the offset with maxTimestamp anymore

2024-03-04 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823280#comment-17823280
 ] 

Jason Gustafson commented on KAFKA-16310:
-

[~chia7712] Thanks for the analysis. Ironically, my original approach was to 
fix the compressed path to match the uncompressed path so that we passed along 
the actual offset of max timestamp. I ended up doing the opposite though. If 
either you or [~showuon] have time to submit a fix, I am happy to review. 

> ListOffsets doesn't report the offset with maxTimestamp anymore
> ---
>
> Key: KAFKA-16310
> URL: https://issues.apache.org/jira/browse/KAFKA-16310
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Emanuele Sabellico
>Priority: Blocker
> Fix For: 3.7.1
>
>
> Updated: This is confirmed to be a regression in v3.7.0. 
> The impact of this issue is that when a batch contains records whose timestamps 
> are not in order, the offset reported for the max timestamp will be wrong (e.g. 
> the timestamp for t0 should map to offset 10, but offset 12 is returned 
> instead). This also causes the time index to store the wrong offset, so lookups 
> by timestamp will return unexpected results. 
> ===
> The last offset is reported instead.
> A test in librdkafka (0081/do_test_ListOffsets) is failing; it checks 
> that the offset with the max timestamp is the middle one and not the last 
> one. The test passes with 3.6.0 and previous versions.
> This is the test:
> [https://github.com/confluentinc/librdkafka/blob/a6d85bdbc1023b1a5477b8befe516242c3e182f6/tests/0081-admin.c#L4989]
>  
> there are three messages, with timestamps:
> {noformat}
> t0 + 100
> t0 + 400
> t0 + 250{noformat}
> and indices 0, 1, 2. 
> Then a ListOffsets with RD_KAFKA_OFFSET_SPEC_MAX_TIMESTAMP is done.
> It should return offset 1, but in 3.7.0 and trunk it returns offset 2.
> Even 5 seconds after producing, it still returns 2 as the offset with the 
> max timestamp.
> ProduceRequest and ListOffsets were sent to the same broker (2), the leader 
> didn't change.
> {code:java}
> %7|1709134230.019|SEND|0081_admin#producer-3| 
> [thrd:localhost:39951/bootstrap]: localhost:39951/2: Sent ProduceRequest (v7, 
> 206 bytes @ 0, CorrId 2) %7|1709134230.020|RECV|0081_admin#producer-3| 
> [thrd:localhost:39951/bootstrap]: localhost:39951/2: Received ProduceResponse 
> (v7, 95 bytes, CorrId 2, rtt 1.18ms) 
> %7|1709134230.020|MSGSET|0081_admin#producer-3| 
> [thrd:localhost:39951/bootstrap]: localhost:39951/2: 
> rdkafkatest_rnd22e8d8ec45b53f98_do_test_ListOffsets [0]: MessageSet with 3 
> message(s) (MsgId 0, BaseSeq -1) delivered {code}
> {code:java}
> %7|1709134235.021|SEND|0081_admin#producer-2| 
> [thrd:localhost:39951/bootstrap]: localhost:39951/2: Sent ListOffsetsRequest 
> (v7, 103 bytes @ 0, CorrId 7) %7|1709134235.022|RECV|0081_admin#producer-2| 
> [thrd:localhost:39951/bootstrap]: localhost:39951/2: Received 
> ListOffsetsResponse (v7, 88 bytes, CorrId 7, rtt 0.54ms){code}
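
For reference, the same check can be reproduced with the Java clients. Below is a minimal sketch; the bootstrap address and topic name are illustrative, and the topic is assumed to already exist with a single partition:
{code:java}
// Produce three records with out-of-order timestamps, then ask ListOffsets for
// the offset of the max timestamp. Expected result: offset 1 (the middle record).
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import java.util.Collections;
import java.util.Properties;

public class MaxTimestampCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        long t0 = System.currentTimeMillis();
        TopicPartition tp = new TopicPartition("test-topic", 0); // illustrative topic

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), t0 + 100, null, "m0"));
            producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), t0 + 400, null, "m1"));
            producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), t0 + 250, null, "m2"));
        }

        try (Admin admin = Admin.create(props)) {
            ListOffsetsResult.ListOffsetsResultInfo info = admin
                .listOffsets(Collections.singletonMap(tp, OffsetSpec.maxTimestamp()))
                .partitionResult(tp)
                .get();
            // Expected: offset 1 (the record with timestamp t0 + 400).
            System.out.println("offset=" + info.offset() + " timestamp=" + info.timestamp());
        }
    }
}
{code}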



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16179) NPE handle ApiVersions during controller failover

2024-01-22 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809591#comment-17809591
 ] 

Jason Gustafson commented on KAFKA-16179:
-

It is probably not a blocker. It looks like a race condition on 
controller shutdown. The impact is probably just a failed connection to a 
controller that is shutting down.

> NPE handle ApiVersions during controller failover
> -
>
> Key: KAFKA-16179
> URL: https://issues.apache.org/jira/browse/KAFKA-16179
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> {code:java}
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.kafka.metadata.publisher.FeaturesPublisher.features()" because 
> the return value of "kafka.server.ControllerServer.featuresPublisher()" is 
> null
>   at 
> kafka.server.ControllerServer.$anonfun$startup$12(ControllerServer.scala:242)
>   at 
> kafka.server.SimpleApiVersionManager.features(ApiVersionManager.scala:122)
>   at 
> kafka.server.SimpleApiVersionManager.apiVersionResponse(ApiVersionManager.scala:111)
>   at 
> kafka.server.ControllerApis.$anonfun$handleApiVersionsRequest$3(ControllerApis.scala:566)
>   at 
> kafka.server.ControllerApis.$anonfun$handleApiVersionsRequest$3$adapted(ControllerApis.scala:566
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16179) NPE handle ApiVersions during controller failover

2024-01-19 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-16179:
---

 Summary: NPE handle ApiVersions during controller failover
 Key: KAFKA-16179
 URL: https://issues.apache.org/jira/browse/KAFKA-16179
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


{code:java}
java.lang.NullPointerException: Cannot invoke 
"org.apache.kafka.metadata.publisher.FeaturesPublisher.features()" because the 
return value of "kafka.server.ControllerServer.featuresPublisher()" is null
at 
kafka.server.ControllerServer.$anonfun$startup$12(ControllerServer.scala:242)
at 
kafka.server.SimpleApiVersionManager.features(ApiVersionManager.scala:122)
at 
kafka.server.SimpleApiVersionManager.apiVersionResponse(ApiVersionManager.scala:111)
at 
kafka.server.ControllerApis.$anonfun$handleApiVersionsRequest$3(ControllerApis.scala:566)
at 
kafka.server.ControllerApis.$anonfun$handleApiVersionsRequest$3$adapted(ControllerApis.scala:566
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16012) Incomplete range assignment in consumer

2023-12-14 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796889#comment-17796889
 ] 

Jason Gustafson commented on KAFKA-16012:
-

This appears to be a regression introduced from KIP-951. I have a test case 
which reproduces the problem: 
[https://github.com/hachikuji/kafka/commit/489b81f81b401a28a604efcfce5059047558fe3e.]
 Basically we are losing some of the partition state after invoking 
`Metadata.updatePartitionLeadership`. And I do see this in the logs just before 
the assignment:
{code:java}
[2023-12-13 04:58:55,738] DEBUG [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] For input-topic-1, received 
error FENCED_LEADER_EPOCH, with leaderIdAndEpoch LeaderIdAndEpoch(leaderId=2, 
leaderEpoch=3) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2023-12-13 04:58:55,738] DEBUG [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] For input-topic-1, as the 
leader was updated, position will be validated. 
(org.apache.kafka.clients.consumer.internals.AbstractFetch) {code}

> Incomplete range assignment in consumer
> ---
>
> Key: KAFKA-16012
> URL: https://issues.apache.org/jira/browse/KAFKA-16012
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 3.7.0
>
>
> We were looking into test failures here: 
> https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1702475525--jolshan--kafka-15784--7cad567675/2023-12-13--001./2023-12-13–001./report.html.
>  
> Here is the first failure in the report:
> {code:java}
> 
> test_id:    
> kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=brokers
> status:     FAIL
> run time:   3 minutes 4.950 seconds
>     TimeoutError('Consumer consumed only 88223 out of 10 messages in 
> 90s') {code}
>  
> We traced the failure to an apparent bug during the last rebalance before the 
> group became empty. The last remaining instance seems to receive an 
> incomplete assignment which prevents it from completing expected consumption 
> on some partitions. Here is the rebalance from the coordinator's perspective:
> {code:java}
> server.log.2023-12-13-04:[2023-12-13 04:58:56,987] INFO [GroupCoordinator 3]: 
> Stabilized group grouped-transactions-test-consumer-group generation 5 
> (__consumer_offsets-2) with 1 members 
> (kafka.coordinator.group.GroupCoordinator)
> server.log.2023-12-13-04:[2023-12-13 04:58:56,990] INFO [GroupCoordinator 3]: 
> Assignment received from leader 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  for group grouped-transactions-test-consumer-group for generation 5. The 
> group has 1 members, 0 of which are static. 
> (kafka.coordinator.group.GroupCoordinator) {code}
> The group is down to one member in generation 5. In the previous generation, 
> the consumer in question reported this assignment:
> {code:java}
> // Gen 4: we've got partitions 0-4
> [2023-12-13 04:58:52,631] DEBUG [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete 
> with generation 4 and memberId 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2023-12-13 04:58:52,631] INFO [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Notifying assignor about 
> the new Assignment(partitions=[input-topic-0, input-topic-1, input-topic-2, 
> input-topic-3, input-topic-4]) 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}
> However, in generation 5, we seem to be assigned only one partition:
> {code:java}
> // Gen 5: Now we have only partition 1? But aren't we the last member in the 
> group?
> [2023-12-13 04:58:56,954] DEBUG [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete 
> with generation 5 and memberId 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2023-12-13 04:58:56,955] INFO [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Notifying assignor about 
> the new Assignment(partitions=[input-topic-1]) 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}

[jira] [Created] (KAFKA-16012) Incomplete range assignment in consumer

2023-12-13 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-16012:
---

 Summary: Incomplete range assignment in consumer
 Key: KAFKA-16012
 URL: https://issues.apache.org/jira/browse/KAFKA-16012
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
 Fix For: 3.7.0


We were looking into test failures here: 
https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1702475525--jolshan--kafka-15784--7cad567675/2023-12-13--001./2023-12-13–001./report.html.
 

Here is the first failure in the report:
{code:java}

test_id:    
kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=brokers
status:     FAIL
run time:   3 minutes 4.950 seconds


    TimeoutError('Consumer consumed only 88223 out of 10 messages in 90s') 
{code}
 

We traced the failure to an apparent bug during the last rebalance before the 
group became empty. The last remaining instance seems to receive an incomplete 
assignment which prevents it from completing expected consumption on some 
partitions. Here is the rebalance from the coordinator's perspective:
{code:java}
server.log.2023-12-13-04:[2023-12-13 04:58:56,987] INFO [GroupCoordinator 3]: 
Stabilized group grouped-transactions-test-consumer-group generation 5 
(__consumer_offsets-2) with 1 members (kafka.coordinator.group.GroupCoordinator)
server.log.2023-12-13-04:[2023-12-13 04:58:56,990] INFO [GroupCoordinator 3]: 
Assignment received from leader 
consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
 for group grouped-transactions-test-consumer-group for generation 5. The group 
has 1 members, 0 of which are static. 
(kafka.coordinator.group.GroupCoordinator) {code}
The group is down to one member in generation 5. In the previous generation, 
the consumer in question reported this assignment:
{code:java}
// Gen 4: we've got partitions 0-4
[2023-12-13 04:58:52,631] DEBUG [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete with 
generation 4 and memberId 
consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2023-12-13 04:58:52,631] INFO [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] Notifying assignor about the 
new Assignment(partitions=[input-topic-0, input-topic-1, input-topic-2, 
input-topic-3, input-topic-4]) 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}
However, in generation 5, we seem to be assigned only one partition:
{code:java}
// Gen 5: Now we have only partition 1? But aren't we the last member in the 
group?
[2023-12-13 04:58:56,954] DEBUG [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete with 
generation 5 and memberId 
consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
 (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2023-12-13 04:58:56,955] INFO [Consumer 
clientId=consumer-grouped-transactions-test-consumer-group-1, 
groupId=grouped-transactions-test-consumer-group] Notifying assignor about the 
new Assignment(partitions=[input-topic-1]) 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}
The assignment type is range from the JoinGroup for generation 5. The decoded 
metadata sent by the consumer is this:
{code:java}
Subscription(topics=[input-topic], ownedPartitions=[], groupInstanceId=null, 
generationId=4, rackId=null) {code}
Here is the decoded assignment from the SyncGroup:
{code:java}
Assignment(partitions=[input-topic-1]) {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15828) Protect clients from broker hostname reuse

2023-11-14 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-15828:
---

 Summary: Protect clients from broker hostname reuse
 Key: KAFKA-15828
 URL: https://issues.apache.org/jira/browse/KAFKA-15828
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


In some environments such as k8s, brokers may be assigned to nodes dynamically 
from an available pool. When a cluster is rolling, it is possible for the 
client to see the same node advertised for different broker IDs in a short 
period of time. For example, kafka-1 might be initially assigned to node1. 
Before the client is able to establish a connection, it could be that kafka-3 
is now on node1 instead. Currently there is no protection in the client or in 
the protocol for this scenario. If the connection succeeds, the client will 
assume it has a good connection to kafka-1. Until something disrupts the 
connection, it will continue under this assumption even if the hostname for 
kafka-1 changes.

We have observed this scenario in practice. The client connected to the wrong 
broker through stale hostname information. It was unable to produce data 
because of persistent NOT_LEADER errors. The only way to recover in the end was 
by restarting the client to force a reconnection.

We have discussed a couple of potential solutions to this problem:
 # Let the client be smarter about managing the connection/hostname mapping. When 
it detects that a hostname has changed, it should force a disconnect to ensure it 
connects to the right node.
 # We can modify the protocol to verify that the client has connected to the 
intended broker. For example, we can add a field to ApiVersions to indicate the 
intended broker ID. The broker receiving the request can return an error if its 
ID does not match that in the request.

Are there alternatives? 
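
As a purely hypothetical illustration of option 2 in the list above, the verification could take the shape of an extra field in the ApiVersionsRequest schema. The field name, version, and semantics below are invented for illustration, not an actual proposal in the codebase:
{code:java}
// Hypothetical addition to ApiVersionsRequest.json -- name and version are invented.
{ "name": "IntendedBrokerId", "type": "int32", "versions": "4+", "ignorable": true,
  "about": "The id of the broker the client believes it is connecting to. A broker whose id does not match could return a retriable error so that the client refreshes metadata and reconnects." }
{code}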

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15221) Potential race condition between requests from rebooted followers

2023-10-11 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774161#comment-17774161
 ] 

Jason Gustafson commented on KAFKA-15221:
-

We merged the fix to trunk. Given that it is a rare case which has existed for 
some time, we decided to target 3.7 forward and not cherry pick for older 
releases.

> Potential race condition between requests from rebooted followers
> -
>
> Key: KAFKA-15221
> URL: https://issues.apache.org/jira/browse/KAFKA-15221
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Blocker
> Fix For: 3.7.0
>
>
> When the leader processes a fetch request, it does not acquire locks when 
> updating the replica fetch state. This allows a race between fetch 
> requests from a rebooted follower:
> T0: broker 1 sends a fetch to broker 0 (the leader). At this moment, broker 1 is 
> not in the ISR.
> T1: broker 1 crashes.
> T2: broker 1 comes back online, receives a new broker epoch, and sends a 
> new Fetch request.
> T3: broker 0 receives the old fetch request and decides to expand the ISR.
> T4: right before broker 0 starts to fill in the AlterPartition request, the new 
> fetch request comes in and overwrites the fetch state. Broker 0 then uses the 
> new broker epoch in the AlterPartition request.
> In this way, the AlterPartition request can get around KIP-903 and wrongly 
> update the ISR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15221) Potential race condition between requests from rebooted followers

2023-10-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-15221.
-
Fix Version/s: (was: 3.5.2)
   Resolution: Fixed

> Potential race condition between requests from rebooted followers
> -
>
> Key: KAFKA-15221
> URL: https://issues.apache.org/jira/browse/KAFKA-15221
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.5.0
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Blocker
> Fix For: 3.7.0
>
>
> When the leader processes a fetch request, it does not acquire locks when 
> updating the replica fetch state. This allows a race between fetch 
> requests from a rebooted follower:
> T0: broker 1 sends a fetch to broker 0 (the leader). At this moment, broker 1 is 
> not in the ISR.
> T1: broker 1 crashes.
> T2: broker 1 comes back online, receives a new broker epoch, and sends a 
> new Fetch request.
> T3: broker 0 receives the old fetch request and decides to expand the ISR.
> T4: right before broker 0 starts to fill in the AlterPartition request, the new 
> fetch request comes in and overwrites the fetch state. Broker 0 then uses the 
> new broker epoch in the AlterPartition request.
> In this way, the AlterPartition request can get around KIP-903 and wrongly 
> update the ISR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14694) RPCProducerIdManager should not wait for a new block

2023-06-22 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14694.
-
Fix Version/s: 3.6.0
   Resolution: Fixed

> RPCProducerIdManager should not wait for a new block
> 
>
> Key: KAFKA-14694
> URL: https://issues.apache.org/jira/browse/KAFKA-14694
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jeff Kim
>Assignee: Jeff Kim
>Priority: Major
> Fix For: 3.6.0
>
>
> RPCProducerIdManager initiates an async request to the controller to grab a 
> block of producer IDs and then blocks waiting for a response from the 
> controller.
> This is done in the request handler threads while holding a global lock. This 
> means that if many producers are requesting producer IDs and the controller 
> is slow to respond, many threads can get stuck waiting for the lock.
> This may also be a deadlock concern in the following scenario: if the controller 
> has 1 request handler thread (1 chosen for simplicity) and receives an 
> InitProducerId request, it may deadlock.
> Basically, any time the controller has N InitProducerId requests where N >= the 
> number of request handler threads, there is the potential to deadlock.
> Consider this:
> 1. The request handler thread tries to handle an InitProducerId request to 
> the controller by forwarding an AllocateProducerIds request.
> 2. The request handler thread then waits on the controller response (timed 
> poll on nextProducerIdBlock).
> 3. The controller's request handler threads need to pick up and handle this 
> request, but they are blocked waiting for the forwarded AllocateProducerIds 
> response.
>  
> We should not block while waiting for a new block and instead return 
> immediately to free the request handler threads.
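
A rough sketch of the non-blocking shape described above (simplified, hypothetical types; not the actual RPCProducerIdManager code):
{code:java}
// When no prefetched block is available, return immediately so the request
// handler thread is freed; the caller maps the empty result to a retriable error.
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicReference;

class NonBlockingProducerIdManager {
    private static class Block {
        final long firstId;
        final int size;
        int nextIndex = 0;
        Block(long firstId, int size) { this.firstId = firstId; this.size = size; }
    }

    private final AtomicReference<Block> current = new AtomicReference<>();

    synchronized OptionalLong generateProducerId() {
        Block block = current.get();
        if (block == null || block.nextIndex >= block.size) {
            requestNextBlockAsync();     // fire-and-forget AllocateProducerIds to the controller
            return OptionalLong.empty(); // caller translates this into a retriable error
        }
        return OptionalLong.of(block.firstId + block.nextIndex++);
    }

    private void requestNextBlockAsync() {
        // Send the request and install the new Block from the response callback.
    }
}
{code}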



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14831) Illegal state errors should be fatal in transactional producer

2023-03-21 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14831:
---

 Summary: Illegal state errors should be fatal in transactional 
producer
 Key: KAFKA-14831
 URL: https://issues.apache.org/jira/browse/KAFKA-14831
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


In KAFKA-14830, the producer hit an illegal state error. The error was 
propagated to the `Sender` thread and logged, but the producer otherwise 
continued on. It would be better to make illegal state errors fatal since 
continuing to write to transactions when the internal state is inconsistent may 
cause incorrect and unpredictable behavior.
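
For context, the standard transactional-producer loop already distinguishes fatal errors (close the producer) from abortable ones (abort and retry). A minimal sketch of that pattern, with illustrative topic and config names, shows where a fatal illegal-state error would land:
{code:java}
// Fatal errors close the producer; other KafkaExceptions abort the transaction.
// Making illegal state errors fatal routes them to the first branch instead of
// letting the producer continue with inconsistent internal state.
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.*;
import java.util.Properties;

public class TxnExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("transactional.id", "transactional-id");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("topic", "key", "value"));
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            producer.close();            // fatal: cannot recover, must create a new producer
        } catch (KafkaException e) {
            producer.abortTransaction(); // abortable: abort and retry the transaction
        }
        producer.close();
    }
}
{code}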



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14830) Illegal state error in transactional producer

2023-03-21 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14830:
---

 Summary: Illegal state error in transactional producer
 Key: KAFKA-14830
 URL: https://issues.apache.org/jira/browse/KAFKA-14830
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.1.2
Reporter: Jason Gustafson


We have seen the following illegal state error in the producer:
{code:java}
[Producer clientId=client-id2, transactionalId=transactional-id] Transiting to 
abortable error state due to org.apache.kafka.common.errors.TimeoutException: 
Expiring 1 record(s) for topic-0:120027 ms has passed since batch creation
[Producer clientId=client-id2, transactionalId=transactional-id] Transiting to 
abortable error state due to org.apache.kafka.common.errors.TimeoutException: 
Expiring 1 record(s) for topic-1:120026 ms has passed since batch creation
[Producer clientId=client-id2, transactionalId=transactional-id] Aborting 
incomplete transaction
[Producer clientId=client-id2, transactionalId=transactional-id] Invoking 
InitProducerId with current producer ID and epoch 
ProducerIdAndEpoch(producerId=191799, epoch=0) in order to bump the epoch
[Producer clientId=client-id2, transactionalId=transactional-id] ProducerId set 
to 191799 with epoch 1
[Producer clientId=client-id2, transactionalId=transactional-id] Transiting to 
abortable error state due to org.apache.kafka.common.errors.NetworkException: 
Disconnected from node 4
[Producer clientId=client-id2, transactionalId=transactional-id] Transiting to 
abortable error state due to org.apache.kafka.common.errors.TimeoutException: 
The request timed out.
[Producer clientId=client-id2, transactionalId=transactional-id] Uncaught error 
in request completion:
java.lang.IllegalStateException: TransactionalId transactional-id: Invalid 
transition attempted from state READY to state ABORTABLE_ERROR
        at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:1089)
        at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:508)
        at 
org.apache.kafka.clients.producer.internals.TransactionManager.maybeTransitionToErrorState(TransactionManager.java:734)
        at 
org.apache.kafka.clients.producer.internals.TransactionManager.handleFailedBatch(TransactionManager.java:739)
        at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:753)
        at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:743)
        at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:695)
        at 
org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:634)
        at 
org.apache.kafka.clients.producer.internals.Sender.lambda$null$1(Sender.java:575)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
        at 
org.apache.kafka.clients.producer.internals.Sender.lambda$handleProduceResponse$2(Sender.java:562)
        at java.base/java.lang.Iterable.forEach(Iterable.java:75)
        at 
org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:562)
        at 
org.apache.kafka.clients.producer.internals.Sender.lambda$sendProduceRequest$5(Sender.java:836)
        at 
org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
        at 
org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:583)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:575)
        at 
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:328)
        at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
        at java.base/java.lang.Thread.run(Thread.java:829)
 {code}
The producer hits timeouts which cause it to abort an active transaction. After 
aborting, the producer bumps its epoch, which transitions it back to the 
`READY` state. Following this, there are two errors for inflight requests, 
which cause an illegal state transition to `ABORTABLE_ERROR`. But how could the 
transaction ABORT complete if there were still inflight requests? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14664) Raft idle ratio is inaccurate

2023-02-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14664:

Fix Version/s: 3.5.0

> Raft idle ratio is inaccurate
> -
>
> Key: KAFKA-14664
> URL: https://issues.apache.org/jira/browse/KAFKA-14664
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.5.0
>
>
> The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
> thread is. When completely idle, it should measure 1. When saturated, it 
> should measure 0. The problem with the current measurements is that they are 
> treated equally with respect to time. For example, say we poll twice with the 
> following durations:
> Poll 1: 2s
> Poll 2: 0s
> Assume that the busy time is negligible, so 2s passes overall.
> In the first measurement, 2s is spent waiting, so we compute and record a 
> ratio of 1.0. In the second measurement, no time passes, and we record 0.0. 
> The idle ratio is then computed as the average of these two values ((1.0 + 0.0) 
> / 2 = 0.5), which suggests that the process was busy for 1s, which 
> overestimates the true busy time.
> Instead, we should sum up the time waiting over the full interval. 2s passes 
> total here and 2s is idle, so we should compute 1.0.
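
A tiny sketch of the difference between the two computations for the example above (plain arithmetic, not the actual raft metrics code):
{code:java}
public class IdleRatioExample {
    public static void main(String[] args) {
        double[] idleSeconds = {2.0, 0.0}; // Poll 1 waited 2s, Poll 2 waited 0s
        double totalSeconds = 2.0;         // busy time assumed negligible

        // Current behavior: unweighted average of the per-poll ratios.
        double perPollAverage = (1.0 + 0.0) / 2; // = 0.5, overestimates busy time

        // Proposed behavior: total idle time over the whole interval.
        double timeWeighted = (idleSeconds[0] + idleSeconds[1]) / totalSeconds; // = 1.0

        System.out.println(perPollAverage + " vs " + timeWeighted);
    }
}
{code}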



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14664) Raft idle ratio is inaccurate

2023-02-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14664.
-
Resolution: Fixed

> Raft idle ratio is inaccurate
> -
>
> Key: KAFKA-14664
> URL: https://issues.apache.org/jira/browse/KAFKA-14664
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.5.0
>
>
> The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
> thread is. When completely idle, it should measure 1. When saturated, it 
> should measure 0. The problem with the current measurements is that they are 
> treated equally with respect to time. For example, say we poll twice with the 
> following durations:
> Poll 1: 2s
> Poll 2: 0s
> Assume that the busy time is negligible, so 2s passes overall.
> In the first measurement, 2s is spent waiting, so we compute and record a 
> ratio of 1.0. In the second measurement, no time passes, and we record 0.0. 
> The idle ratio is then computed as the average of these two values ((1.0 + 0.0) 
> / 2 = 0.5), which suggests that the process was busy for 1s, which 
> overestimates the true busy time.
> Instead, we should sum up the time waiting over the full interval. 2s passes 
> total here and 2s is idle, so we should compute 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14664) Raft idle ratio is inaccurate

2023-02-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14664:

Affects Version/s: 3.3.2
   3.3.1
   3.4.0
   3.3.0

> Raft idle ratio is inaccurate
> -
>
> Key: KAFKA-14664
> URL: https://issues.apache.org/jira/browse/KAFKA-14664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.5.0
>
>
> The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
> thread is. When completely idle, it should measure 1. When saturated, it 
> should measure 0. The problem with the current measurements is that they are 
> treated equally with respect to time. For example, say we poll twice with the 
> following durations:
> Poll 1: 2s
> Poll 2: 0s
> Assume that the busy time is negligible, so 2s passes overall.
> In the first measurement, 2s is spent waiting, so we compute and record a 
> ratio of 1.0. In the second measurement, no time passes, and we record 0.0. 
> The idle ratio is then computed as the average of these two values ((1.0 + 0.0) 
> / 2 = 0.5), which suggests that the process was busy for 1s, which 
> overestimates the true busy time.
> Instead, we should sum up the time waiting over the full interval. 2s passes 
> total here and 2s is idle, so we should compute 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-6793) Unnecessary warning log message

2023-02-13 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-6793.

Fix Version/s: 3.5.0
   Resolution: Fixed

We resolved this by changing the logging to info level. It may still be useful 
in some cases, but most of the time the "warning" is spurious.

> Unnecessary warning log message 
> 
>
> Key: KAFKA-6793
> URL: https://issues.apache.org/jira/browse/KAFKA-6793
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.1.0
>Reporter: Anna O
>Assignee: Philip Nee
>Priority: Minor
> Fix For: 3.5.0
>
>
> When upgraded KafkaStreams from 0.11.0.2 to 1.1.0 the following warning log 
> started to appear:
> level: WARN
> logger: org.apache.kafka.clients.consumer.ConsumerConfig
> message: The configuration 'admin.retries' was supplied but isn't a known 
> config.
> The config is not explicitly supplied to the streams.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13972) Reassignment cancellation causes stray replicas

2023-02-06 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13972.
-
Resolution: Fixed

> Reassignment cancellation causes stray replicas
> ---
>
> Key: KAFKA-13972
> URL: https://issues.apache.org/jira/browse/KAFKA-13972
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.4.1
>
>
> A stray replica is one that is left behind on a broker after the partition 
> has been reassigned to other brokers or the partition has been deleted. We 
> found one case where this can happen is after a cancelled reassignment. When 
> a reassignment is cancelled, the controller sends `StopReplica` requests to 
> any of the adding replicas, but it does not necessarily bump the leader 
> epoch. Following 
> [KIP-570|https://cwiki.apache.org/confluence/display/KAFKA/KIP-570%3A+Add+leader+epoch+in+StopReplicaRequest],
>  brokers will ignore `StopReplica` requests if the leader epoch matches the 
> current partition leader epoch. So we need to bump the epoch whenever we need 
> to ensure that `StopReplica` will be received.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14672) Producer queue time does not reflect batches expired in the accumulator

2023-02-01 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17683183#comment-17683183
 ] 

Jason Gustafson commented on KAFKA-14672:
-

[~kirktrue] Nope. Please do.

> Producer queue time does not reflect batches expired in the accumulator
> ---
>
> Key: KAFKA-14672
> URL: https://issues.apache.org/jira/browse/KAFKA-14672
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Kirk True
>Priority: Major
>
> The producer exposes two metrics for the time a record has spent in the 
> accumulator waiting to be drained:
>  * `record-queue-time-avg`
>  * `record-queue-time-max`
> The metric is only updated when a batch is ready to send to a broker. It is 
> also possible for a batch to be expired before it can be sent, but in this 
> case, the metric is not updated. This seems surprising and makes the queue 
> time misleading. The only metric I could find that does reflect batch 
> expirations in the accumulator is the generic `record-error-rate`. It would 
> make sense to let the queue-time metrics record the time spent in the queue 
> regardless of the outcome of the record send attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14672) Producer queue time does not reflect batches expired in the accumulator

2023-02-01 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14672:

Description: 
The producer exposes two metrics for the time a record has spent in the 
accumulator waiting to be drained:
 * `record-queue-time-avg`
 * `record-queue-time-max`

The metric is only updated when a batch is ready to send to a broker. It is 
also possible for a batch to be expired before it can be sent, but in this 
case, the metric is not updated. This seems surprising and makes the queue time 
misleading. The only metric I could find that does reflect batch expirations in 
the accumulator is the generic `record-error-rate`. It would make sense to let 
the queue-time metrics record the time spent in the queue regardless of the 
outcome of the record send attempt.

  was:
The producer exposes two metrics for the time a record has spent in the 
accumulator waiting to be drained:
 * `record-queue-time-avg`
 * `record-queue-time-max`

The metric is only updated when a batch is drained for sending to the broker. 
It is also possible for a batch to be expired before it can be drained, but in 
this case, the metric is not updated. This seems surprising and makes the queue 
time misleading. The only metric I could find that does reflect batch 
expirations in the accumulator is the generic `record-error-rate`. It would 
make sense to let the queue-time metrics record the time spent in the queue 
regardless of the outcome of the record send attempt.


> Producer queue time does not reflect batches expired in the accumulator
> ---
>
> Key: KAFKA-14672
> URL: https://issues.apache.org/jira/browse/KAFKA-14672
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> The producer exposes two metrics for the time a record has spent in the 
> accumulator waiting to be drained:
>  * `record-queue-time-avg`
>  * `record-queue-time-max`
> The metric is only updated when a batch is ready to send to a broker. It is 
> also possible for a batch to be expired before it can be sent, but in this 
> case, the metric is not updated. This seems surprising and makes the queue 
> time misleading. The only metric I could find that does reflect batch 
> expirations in the accumulator is the generic `record-error-rate`. It would 
> make sense to let the queue-time metrics record the time spent in the queue 
> regardless of the outcome of the record send attempt.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14672) Producer queue time does not reflect batches expired in the accumulator

2023-02-01 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14672:
---

 Summary: Producer queue time does not reflect batches expired in 
the accumulator
 Key: KAFKA-14672
 URL: https://issues.apache.org/jira/browse/KAFKA-14672
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


The producer exposes two metrics for the time a record has spent in the 
accumulator waiting to be drained:
 * `record-queue-time-avg`
 * `record-queue-time-max`

The metric is only updated when a batch is drained for sending to the broker. 
It is also possible for a batch to be expired before it can be drained, but in 
this case, the metric is not updated. This seems surprising and makes the queue 
time misleading. The only metric I could find that does reflect batch 
expirations in the accumulator is the generic `record-error-rate`. It would 
make sense to let the queue-time metrics record the time spent in the queue 
regardless of the outcome of the record send attempt.
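
For reference, these metrics can be read from a running producer like so (a minimal sketch; the bootstrap address is illustrative and topic setup is omitted):
{code:java}
// Read record-queue-time-avg/max from the producer's metrics. Batches that
// expire in the accumulator currently do not contribute to these values.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import java.util.Map;
import java.util.Properties;

public class QueueTimeMetrics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                String name = entry.getKey().name();
                if (name.equals("record-queue-time-avg") || name.equals("record-queue-time-max")) {
                    System.out.println(name + " = " + entry.getValue().metricValue());
                }
            }
        }
    }
}
{code}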



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14664) Raft idle ratio is inaccurate

2023-01-31 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14664:

Description: 
The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
thread is. When completely idle, it should measure 1. When saturated, it should 
measure 0. The problem with the current measurements is that they are treated 
equally with respect to time. For example, say we poll twice with the following 
durations:

Poll 1: 2s

Poll 2: 0s

Assume that the busy time is negligible, so 2s passes overall.

In the first measurement, 2s is spent waiting, so we compute and record a ratio 
of 1.0. In the second measurement, no time passes, and we record 0.0. The idle 
ratio is then computed as the average of these two values ((1.0 + 0.0) / 2 = 
0.5), which suggests that the process was busy for 1s, which overestimates the 
true busy time.

Instead, we should sum up the time waiting over the full interval. 2s passes 
total here and 2s is idle, so we should compute 1.0.

  was:
The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
thread is. When completely idle, it should measure 1. When saturated, it should 
measure 0. The problem with the current measurements is that they are treated 
equally with respect to time. For example, say we poll twice with the following 
durations:

Poll 1: 2s

Poll 2: 0s

Assume that the busy time is negligible, so 2s passes overall.

In the first measurement, 2s is spent waiting, so we compute and record a ratio 
of 1.0. In the second measurement, no time passes, and we record 0.0. The idle 
ratio is then computed as the average of these two values ((1.0 + 0.0) / 2 = 
0.5), which suggests that the process was busy for 1s. 

Instead, we should sum up the time waiting over the full interval. 2s passes 
total here and 2s is idle, so we should compute 1.0.


> Raft idle ratio is inaccurate
> -
>
> Key: KAFKA-14664
> URL: https://issues.apache.org/jira/browse/KAFKA-14664
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>
> The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
> thread is. When completely idle, it should measure 1. When saturated, it 
> should measure 0. The problem with the current measurements is that they are 
> treated equally with respect to time. For example, say we poll twice with the 
> following durations:
> Poll 1: 2s
> Poll 2: 0s
> Assume that the busy time is negligible, so 2s passes overall.
> In the first measurement, 2s is spent waiting, so we compute and record a 
> ratio of 1.0. In the second measurement, no time passes, and we record 0.0. 
> The idle ratio is then computed as the average of these two values ((1.0 + 0.0) 
> / 2 = 0.5), which suggests that the process was busy for 1s, which 
> overestimates the true busy time.
> Instead, we should sum up the time waiting over the full interval. 2s passes 
> total here and 2s is idle, so we should compute 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14664) Raft idle ratio is inaccurate

2023-01-30 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14664:
---

 Summary: Raft idle ratio is inaccurate
 Key: KAFKA-14664
 URL: https://issues.apache.org/jira/browse/KAFKA-14664
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson


The `poll-idle-ratio-avg` metric is intended to track how idle the raft IO 
thread is. When completely idle, it should measure 1. When saturated, it should 
measure 0. The problem with the current measurements is that they are treated 
equally with respect to time. For example, say we poll twice with the following 
durations:

Poll 1: 2s

Poll 2: 0s

Assume that the busy time is negligible, so 2s passes overall.

In the first measurement, 2s is spent waiting, so we compute and record a ratio 
of 1.0. In the second measurement, no time passes, and we record 0.0. The idle 
ratio is then computed as the average of these two values ((1.0 + 0.0) / 2 = 
0.5), which suggests that the process was busy for 1s. 

Instead, we should sum up the time waiting over the full interval. 2s passes 
total here and 2s is idle, so we should compute 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14644) Process should stop after failure in raft IO thread

2023-01-25 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14644.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> Process should stop after failure in raft IO thread
> ---
>
> Key: KAFKA-14644
> URL: https://issues.apache.org/jira/browse/KAFKA-14644
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 3.5.0
>
>
> We have seen a few cases where an unexpected error in the Raft IO thread 
> causes the process to enter a zombie state where it is no longer 
> participating in the raft quorum. In this state, a controller can no longer 
> become leader or help in elections, and brokers can no longer update 
> metadata. It may be better to stop the process in this case since there is no 
> way to recover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14648) Do not fail clients if bootstrap servers is not immediately resolvable

2023-01-24 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14648:
---

 Summary: Do not fail clients if bootstrap servers is not 
immediately resolvable
 Key: KAFKA-14648
 URL: https://issues.apache.org/jira/browse/KAFKA-14648
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


In dynamic environments, such as system tests, there is sometimes a delay 
between when a client is initialized and when the configured bootstrap servers 
become available in DNS. Currently clients will fail immediately if none of the 
bootstrap servers can resolve. It would be more convenient for these 
environments to provide a grace period to give more time for initialization. 
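
Until such a grace period exists, callers typically work around this with their own retry loop. A rough sketch, assuming construction fails with a KafkaException when none of the bootstrap addresses resolve (retry bounds are illustrative):
{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.KafkaException;
import java.util.Properties;

public class BootstrapRetry {
    static KafkaProducer<String, String> createWithGracePeriod(Properties props, long graceMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + graceMs;
        while (true) {
            try {
                return new KafkaProducer<>(props);
            } catch (KafkaException e) {
                if (System.currentTimeMillis() >= deadline)
                    throw e;        // give up after the grace period
                Thread.sleep(1000); // wait for DNS to become resolvable
            }
        }
    }
}
{code}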



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14644) Process should stop after failure in raft IO thread

2023-01-20 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14644:
---

 Summary: Process should stop after failure in raft IO thread
 Key: KAFKA-14644
 URL: https://issues.apache.org/jira/browse/KAFKA-14644
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


We have seen a few cases where an unexpected error in the Raft IO thread causes 
the process to enter a zombie state where it is no longer participating in the 
raft quorum. In this state, a controller can no longer become leader or help in 
elections, and brokers can no longer update metadata. It may be better to stop 
the process in this case since there is no way to recover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13972) Reassignment cancellation causes stray replicas

2023-01-18 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678390#comment-17678390
 ] 

Jason Gustafson commented on KAFKA-13972:
-

The patch still needs to be picked into 3.4 after we have finished with 3.4.0.

> Reassignment cancellation causes stray replicas
> ---
>
> Key: KAFKA-13972
> URL: https://issues.apache.org/jira/browse/KAFKA-13972
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.4.1
>
>
> A stray replica is one that is left behind on a broker after the partition 
> has been reassigned to other brokers or the partition has been deleted. We 
> found one case where this can happen is after a cancelled reassignment. When 
> a reassignment is cancelled, the controller sends `StopReplica` requests to 
> any of the adding replicas, but it does not necessarily bump the leader 
> epoch. Following 
> [KIP-570|https://cwiki.apache.org/confluence/display/KAFKA/KIP-570%3A+Add+leader+epoch+in+StopReplicaRequest],
>  brokers will ignore `StopReplica` requests if the leader epoch matches the 
> current partition leader epoch. So we need to bump the epoch whenever we need 
> to ensure that `StopReplica` will be received.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13972) Reassignment cancellation causes stray replicas

2023-01-18 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-13972:

Fix Version/s: 3.4.1

> Reassignment cancellation causes stray replicas
> ---
>
> Key: KAFKA-13972
> URL: https://issues.apache.org/jira/browse/KAFKA-13972
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.4.1
>
>
> A stray replica is one that is left behind on a broker after the partition 
> has been reassigned to other brokers or the partition has been deleted. We 
> found one case where this can happen is after a cancelled reassignment. When 
> a reassignment is cancelled, the controller sends `StopReplica` requests to 
> any of the adding replicas, but it does not necessarily bump the leader 
> epoch. Following 
> [KIP-570|https://cwiki.apache.org/confluence/display/KAFKA/KIP-570%3A+Add+leader+epoch+in+StopReplicaRequest],
>  brokers will ignore `StopReplica` requests if the leader epoch matches the 
> current partition leader epoch. So we need to bump the epoch whenever we need 
> to ensure that `StopReplica` will be received.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14612) Topic config records written to log even when topic creation fails

2023-01-12 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14612.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> Topic config records written to log even when topic creation fails
> --
>
> Key: KAFKA-14612
> URL: https://issues.apache.org/jira/browse/KAFKA-14612
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft
>Reporter: Jason Gustafson
>Assignee: Andrew Grant
>Priority: Major
> Fix For: 3.4.0
>
>
> Config records are added when handling a `CreateTopics` request here: 
> [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L549.]
>  If the subsequent validations fail and the topic is not created, these 
> records will still be written to the log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14618) Off by one error in generated snapshot IDs causes misaligned fetching

2023-01-12 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14618:
---

 Summary: Off by one error in generated snapshot IDs causes 
misaligned fetching
 Key: KAFKA-14618
 URL: https://issues.apache.org/jira/browse/KAFKA-14618
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
 Fix For: 3.4.0


We implemented new snapshot generation logic here: 
[https://github.com/apache/kafka/pull/12983.] A few days prior to this patch 
getting merged, we had changed the `RaftClient` API to pass the _exclusive_ 
offset when generating snapshots instead of the inclusive offset: 
[https://github.com/apache/kafka/pull/12981.] Unfortunately, the new snapshot 
generation logic was not updated accordingly. The consequence of this is that 
the state on replicas can get out of sync. In the best case, the followers fail 
replication because the offset after loading a snapshot is no longer aligned on 
a batch boundary.
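
In other words, the end offset carried by a snapshot id is exclusive. A tiny sketch of the off-by-one (values are illustrative, not the actual RaftClient code):
{code:java}
public class SnapshotOffsetExample {
    public static void main(String[] args) {
        long lastIncludedOffset = 999L;                   // last record contained in the snapshot
        long exclusiveEndOffset = lastIncludedOffset + 1; // what the updated RaftClient API expects
        long inclusiveOffsetBug = lastIncludedOffset;     // passing this instead misaligns follower fetching by one
        System.out.println(exclusiveEndOffset + " vs " + inclusiveOffsetBug);
    }
}
{code}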



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14612) Topic config records written to log even when topic creation fails

2023-01-10 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14612:
---

 Summary: Topic config records written to log even when topic 
creation fails
 Key: KAFKA-14612
 URL: https://issues.apache.org/jira/browse/KAFKA-14612
 Project: Kafka
  Issue Type: Bug
  Components: kraft
Reporter: Jason Gustafson


Config records are added when handling a `CreateTopics` request here: 
[https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L549.]
 If the subsequent validations fail and the topic is not created, these records 
will still be written to the log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14417) Producer doesn't handle REQUEST_TIMED_OUT for InitProducerIdRequest, treats as fatal error

2022-12-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14417.
-
Fix Version/s: 4.0.0
   3.3.2
   Resolution: Fixed

> Producer doesn't handle REQUEST_TIMED_OUT for InitProducerIdRequest, treats 
> as fatal error
> --
>
> Key: KAFKA-14417
> URL: https://issues.apache.org/jira/browse/KAFKA-14417
> Project: Kafka
>  Issue Type: Task
>Affects Versions: 3.1.0, 3.0.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Blocker
> Fix For: 4.0.0, 3.3.2
>
>
> In TransactionManager we have a handler for InitProducerIdRequests 
> [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#LL1276C14-L1276C14]
> However, we have the potential to return a REQUEST_TIMED_OUT error in 
> RPCProducerIdManager when the BrokerToControllerChannel manager times out: 
> [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/core/src/main/scala/kafka/coordinator/transaction/ProducerIdManager.scala#L236]
>  
> or when the poll returns null: 
> [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/core/src/main/scala/kafka/coordinator/transaction/ProducerIdManager.scala#L170]
> Since REQUEST_TIMED_OUT is not handled by the producer, we treat it as a 
> fatal error and the producer fails. With idempotent producers now enabled by 
> default, this can cause more issues.
> See this stack trace from 3.0:
> {code:java}
> ERROR [Producer clientId=console-producer] Aborting producer batches due to 
> fatal error (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; The request timed out.
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1390)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1294)
>   at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>   at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:658)
>   at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:650)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:418)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:256)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems the commit that introduced the change was this one: 
> [https://github.com/apache/kafka/commit/72d108274c98dca44514007254552481c731c958]
>  so we are vulnerable when the server code is at IBP 3.0 and beyond.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14439) Specify returned errors for various APIs and versions

2022-12-02 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642726#comment-17642726
 ] 

Jason Gustafson edited comment on KAFKA-14439 at 12/2/22 11:52 PM:
---

Yeah, we've had so many compatibility breaks due to error code usage. Putting 
the errors into the spec would also enable better enforcement. One option could 
be something like this:
{code:java}
{ 
  "name": "ErrorCode", "type": "enum16", "versions": "0+",
  "about": "The error code,",
  "values": [
{"value": 0, "versions": "0+", "about": "The operation completed 
successfully"},
{"value": 1, "versions": "1+", "about": "The requested offset was out of 
range"},
    {"value": 4, "versions": "1+", "about": "Invalid fetch size"},
  ]
} {code}
Here "enum16" indicates a 2-byte enumeration where the values are provided in 
the `values` field. New values can only be added with a version bump.


was (Author: hachikuji):
Yeah, we've had so many compatibility breaks due to error code usage. Putting 
the errors into the spec would also enable better enforcement. One option could 
be something like this:
{code:java}
{ 
  "name": "ErrorCode", "type": "enum16", "versions": "0+",
  "about": "The error code,",
  "values": [
{"value": 0, "versions": "0+", "about": "The operation completed 
successfully"},
{"value": 1, "versions": "1+", "about": "The requested offset was out of 
range"},
  ]
} {code}
Here "enum16" indicates a 2-byte enumeration where the values are provided in 
the `values` field. New values can only be added with a version bump.

> Specify returned errors for various APIs and versions
> -
>
> Key: KAFKA-14439
> URL: https://issues.apache.org/jira/browse/KAFKA-14439
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
>
> Kafka is known for supporting various clients and being compatible across 
> different versions. But one thing that is a bit unclear is what errors each 
> response can send. 
> Knowing what errors can come from each version helps those who implement 
> clients have a more defined spec for what errors they need to handle. When 
> new errors are added, it is clearer to the clients that changes need to be 
> made.
> It also helps contributors get a better understanding about how clients are 
> expected to react and potentially find and prevent gaps like the one found in 
> https://issues.apache.org/jira/browse/KAFKA-14417
> I briefly synced offline with [~hachikuji] about this and he suggested maybe 
> adding values for the error codes in the schema definitions of APIs that 
> specify the error codes and what versions they are returned on. One idea was 
> creating some enum type to accomplish this. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14438) Stop supporting empty consumer groupId

2022-12-02 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14438:

Priority: Blocker  (was: Major)

> Stop supporting empty consumer groupId
> --
>
> Key: KAFKA-14438
> URL: https://issues.apache.org/jira/browse/KAFKA-14438
> Project: Kafka
>  Issue Type: Task
>  Components: consumer
>Reporter: Philip Nee
>Priority: Blocker
> Fix For: 4.0.0
>
>
> Currently, a warning message is logged upon using an empty consumer groupId. 
> In the next major release, we should drop support for the empty ("") consumer 
> groupId.
>  
> cc [~hachikuji] 
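A minimal sketch of what the eventual client-side behaviour could look like (hypothetical validation; the current client only logs a warning):

{code:java}
import org.apache.kafka.common.config.ConfigException;

class GroupIdValidation {
    static void validateGroupId(String groupId) {
        if (groupId != null && groupId.isEmpty()) {
            // Once support is dropped, fail fast instead of logging a warning.
            throw new ConfigException("group.id must be non-empty");
        }
    }
}
{code}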



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14439) Specify returned errors for various APIs and versions

2022-12-02 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642726#comment-17642726
 ] 

Jason Gustafson edited comment on KAFKA-14439 at 12/2/22 11:44 PM:
---

Yeah, we've had so many compatibility breaks due to error code usage. Putting 
the errors into the spec would also enable better enforcement. One option could 
be something like this:
{code:java}
{ 
  "name": "ErrorCode", "type": "enum16", "versions": "0+",
  "about": "The error code,",
  "values": [
{"value": 0, "versions": "0+", "about": "The operation completed 
successfully"},
{"value": 1, "versions": "1+", "about": "The requested offset was out of 
range"},
  ]
} {code}
Here "enum16" indicates a 2-byte enumeration where the values are provided in 
the `values` field. New values can only be added with a version bump.


was (Author: hachikuji):
Yeah, we've had so many compatibility breaks due to error code usage. Putting 
the errors into the spec would also enable better enforcement. One option could 
be something like this:
{code:java}
{ 
  "name": "ErrorCode", "type": "enum16", "versions": "0+",
  "about": "The error code,",
  "values": [
{"value": 0, "versions": "0+", "about": "The operation completed 
successfully"},
{"value": 1, "versions": "1+", about": "The requested offset was out of 
range"},
  ]
} {code}
Here "enum16" indicates a 2-byte enumeration where the values are provided in 
the `values` field. New values can only be added with a version bump.

> Specify returned errors for various APIs and versions
> -
>
> Key: KAFKA-14439
> URL: https://issues.apache.org/jira/browse/KAFKA-14439
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
>
> Kafka is known for supporting various clients and being compatible across 
> different versions. But one thing that is a bit unclear is what errors each 
> response can send. 
> Knowing what errors can come from each version helps those who implement 
> clients have a more defined spec for what errors they need to handle. When 
> new errors are added, it is clearer to the clients that changes need to be 
> made.
> It also helps contributors get a better understanding about how clients are 
> expected to react and potentially find and prevent gaps like the one found in 
> https://issues.apache.org/jira/browse/KAFKA-14417
> I briefly synced offline with [~hachikuji] about this and he suggested maybe 
> adding values for the error codes in the schema definitions of APIs that 
> specify the error codes and what versions they are returned on. One idea was 
> creating some enum type to accomplish this. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14439) Specify returned errors for various APIs and versions

2022-12-02 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642726#comment-17642726
 ] 

Jason Gustafson commented on KAFKA-14439:
-

Yeah, we've had so many compatibility breaks due to error code usage. Putting 
the errors into the spec would also enable better enforcement. One option could 
be something like this:
{code:java}
{ 
  "name": "ErrorCode", "type": "enum16", "versions": "0+",
  "about": "The error code,",
  "values": [
{"value": 0, "versions": "0+", "about": "The operation completed 
successfully"},
{"value": 1, "versions": "1+", about": "The requested offset was out of 
range"},
  ]
} {code}
Here "enum16" indicates a 2-byte enumeration where the values are provided in 
the `values` field. New values can only be added with a version bump.

> Specify returned errors for various APIs and versions
> -
>
> Key: KAFKA-14439
> URL: https://issues.apache.org/jira/browse/KAFKA-14439
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
>
> Kafka is known for supporting various clients and being compatible across 
> different versions. But one thing that is a bit unclear is what errors each 
> response can send. 
> Knowing what errors can come from each version helps those who implement 
> clients have a more defined spec for what errors they need to handle. When 
> new errors are added, it is clearer to the clients that changes need to be 
> made.
> It also helps contributors get a better understanding about how clients are 
> expected to react and potentially find and prevent gaps like the one found in 
> https://issues.apache.org/jira/browse/KAFKA-14417
> I briefly synced offline with [~hachikuji] about this and he suggested maybe 
> adding values for the error codes in the schema definitions of APIs that 
> specify the error codes and what versions they are returned on. One idea was 
> creating some enum type to accomplish this. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14379) consumer should refresh preferred read replica on update metadata

2022-12-02 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14379:

Priority: Critical  (was: Major)

> consumer should refresh preferred read replica on update metadata
> -
>
> Key: KAFKA-14379
> URL: https://issues.apache.org/jira/browse/KAFKA-14379
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jeff Kim
>Assignee: Jeff Kim
>Priority: Critical
> Fix For: 3.4.0
>
>
> The consumer (fetcher) refreshes the preferred read replica only on three 
> conditions:
>  # the consumer receives an OFFSET_OUT_OF_RANGE error
>  # the follower does not exist in the client's metadata (i.e., offline)
>  # after metadata.max.age.ms (5 min default)
> For other errors, it will continue to reach out to the possibly unavailable 
> follower, and only after 5 minutes will it refresh the preferred read replica 
> and go back to the leader.
> A specific example is when a partition is reassigned. The consumer will get 
> NOT_LEADER_OR_FOLLOWER, which triggers a metadata update, but the preferred 
> read replica will not be refreshed as the follower is still online. It will 
> continue to reach out to the old follower until the preferred read replica 
> expires.
> The consumer can instead refresh its preferred read replica whenever it makes 
> a metadata update request, so when the consumer receives e.g. 
> NOT_LEADER_OR_FOLLOWER it can find the new preferred read replica without 
> waiting for the expiration.
>  
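A hedged sketch of the idea (hypothetical structure, not the actual Fetcher code): drop the cached preferred read replica whenever metadata is refreshed so the next fetch re-evaluates the leader's preference instead of waiting for expiration.

{code:java}
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.TopicPartition;

class PreferredReplicaCache {
    private final Map<TopicPartition, Integer> cache = new ConcurrentHashMap<>();

    Optional<Integer> preferredReadReplica(TopicPartition tp) {
        return Optional.ofNullable(cache.get(tp));
    }

    void onMetadataUpdate() {
        // Hypothetical hook invoked after each metadata response: clearing the
        // cache forces re-selection of the read replica on the next fetch.
        cache.clear();
    }
}
{code}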



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14379) consumer should refresh preferred read replica on update metadata

2022-12-02 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14379:

Fix Version/s: 3.4.0

> consumer should refresh preferred read replica on update metadata
> -
>
> Key: KAFKA-14379
> URL: https://issues.apache.org/jira/browse/KAFKA-14379
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jeff Kim
>Assignee: Jeff Kim
>Priority: Major
> Fix For: 3.4.0
>
>
> The consumer (fetcher) refreshes the preferred read replica only on three 
> conditions:
>  # the consumer receives an OFFSET_OUT_OF_RANGE error
>  # the follower does not exist in the client's metadata (i.e., offline)
>  # after metadata.max.age.ms (5 min default)
> For other errors, it will continue to reach out to the possibly unavailable 
> follower, and only after 5 minutes will it refresh the preferred read replica 
> and go back to the leader.
> A specific example is when a partition is reassigned. The consumer will get 
> NOT_LEADER_OR_FOLLOWER, which triggers a metadata update, but the preferred 
> read replica will not be refreshed as the follower is still online. It will 
> continue to reach out to the old follower until the preferred read replica 
> expires.
> The consumer can instead refresh its preferred read replica whenever it makes 
> a metadata update request, so when the consumer receives e.g. 
> NOT_LEADER_OR_FOLLOWER it can find the new preferred read replica without 
> waiting for the expiration.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14397) Idempotent producer may bump epoch and reset sequence numbers prematurely

2022-11-16 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14397:

Description: 
Suppose that idempotence is enabled in the producer and we send the following 
single-record batches to a partition leader:
 * A: epoch=0, seq=0
 * B: epoch=0, seq=1
 * C: epoch=0, seq=2

The partition leader receives all 3 of these batches and commits them to the 
log. However, the connection is lost before the `Produce` responses are 
received by the client. Subsequent retries by the producer all fail to be 
delivered.

It is possible in this scenario for the first batch `A` to reach the delivery 
timeout before the subsequent batches. This triggers the following check: 
[https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642].
 Depending on whether retries are exhausted, we may adjust sequence numbers.

The intuition behind this check is that if retries have not been exhausted, 
then we saw a fatal error and the batch could not have been written to the log. 
Hence we should bump the epoch and adjust the sequence numbers of the pending 
batches since they are presumed to be doomed to failure. So in this case, 
batches B and C might get reset with the bumped epoch:
 * B: epoch=1, seq=0
 * C: epoch=1, seq=1

If the producer is able to reach the partition leader before these batches are 
expired locally, then they may get written and committed to the log. This can 
result in duplicates.

The root of the issue is that this logic does not account for expiration of the 
delivery timeout. When the delivery timeout is reached, the number of retries 
is still likely much lower than the max allowed number of retries (which is 
`Int.MaxValue` by default).

  was:
Suppose that idempotence is enabled in the producer and we send the following 
single-record batches to a partition leader:
 * A: epoch=0, seq=0
 * B: epoch=0, seq=1
 * C: epoch=0, seq=2

The partition leader receives all 3 of these batches and commits them to the 
log. However, the connection is lost before the `Produce` responses are 
received by the client. Subsequent retries by the producer all fail to be 
delivered.

It is possible in this scenario for the first batch `A` to reach the delivery 
timeout before the subsequent batches. This triggers the following check: 
[https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642].
 Depending on whether retries are exhausted, we may adjust sequence numbers.

The intuition behind this check is that if retries have not been exhausted, 
then we saw a fatal error and the batch could not have been written to the log. 
Hence we should bump the epoch and adjust the sequence numbers of the pending 
batches since they are presumed to be doomed to failure. So in this case, 
batches B and C might get reset with the bumped epoch:
 * B: epoch=1, seq=0
 * C: epoch=1, seq=1

This can result in duplicate records in the log.

The root of the issue is that this logic does not account for expiration of the 
delivery timeout. When the delivery timeout is reached, the number of retries 
is still likely much lower than the max allowed number of retries (which is 
`Int.MaxValue` by default).


> Idempotent producer may bump epoch and reset sequence numbers prematurely
> -
>
> Key: KAFKA-14397
> URL: https://issues.apache.org/jira/browse/KAFKA-14397
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>
> Suppose that idempotence is enabled in the producer and we send the following 
> single-record batches to a partition leader:
>  * A: epoch=0, seq=0
>  * B: epoch=0, seq=1
>  * C: epoch=0, seq=2
> The partition leader receives all 3 of these batches and commits them to the 
> log. However, the connection is lost before the `Produce` responses are 
> received by the client. Subsequent retries by the producer all fail to be 
> delivered.
> It is possible in this scenario for the first batch `A` to reach the delivery 
> timeout before the subsequent batches. This triggers the following check: 
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642].
>  Depending on whether retries are exhausted, we may adjust sequence numbers.
> The intuition behind this check is that if retries have not been exhausted, 
> then we saw a fatal error and the batch could not have been written to the 
> log. Hence we should bump the epoch and adjust the sequence numbers of the 
> pending batches since they are presumed to be doomed to failure. So in this 
> case, batches B and C might get reset with the bumped epoch:
>  

[jira] [Created] (KAFKA-14397) Idempotent producer may bump epoch and reset sequence numbers prematurely

2022-11-16 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14397:
---

 Summary: Idempotent producer may bump epoch and reset sequence 
numbers prematurely
 Key: KAFKA-14397
 URL: https://issues.apache.org/jira/browse/KAFKA-14397
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson


Suppose that idempotence is enabled in the producer and we send the following 
single-record batches to a partition leader:
 * A: epoch=0, seq=0
 * B: epoch=0, seq=1
 * C: epoch=0, seq=2

The partition leader receives all 3 of these batches and commits them to the 
log. However, the connection is lost before the `Produce` responses are 
received by the client. Subsequent retries by the producer all fail to be 
delivered.

It is possible in this scenario for the first batch `A` to reach the delivery 
timeout before the subsequent batches. This triggers the following check: 
[https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L642].
 Depending on whether retries are exhausted, we may adjust sequence numbers.

The intuition behind this check is that if retries have not been exhausted, 
then we saw a fatal error and the batch could not have been written to the log. 
Hence we should bump the epoch and adjust the sequence numbers of the pending 
batches since they are presumed to be doomed to failure. So in this case, 
batches B and C might get reset with the bumped epoch:
 * B: epoch=1, seq=0
 * C: epoch=1, seq=1

This can result in duplicate records in the log.

The root of the issue is that this logic does not account for expiration of the 
delivery timeout. When the delivery timeout is reached, the number of retries 
is still likely much lower than the max allowed number of retries (which is 
`Int.MaxValue` by default).
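A hedged sketch of the missing condition (hypothetical predicate, not the Sender's actual logic): today the epoch is bumped whenever retries were not exhausted, but a batch that expired due to delivery.timeout.ms also has retries left while possibly already sitting in the log, so it should be excluded.

{code:java}
// Hypothetical helper: only reset sequences / bump the epoch when the broker
// could not have appended the failed batch.
class EpochBumpDecision {
    static boolean safeToResetSequences(boolean retriesExhausted, boolean deliveryTimedOut) {
        // Retries not exhausted => a fatal error was seen, so the batch was
        // never written. A delivery timeout proves nothing about the broker.
        return !retriesExhausted && !deliveryTimedOut;
    }
}
{code}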



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13964) kafka-configs.sh end with UnsupportedVersionException when describing TLS user with quotas

2022-11-09 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13964.
-
Resolution: Duplicate

Thanks for reporting the issue. This will be resolved by 
https://issues.apache.org/jira/browse/KAFKA-14084.

> kafka-configs.sh end with UnsupportedVersionException when describing TLS 
> user with quotas 
> ---
>
> Key: KAFKA-13964
> URL: https://issues.apache.org/jira/browse/KAFKA-13964
> Project: Kafka
>  Issue Type: Bug
>  Components: admin, kraft
>Affects Versions: 3.2.0
> Environment: Kafka 3.2.0 running on OpenShift 4.10 in KRaft mode 
> managed by Strimzi
>Reporter: Jakub Stejskal
>Priority: Minor
>
> Usage of kafka-configs.sh ends with 
> org.apache.kafka.common.errors.UnsupportedVersionException: The broker does 
> not support DESCRIBE_USER_SCRAM_CREDENTIALS when describing a TLS user with 
> quotas enabled.
>  
> {code:java}
> bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe --user 
> CN=encrypted-arnost` got status code 1 and stderr: -- Error while 
> executing config command with args '--bootstrap-server localhost:9092 
> --describe --user CN=encrypted-arnost' 
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.UnsupportedVersionException: The broker does 
> not support DESCRIBE_USER_SCRAM_CREDENTIALS{code}
> STDOUT contains all necessary data, but the script itself ends with return 
> code 1 and the error above. Scram-sha has not been configured anywhere in 
> that case (not supported by KRaft). This might be fixed by adding support for 
> scram-sha in the next version (not reproducible without KRaft enabled).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14316) NoSuchElementException in feature control iterator

2022-10-18 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14316.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> NoSuchElementException in feature control iterator
> --
>
> Key: KAFKA-14316
> URL: https://issues.apache.org/jira/browse/KAFKA-14316
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> We noticed this exception during testing:
> {code:java}
> java.util.NoSuchElementException
> at 
> org.apache.kafka.timeline.SnapshottableHashTable$HistoricalIterator.next(SnapshottableHashTable.java:276)
> at 
> org.apache.kafka.timeline.SnapshottableHashTable$HistoricalIterator.next(SnapshottableHashTable.java:189)
> at 
> org.apache.kafka.timeline.TimelineHashMap$EntryIterator.next(TimelineHashMap.java:360)
> at 
> org.apache.kafka.timeline.TimelineHashMap$EntryIterator.next(TimelineHashMap.java:346)
> at 
> org.apache.kafka.controller.FeatureControlManager$FeatureControlIterator.next(FeatureControlManager.java:375)
> at 
> org.apache.kafka.controller.FeatureControlManager$FeatureControlIterator.next(FeatureControlManager.java:347)
> at 
> org.apache.kafka.controller.SnapshotGenerator.generateBatch(SnapshotGenerator.java:109)
> at 
> org.apache.kafka.controller.SnapshotGenerator.generateBatches(SnapshotGenerator.java:126)
> at 
> org.apache.kafka.controller.QuorumController$SnapshotGeneratorManager.run(QuorumController.java:637)
> at 
> org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:539)
> at 
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
> at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
> at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
> at java.base/java.lang.Thread.run(Thread.java:833)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64) 
> {code}
> The iterator `FeatureControlIterator.hasNext()` checks two conditions: 1) 
> whether we have already written the metadata version, and 2) whether the 
> underlying iterator has additional records. However, in `next()`, we also 
> check that the metadata version is at least high enough to include it in the 
> log. When this fails, then we can see an unexpected `NoSuchElementException` 
> if the underlying iterator is empty.
>  
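A hedged, simplified sketch of the mismatch (hypothetical code, not the real FeatureControlManager): hasNext() must apply the same metadata-version gate that next() applies, otherwise next() can be entered with nothing left to return.

{code:java}
import java.util.Iterator;

class FeatureIterator implements Iterator<String> {
    private final Iterator<String> underlying;
    private final boolean versionAtLeastMinimum;  // assumption: precomputed gate
    private boolean wroteVersion;

    FeatureIterator(Iterator<String> underlying, boolean versionAtLeastMinimum) {
        this.underlying = underlying;
        this.versionAtLeastMinimum = versionAtLeastMinimum;
    }

    @Override
    public boolean hasNext() {
        // Include the version gate here as well, so hasNext() and next() agree.
        return (!wroteVersion && versionAtLeastMinimum) || underlying.hasNext();
    }

    @Override
    public String next() {
        if (!wroteVersion && versionAtLeastMinimum) {
            wroteVersion = true;
            return "metadata.version";
        }
        return underlying.next();
    }
}
{code}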



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14319) Storage tool format command does not work with old metadata versions

2022-10-18 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14319:
---

 Summary: Storage tool format command does not work with old 
metadata versions
 Key: KAFKA-14319
 URL: https://issues.apache.org/jira/browse/KAFKA-14319
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


When using the format tool with older metadata versions, we see the following 
error:
{code:java}
$ bin/kafka-storage.sh format --cluster-id vVlhxM7VT3C-3nz7yEkiCQ --config 
config/kraft/server.properties --release-version "3.1-IV0" 
Exception in thread "main" java.lang.RuntimeException: Bootstrap metadata 
versions before 3.3-IV0 are not supported. Can't load metadata from format 
command
        at 
org.apache.kafka.metadata.bootstrap.BootstrapMetadata.<init>(BootstrapMetadata.java:83)
        at 
org.apache.kafka.metadata.bootstrap.BootstrapMetadata.fromVersion(BootstrapMetadata.java:48)
        at 
kafka.tools.StorageTool$.$anonfun$formatCommand$2(StorageTool.scala:265)
        at 
kafka.tools.StorageTool$.$anonfun$formatCommand$2$adapted(StorageTool.scala:254)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at kafka.tools.StorageTool$.formatCommand(StorageTool.scala:254)
        at kafka.tools.StorageTool$.main(StorageTool.scala:61)
        at kafka.tools.StorageTool.main(StorageTool.scala) {code}
For versions prior to `3.3-IV0`, we should skip creation of the 
`bootstrap.checkpoint` file instead of failing.
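A minimal sketch of the proposed behaviour (hypothetical helper; the metadata-version threshold is a parameter here, not a real constant):

{code:java}
class BootstrapCheckpointWriter {
    /**
     * Only write bootstrap.checkpoint when the requested metadata version
     * supports it; for older versions the file is simply skipped rather than
     * failing the format command.
     */
    static void maybeWriteBootstrapCheckpoint(int metadataVersionLevel,
                                              int firstLevelWithBootstrap, // assumption: the level of 3.3-IV0
                                              Runnable writeCheckpoint) {
        if (metadataVersionLevel >= firstLevelWithBootstrap) {
            writeCheckpoint.run();
        }
    }
}
{code}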



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14316) NoSuchElementException in feature control iterator

2022-10-18 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14316:
---

 Summary: NoSuchElementException in feature control iterator
 Key: KAFKA-14316
 URL: https://issues.apache.org/jira/browse/KAFKA-14316
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Jason Gustafson
Assignee: Jason Gustafson


We noticed this exception during testing:
{code:java}
java.util.NoSuchElementException
at 
org.apache.kafka.timeline.SnapshottableHashTable$HistoricalIterator.next(SnapshottableHashTable.java:276)
at 
org.apache.kafka.timeline.SnapshottableHashTable$HistoricalIterator.next(SnapshottableHashTable.java:189)
at 
org.apache.kafka.timeline.TimelineHashMap$EntryIterator.next(TimelineHashMap.java:360)
at 
org.apache.kafka.timeline.TimelineHashMap$EntryIterator.next(TimelineHashMap.java:346)
at 
org.apache.kafka.controller.FeatureControlManager$FeatureControlIterator.next(FeatureControlManager.java:375)
at 
org.apache.kafka.controller.FeatureControlManager$FeatureControlIterator.next(FeatureControlManager.java:347)
at 
org.apache.kafka.controller.SnapshotGenerator.generateBatch(SnapshotGenerator.java:109)
at 
org.apache.kafka.controller.SnapshotGenerator.generateBatches(SnapshotGenerator.java:126)
at 
org.apache.kafka.controller.QuorumController$SnapshotGeneratorManager.run(QuorumController.java:637)
at 
org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:539)
at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
at java.base/java.lang.Thread.run(Thread.java:833)
at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64) 
{code}
The iterator `FeatureControlIterator.hasNext()` checks two conditions: 1) 
whether we have already written the metadata version, and 2) whether the 
underlying iterator has additional records. However, in `next()`, we also check 
that the metadata version is at least high enough to include it in the log. 
When this fails, then we can see an unexpected `NoSuchElementException` if the 
underlying iterator is empty.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14296) Partition leaders are not demoted during kraft controlled shutdown

2022-10-13 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14296.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> Partition leaders are not demoted during kraft controlled shutdown
> --
>
> Key: KAFKA-14296
> URL: https://issues.apache.org/jira/browse/KAFKA-14296
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: David Jacot
>Assignee: David Jacot
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> When the BrokerServer starts its shutting down process, it transitions to 
> SHUTTING_DOWN and sets isShuttingDown to true. With this state change, the 
> follower state changes are short-cutted. This means that a broker which was 
> serving as leader would remain acting as a leader until controlled shutdown 
> completes. Instead, we want the leader and ISR state to be updated so that 
> requests will return NOT_LEADER and the client can find the new leader.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14292) KRaft broker controlled shutdown can be delayed indefinitely

2022-10-13 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14292.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> KRaft broker controlled shutdown can be delayed indefinitely
> 
>
> Key: KAFKA-14292
> URL: https://issues.apache.org/jira/browse/KAFKA-14292
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Alyssa Huang
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> We noticed when rolling a kraft cluster that it took an unexpectedly long 
> time for one of the brokers to shutdown. In the logs, we saw the following:
> {code:java}
> Oct 11, 2022 @ 17:53:38.277   [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283357 is 
> not greater than the broker's shutdown offset 2283358. 
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:38.277  [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283362.  
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:40.278  [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283366.  
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:40.278  [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283361 is 
> not greater than the broker's shutdown offset 2283362. 
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:42.279  [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283365 is 
> not greater than the broker's shutdown offset 2283366. 
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:42.279  [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283370.  
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:44.280  [Controller 1] The request from broker 8 to 
> shut down can not yet be granted because the lowest active offset 2283369 is 
> not greater than the broker's shutdown offset 2283370. 
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
> Oct 11, 2022 @ 17:53:44.281  [Controller 1] Updated the controlled shutdown 
> offset for broker 8 to 2283374.  
> org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG{code}
> From what I can tell, it looks like the controller waits until all brokers 
> have caught up to the {{controlledShutdownOffset}} of the broker that is 
> shutting down before allowing it to proceed. Probably the intent is to make 
> sure they have all the leader and ISR state.
> The problem is that the {{controlledShutdownOffset}} seems to be updated 
> after every heartbeat that the controller receives: 
> https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996.
>  Unless all other brokers can catch up to that offset before the next 
> heartbeat from the shutting down broker is received, then the broker remains 
> in the shutting down state indefinitely.
> In this case, it took more than 40 minutes before the broker completed 
> shutdown:
> {code:java}
> Oct 11, 2022 @ 18:36:36.105  [Controller 1] The request from broker 8 to 
> shut down has been granted since the lowest active offset 2288510 is now 
> greater than the broker's controlled shutdown offset 2288510.  
> org.apache.kafka.controller.BrokerHeartbeatManager  INFO
> Oct 11, 2022 @ 18:40:35.197  [Controller 1] The request from broker 8 to 
> unfence has been granted because it has caught up with the offset of it's 
> register broker record 2288906.   
> org.apache.kafka.controller.BrokerHeartbeatManager  INFO{code}
> It seems like the bug here is that we should not keep updating 
> {{controlledShutdownOffset}} if it has already been set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14292) KRaft broker controlled shutdown can be delayed indefinitely

2022-10-11 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14292:
---

 Summary: KRaft broker controlled shutdown can be delayed 
indefinitely
 Key: KAFKA-14292
 URL: https://issues.apache.org/jira/browse/KAFKA-14292
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Alyssa Huang


We noticed when rolling a kraft cluster that it took an unexpectedly long time 
for one of the brokers to shutdown. In the logs, we saw the following:
{code:java}
Oct 11, 2022 @ 17:53:38.277 [Controller 1] The request from broker 8 to 
shut down can not yet be granted because the lowest active offset 2283357 is 
not greater than the broker's shutdown offset 2283358. 
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:38.277  [Controller 1] Updated the controlled shutdown 
offset for broker 8 to 2283362.  
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:40.278  [Controller 1] Updated the controlled shutdown 
offset for broker 8 to 2283366.  
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:40.278  [Controller 1] The request from broker 8 to 
shut down can not yet be granted because the lowest active offset 2283361 is 
not greater than the broker's shutdown offset 2283362. 
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:42.279  [Controller 1] The request from broker 8 to 
shut down can not yet be granted because the lowest active offset 2283365 is 
not greater than the broker's shutdown offset 2283366. 
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:42.279  [Controller 1] Updated the controlled shutdown 
offset for broker 8 to 2283370.  
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:44.280  [Controller 1] The request from broker 8 to 
shut down can not yet be granted because the lowest active offset 2283369 is 
not greater than the broker's shutdown offset 2283370. 
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG   
Oct 11, 2022 @ 17:53:44.281  [Controller 1] Updated the controlled shutdown 
offset for broker 8 to 2283374.  
org.apache.kafka.controller.BrokerHeartbeatManager  DEBUG{code}
From what I can tell, it looks like the controller waits until all brokers 
have caught up to the {{controlledShutdownOffset}} of the broker that is 
shutting down before allowing it to proceed. Probably the intent is to make 
sure they have all the leader and ISR state.

The problem is that the {{controlledShutdownOffset}} seems to be updated after 
every heartbeat that the controller receives: 
https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996.
 Unless all other brokers can catch up to that offset before the next heartbeat 
from the shutting down broker is received, then the broker remains in the 
shutting down state indefinitely.

In this case, it took more than 40 minutes before the broker completed shutdown:
{code:java}
Oct 11, 2022 @ 18:36:36.105  [Controller 1] The request from broker 8 to 
shut down has been granted since the lowest active offset 2288510 is now 
greater than the broker's controlled shutdown offset 2288510.  
org.apache.kafka.controller.BrokerHeartbeatManager  INFO
Oct 11, 2022 @ 18:40:35.197  [Controller 1] The request from broker 8 to 
unfence has been granted because it has caught up with the offset of it's 
register broker record 2288906.   
org.apache.kafka.controller.BrokerHeartbeatManager  INFO{code}
It seems like the bug here is that we should not keep updating 
{{controlledShutdownOffset}} if it has already been set.
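A hedged sketch of that suggestion (hypothetical tracker, not the controller's BrokerHeartbeatManager): record the shutdown offset once so the catch-up target stops moving on every heartbeat.

{code:java}
import java.util.OptionalLong;

class ShutdownOffsetTracker {
    private OptionalLong controlledShutdownOffset = OptionalLong.empty();

    void onShutdownHeartbeat(long currentEndOffset) {
        if (!controlledShutdownOffset.isPresent()) {
            // Set once; later heartbeats keep the original target.
            controlledShutdownOffset = OptionalLong.of(currentEndOffset);
        }
    }

    boolean mayShutDown(long lowestActiveOffset) {
        return controlledShutdownOffset.isPresent()
            && lowestActiveOffset >= controlledShutdownOffset.getAsLong();
    }
}
{code}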



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14247) Implement EventHandler interface and DefaultEventHandler

2022-10-03 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14247.
-
Resolution: Fixed

> Implement EventHandler interface and DefaultEventHandler
> 
>
> Key: KAFKA-14247
> URL: https://issues.apache.org/jira/browse/KAFKA-14247
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>
> The polling thread uses events to communicate with the background thread.  
> The events sent to the background thread are the {_}Requests{_}, and the 
> events sent from the background thread to the polling thread are the 
> {_}Responses{_}.
>  
> Here we have an EventHandler interface and a DefaultEventHandler 
> implementation.  The implementation uses two blocking queues to send events 
> both ways.  The two methods, add and poll, allow the client, i.e., the 
> polling thread, to add events to the handler and retrieve responses from it.
>  
> PR: https://github.com/apache/kafka/pull/12663
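A rough sketch of the shape described above (names and generics are assumptions; see the linked PR for the real classes):

{code:java}
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

interface EventHandler<REQ, RESP> {
    boolean add(REQ request);   // called by the polling thread
    Optional<RESP> poll();      // non-blocking read of background responses
}

class DefaultEventHandler<REQ, RESP> implements EventHandler<REQ, RESP> {
    // The background thread drains 'requests' and fills 'responses' (not shown).
    private final BlockingQueue<REQ> requests = new LinkedBlockingQueue<>();
    private final BlockingQueue<RESP> responses = new LinkedBlockingQueue<>();

    @Override
    public boolean add(REQ request) {
        return requests.offer(request);
    }

    @Override
    public Optional<RESP> poll() {
        return Optional.ofNullable(responses.poll());
    }
}
{code}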



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-10140) Incremental config api excludes plugin config changes

2022-09-26 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-10140:

Description: 
I was trying to alter the jmx metric filters using the incremental alter config 
api and hit this error:
{code:java}
java.util.NoSuchElementException: key not found: metrics.jmx.blacklist
at scala.collection.MapLike.default(MapLike.scala:235)
at scala.collection.MapLike.default$(MapLike.scala:234)
at scala.collection.AbstractMap.default(Map.scala:65)
at scala.collection.MapLike.apply(MapLike.scala:144)
at scala.collection.MapLike.apply$(MapLike.scala:143)
at scala.collection.AbstractMap.apply(Map.scala:65)
at kafka.server.AdminManager.listType$1(AdminManager.scala:681)
at 
kafka.server.AdminManager.$anonfun$prepareIncrementalConfigs$1(AdminManager.scala:693)
at kafka.server.AdminManager.prepareIncrementalConfigs(AdminManager.scala:687)
at 
kafka.server.AdminManager.$anonfun$incrementalAlterConfigs$1(AdminManager.scala:618)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:154)
at scala.collection.TraversableLike.map(TraversableLike.scala:273)
at scala.collection.TraversableLike.map$(TraversableLike.scala:266)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at kafka.server.AdminManager.incrementalAlterConfigs(AdminManager.scala:589)
at 
kafka.server.KafkaApis.handleIncrementalAlterConfigsRequest(KafkaApis.scala:2698)
at kafka.server.KafkaApis.handle(KafkaApis.scala:188)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:78)
at java.base/java.lang.Thread.run(Thread.java:834) {code}
 

It looks like we are only allowing changes to the keys defined in `KafkaConfig` 
through this API. This excludes config changes to any plugin components such as 
`JmxReporter`.

Note that I was able to use the regular `alterConfig` API to change this config.
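For context, the failing path can be reached with a plain Admin client call like the following (the config value here is only an illustration):

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class IncrementalAlterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            // Any key not defined in KafkaConfig (such as a reporter config)
            // hits the NoSuchElementException shown above.
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("metrics.jmx.blacklist", "kafka.server:*"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(broker, Collections.singletonList(op)))
                 .all().get();
        }
    }
}
{code}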

  was:
I was trying to alter the jmx metric filters using the incremental alter config 
api and hit this error:
```
java.util.NoSuchElementException: key not found: metrics.jmx.blacklist
at scala.collection.MapLike.default(MapLike.scala:235)
at scala.collection.MapLike.default$(MapLike.scala:234)
at scala.collection.AbstractMap.default(Map.scala:65)
at scala.collection.MapLike.apply(MapLike.scala:144)
at scala.collection.MapLike.apply$(MapLike.scala:143)
at scala.collection.AbstractMap.apply(Map.scala:65)
at kafka.server.AdminManager.listType$1(AdminManager.scala:681)
at 
kafka.server.AdminManager.$anonfun$prepareIncrementalConfigs$1(AdminManager.scala:693)
at 
kafka.server.AdminManager.prepareIncrementalConfigs(AdminManager.scala:687)
at 
kafka.server.AdminManager.$anonfun$incrementalAlterConfigs$1(AdminManager.scala:618)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:154)
at scala.collection.TraversableLike.map(TraversableLike.scala:273)
at scala.collection.TraversableLike.map$(TraversableLike.scala:266)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
kafka.server.AdminManager.incrementalAlterConfigs(AdminManager.scala:589)
at 
kafka.server.KafkaApis.handleIncrementalAlterConfigsRequest(KafkaApis.scala:2698)
at kafka.server.KafkaApis.handle(KafkaApis.scala:188)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:78)
at java.base/java.lang.Thread.run(Thread.java:834)
```

It looks like we are only allowing changes to the keys defined in `KafkaConfig` 
through this API. This excludes config changes to any plugin components such as 
`JmxReporter`. 

Note that I was able to use the regular `alterConfig` API to change this config.


> Incremental config api excludes plugin config changes
> -
>
> Key: KAFKA-10140
> URL: https://issues.apache.org/jira/browse/KAFKA-10140
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Critical
>
> I was trying to alter the jmx metric filters using the incremental alter 
> config api and hit this error:
> {code:java}
> java.util.NoSuchElementException: key not found: metrics.jmx.blacklist
> at scala.collection.MapLike.default(MapLike.scala:235)
> at scala.collection.MapLike.default$(MapLike.scala:234)
> at scala.collection.AbstractMap.default(Map.scala:65)
> at scala.collection.MapLike.apply(MapLike.scala:144)
> at scala.collection.MapLike.apply$(MapLike.scala:143)
> at scala.collection.AbstractMap.apply(Map.scala:65)
> at kafka.server.AdminManager.listType$1(AdminManager.scala:681)
> at 
> kafka.server.AdminManager.$anonfun$prepareIncrementalConfigs$1(AdminManager.scala:693)
> 

[jira] [Resolved] (KAFKA-14236) ListGroups request produces too much Denied logs in authorizer

2022-09-21 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14236.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> ListGroups request produces too much Denied logs in authorizer
> --
>
> Key: KAFKA-14236
> URL: https://issues.apache.org/jira/browse/KAFKA-14236
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.0.0
>Reporter: Alexandre GRIFFAUT
>Priority: Major
>  Labels: patch-available
> Fix For: 3.4.0
>
>
> Context
> On a multi-tenant secured cluster with many consumers, a call to the 
> ListGroups API will log an authorization error for each consumer group owned 
> by other tenants.
> Reason
> The handleListGroupsRequest function first tries to authorize a DESCRIBE 
> CLUSTER, and if it fails it will then try to authorize a DESCRIBE GROUP on 
> each consumer group.
> Fix
> In that case neither the DESCRIBE CLUSTER check nor the DESCRIBE GROUP checks 
> on other tenants' groups are intended to be audited, so they should be 
> specified in the Action using logIfDenied: false
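A hedged sketch of the fix (simplified; the Action constructor arguments follow the server authorizer API as best I can tell, with resourceReferenceCount set to 1 -- treat the exact signature as an assumption):

{code:java}
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import org.apache.kafka.server.authorizer.Action;

class ListGroupsAuthz {
    static Action describeGroupAction(String groupId) {
        ResourcePattern pattern =
            new ResourcePattern(ResourceType.GROUP, groupId, PatternType.LITERAL);
        // logIfAllowed=true, logIfDenied=false: a denial here only filters the
        // group out of the response and is not worth an audit log entry.
        return new Action(AclOperation.DESCRIBE, pattern, 1, true, false);
    }
}
{code}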



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14240) Ensure kraft metadata log dir is initialized with valid snapshot state

2022-09-19 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14240.
-
Resolution: Fixed

> Ensure kraft metadata log dir is initialized with valid snapshot state
> --
>
> Key: KAFKA-14240
> URL: https://issues.apache.org/jira/browse/KAFKA-14240
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> If the first segment under __cluster_metadata has a base offset greater than 
> 0, then there must exist at least one snapshot with an offset larger than the 
> first segment's base offset. We should check for this at startup to prevent 
> the controller from initializing with invalid state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14238) KRaft replicas can delete segments not included in a snapshot

2022-09-17 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14238.
-
Resolution: Fixed

> KRaft replicas can delete segments not included in a snapshot
> -
>
> Key: KAFKA-14238
> URL: https://issues.apache.org/jira/browse/KAFKA-14238
> Project: Kafka
>  Issue Type: Bug
>  Components: core, kraft
>Reporter: Jose Armando Garcia Sancio
>Assignee: Jose Armando Garcia Sancio
>Priority: Blocker
> Fix For: 3.3.0
>
>
> We see this in the log
> {code:java}
> Deleting segment LogSegment(baseOffset=243864, size=9269150, 
> lastModifiedTime=1662486784182, largestRecordTimestamp=Some(1662486784160)) 
> due to retention time 60480ms breach based on the largest record 
> timestamp in the segment {code}
> This then causes {{KafkaRaftClient}} to throw an exception when sending 
> batches to the listener:
> {code:java}
>  java.lang.IllegalStateException: Snapshot expected since next offset of 
> org.apache.kafka.controller.QuorumController$QuorumMetaLogListener@195461949 
> is 0, log start offset is 369668 and high-watermark is 547379
>   at 
> org.apache.kafka.raft.KafkaRaftClient.lambda$updateListenersProgress$4(KafkaRaftClient.java:312)
>   at java.base/java.util.Optional.orElseThrow(Optional.java:403)
>   at 
> org.apache.kafka.raft.KafkaRaftClient.lambda$updateListenersProgress$5(KafkaRaftClient.java:311)
>   at java.base/java.util.OptionalLong.ifPresent(OptionalLong.java:165)
>   at 
> org.apache.kafka.raft.KafkaRaftClient.updateListenersProgress(KafkaRaftClient.java:309){code}
> The on disk state for the cluster metadata partition confirms this:
> {code:java}
>  ls __cluster_metadata-0/
> 00369668.index
> 00369668.log
> 00369668.timeindex
> 00503411.index
> 00503411.log
> 00503411.snapshot
> 00503411.timeindex
> 00548746.snapshot
> leader-epoch-checkpoint
> partition.metadata
> quorum-state{code}
> Noticed that there are no {{checkpoint}} files and the log doesn't have a 
> segment at base offset 0.
> This is happening because the {{LogConfig}} used for KRaft sets the retention 
> policy to {{delete}} which causes the method {{deleteOldSegments}} to delete 
> old segments even if there is no snapshot for them. For KRaft, Kafka should 
> only delete segments that breach the log start offset.
> Log configuration for KRaft:
> {code:java}
>   val props = new Properties()
>   props.put(LogConfig.MaxMessageBytesProp, 
> config.maxBatchSizeInBytes.toString)
>   props.put(LogConfig.SegmentBytesProp, Int.box(config.logSegmentBytes))
>   props.put(LogConfig.SegmentMsProp, Long.box(config.logSegmentMillis))
>   props.put(LogConfig.FileDeleteDelayMsProp, 
> Int.box(Defaults.FileDeleteDelayMs))
>   LogConfig.validateValues(props)
>   val defaultLogConfig = LogConfig(props){code}
> Segment deletion code:
> {code:java}
>  def deleteOldSegments(): Int = {
>   if (config.delete) {
> deleteLogStartOffsetBreachedSegments() +
>   deleteRetentionSizeBreachedSegments() +
>   deleteRetentionMsBreachedSegments()
>   } else {
> deleteLogStartOffsetBreachedSegments()
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14240) Ensure kraft metadata log dir is initialized with valid snapshot state

2022-09-16 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14240:
---

 Summary: Ensure kraft metadata log dir is initialized with valid 
snapshot state
 Key: KAFKA-14240
 URL: https://issues.apache.org/jira/browse/KAFKA-14240
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson
Assignee: Jason Gustafson


If the first segment under __cluster_metadata has a base offset greater than 0, 
then there must exist at least one snapshot with an offset larger than the first 
segment's base offset. We should check for this at startup to prevent the 
controller from initializing with invalid state.
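A hedged sketch of that startup check (hypothetical helper, not the actual log loading code):

{code:java}
import java.util.OptionalLong;

class MetadataLogValidator {
    static void validate(long firstSegmentBaseOffset, OptionalLong latestSnapshotEndOffset) {
        if (firstSegmentBaseOffset > 0) {
            boolean covered = latestSnapshotEndOffset.isPresent()
                && latestSnapshotEndOffset.getAsLong() >= firstSegmentBaseOffset;
            if (!covered) {
                throw new IllegalStateException(
                    "__cluster_metadata starts at offset " + firstSegmentBaseOffset
                        + " but no snapshot covers the truncated prefix");
            }
        }
    }
}
{code}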



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14224) Consumer should auto-commit prior to cooperative partition revocation

2022-09-13 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14224:

Description: With the old "eager" reassignment logic, we always revoked all 
partitions prior to each rebalance. When auto-commit is enabled, a part of this 
process is committing current positions. Under the new "cooperative" logic, we 
defer revocation until after the rebalance, which means we can continue 
fetching while the rebalance is in progress. However, when reviewing 
KAFKA-14196, we noticed that there is no similar logic to commit offsets prior 
to this deferred revocation. This means that cooperative consumption is more 
likely to lead to duplicate consumption even when there is no failure 
involved.  (was: With the old "eager" reassignment logic, we always revoked all 
partitions prior to each rebalance. When auto-commit is enabled, a part of this 
process is committing current position. Under the new "cooperative" logic, we 
defer revocation until after the rebalance, which means we can continue 
fetching while the rebalance is in progress. However, when reviewing 
KAFKA-14196, we noticed that there is no similar logic to commit offsets prior 
to this deferred revocation. This means that cooperative consumption is more 
likely to lead to duplicate consumption even when there is no failure 
involved.)

> Consumer should auto-commit prior to cooperative partition revocation
> -
>
> Key: KAFKA-14224
> URL: https://issues.apache.org/jira/browse/KAFKA-14224
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>  Labels: new-consumer-threading-should-fix
>
> With the old "eager" reassignment logic, we always revoked all partitions 
> prior to each rebalance. When auto-commit is enabled, a part of this process 
> is committing current positions. Under the new "cooperative" logic, we defer 
> revocation until after the rebalance, which means we can continue fetching 
> while the rebalance is in progress. However, when reviewing KAFKA-14196, we 
> noticed that there is no similar logic to commit offsets prior to this 
> deferred revocation. This means that cooperative consumption is more likely 
> to lead to duplicate consumption even when there is no failure involved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14196) Duplicated consumption during rebalance, causing OffsetValidationTest to act flaky

2022-09-12 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603328#comment-17603328
 ] 

Jason Gustafson commented on KAFKA-14196:
-

>  Also, this is currently marked as a blocker. Is there a crisp description of 
>the regression?

Prior to revocation, eager rebalance strategies will attempt to auto-commit 
offsets before revoking partitions and joining the rebalance. Originally this 
logic was synchronous, which meant there was no opportunity for additional data 
to be returned before the revocation completed. This changed when we introduced 
asynchronous offset commit logic. Any progress made between the time the 
asynchronous offset commit was sent and the revocation completed would be lost. 
This results in duplicate consumption.
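For reference, the first mitigation listed in the description below corresponds to a user-level rebalance listener along these lines (a sketch of the workaround, not the internal fix):

{code:java}
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class CommitOnRevoke implements ConsumerRebalanceListener {
    private final KafkaConsumer<?, ?> consumer;

    CommitOnRevoke(KafkaConsumer<?, ?> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Block until offsets for records already returned by poll() are
        // committed, before ownership of the partitions moves.
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // nothing to do
    }
}
{code}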

> Duplicated consumption during rebalance, causing OffsetValidationTest to act 
> flaky
> --
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.2.1
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Blocker
>  Labels: new-consumer-threading-should-fix
> Fix For: 3.3.0, 3.2.2
>
>
> Several flaky tests under OffsetValidationTest are indicating potential 
> consumer duplication issue, when autocommit is enabled.  I believe this is 
> affecting *3.2* and onward.  Below shows the failure message:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that the data consumed between the 
> start of a rebalance event and the async commit was lost for those failing 
> tests.  In the example below, the rebalance event kicks in at around 
> 1662054846995 (first record), and the async commit of the offset 3739 is 
> completed at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
> The above attempts seem to suggest that the contract between poll() and 
> asyncCommit() is broken.  AFAIK, we implicitly use poll() to ack the 
> previously fetched data, and the consumer would (try to) commit these offsets 
> in the current poll() loop.  However, it seems like as the poll continues to 
> loop, the "acked" data isn't being committed.
>  
> I believe this could be introduced in  KAFKA-14024, which originated from 
> KAFKA-13310.
> More specifically (see the comments below), the ConsumerCoordinator will 
> always return before the async commit, due to the previous incomplete commit.  
> However, this is a bit contradictory here because:
>  # I think we want to commit asynchronously while the poll continues, and if 
> we do that, we are back to KAFKA-14024, that the consumer will get rebalance 
> timeout and get kicked out of the group.
>  # But we also need to commit all the "acked" offsets before revoking the 
> partition, and this has to be blocked.
> *Steps to Reproduce the Issue:*
>  # Check out AK 3.2
>  # Run this several times (recommend running only the autocommit-enabled 
> cases in consumer_test.py to save time):
> {code:java}
> _DUCKTAPE_OPTIONS="--debug" 
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_consumer_failure"
>  bash tests/docker/run_tests.sh {code}
>  
> *Steps to Diagnose the Issue:*
>  # Open the test results in *results/*
>  # Go to the consumer log.  It might look like this
>  
> {code:java}
> results/2022-09-03--005/OffsetValidationTest/test_consumer_failure/clean_shutdown=True.enable_autocommit=True.metadata_quorum=ZK/2/VerifiableConsumer-0-xx/dockerYY
>  {code}
> 3. Find the docker instance that has partition getting revoked 

[jira] [Resolved] (KAFKA-14215) KRaft forwarded requests have no quota enforcement

2022-09-12 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14215.
-
Resolution: Fixed

> KRaft forwarded requests have no quota enforcement
> --
>
> Key: KAFKA-14215
> URL: https://issues.apache.org/jira/browse/KAFKA-14215
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.3
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 3.3.0, 3.3
>
>
> On the broker, the `BrokerMetadataPublisher` is responsible for propagating 
> quota changes from `ClientQuota` records to `ClientQuotaManager`. On the 
> controller, there is no similar logic, so no client quotas are enforced on 
> the controller.
> On the broker side, there is no enforcement either, since the broker assumes 
> that the controller will be the one to do it. Basically it looks at the 
> throttle time returned in the response from the controller. If it is 0, then 
> the response is sent immediately without any throttling. 
> So the consequence of both of these issues is that controller-bound requests 
> have no throttling today.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14224) Consumer should auto-commit prior to cooperative partition revocation

2022-09-12 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14224:

Description: With the old "eager" reassignment logic, we always revoked all 
partitions prior to each rebalance. When auto-commit is enabled, a part of this 
process is committing current position. Under the new "cooperative" logic, we 
defer revocation until after the rebalance, which means we can continue 
fetching while the rebalance is in progress. However, when reviewing 
KAFKA-14196, we noticed that there is no similar logic to commit offsets prior 
to this deferred revocation. This means that cooperative consumption is more 
likely to lead to duplicate consumption even when there is no failure 
involved.  (was: With the old "eager" reassignment logic, we always revoked all 
partitions prior to each rebalance. When auto-commit is enabled, a part of this 
process is committing current position. Under the cooperative logic, we defer 
revocation until after the rebalance, which means we can continue fetching 
while the rebalance is in progress. However, when reviewing KAFKA-14196, we 
noticed that there is no similar logic to commit offsets prior to this deferred 
revocation. This means that cooperative consumption is more likely to lead to 
duplicate consumption even when there is no failure involved.)

> Consumer should auto-commit prior to cooperative partition revocation
> -
>
> Key: KAFKA-14224
> URL: https://issues.apache.org/jira/browse/KAFKA-14224
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> With the old "eager" reassignment logic, we always revoked all partitions 
> prior to each rebalance. When auto-commit is enabled, a part of this process 
> is committing current position. Under the new "cooperative" logic, we defer 
> revocation until after the rebalance, which means we can continue fetching 
> while the rebalance is in progress. However, when reviewing KAFKA-14196, we 
> noticed that there is no similar logic to commit offsets prior to this 
> deferred revocation. This means that cooperative consumption is more likely 
> to lead to duplicate consumption even when there is no failure involved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14224) Consumer should auto-commit prior to cooperative partition revocation

2022-09-12 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14224:
---

 Summary: Consumer should auto-commit prior to cooperative 
partition revocation
 Key: KAFKA-14224
 URL: https://issues.apache.org/jira/browse/KAFKA-14224
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


With the old "eager" reassignment logic, we always revoked all partitions prior 
to each rebalance. When auto-commit is enabled, a part of this process is 
committing current position. Under the cooperative logic, we defer revocation 
until after the rebalance, which means we can continue fetching while the 
rebalance is in progress. However, when reviewing KAFKA-14196, we noticed that 
there is no similar logic to commit offsets prior to this deferred revocation. 
This means that cooperative consumption is more likely to lead to 
duplicate consumption even when there is no failure involved.
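
As context for the description above, a consumer opts into the cooperative protocol through its assignment strategy. A minimal sketch of the relevant configuration fragment (the bootstrap address and group id are placeholders; deserializers and the rest of the setup are omitted):
{code:java}
// Consumer configuration fragment (illustrative only).
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // placeholder
props.put("group.id", "orders-processor");          // placeholder
props.put("enable.auto.commit", "true");            // auto-commit enabled
// Cooperative ("incremental") rebalancing: partitions are revoked only after
// the rebalance completes, which is the window discussed in this ticket.
props.put("partition.assignment.strategy",
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
{code}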



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14215) KRaft forwarded requests have no quota enforcement

2022-09-09 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14215:
---

 Summary: KRaft forwarded requests have no quota enforcement
 Key: KAFKA-14215
 URL: https://issues.apache.org/jira/browse/KAFKA-14215
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


On the broker, the `BrokerMetadataPublisher` is responsible for propagating 
quota changes from `ClientQuota` records to `ClientQuotaManager`. On the 
controller, there is no similar logic, so no client quotas are enforced on the 
controller.

On the broker side, there is no enforcement either, since the broker assumes 
that the controller will be the one to do it. Basically, the broker looks at the 
throttle time returned in the response from the controller. If it is 0, then 
the response is sent immediately without any throttling. 

So the consequence of both of these issues is that controller-bound requests 
have no throttling today.
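
To make the second paragraph concrete, a rough sketch of the forwarding path's throttle handling as described above. This is illustrative Java-style pseudocode only; the names (envelopeResponse, clientQuotaManager, channel) are placeholders, not the actual broker classes.
{code:java}
// Illustrative only: how the forwarding broker treats the controller's response.
int throttleTimeMs = envelopeResponse.throttleTimeMs();
if (throttleTimeMs > 0) {
    // Delay the response by the amount the controller asked for.
    clientQuotaManager.throttle(request, throttleTimeMs, channel::sendResponse);
} else {
    // The controller reported 0, so the broker forwards the response immediately.
    // Since the controller never computes a quota, this branch is always taken today.
    channel.sendResponse(envelopeResponse);
}
{code}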

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14196) Duplicated consumption during rebalance, causing OffsetValidationTest to act flaky

2022-09-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14196:

Affects Version/s: 3.2.1
   (was: 3.3.0)

> Duplicated consumption during rebalance, causing OffsetValidationTest to act 
> flaky
> --
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.2.1
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: new-consumer-threading-should-fix
>
> Several flaky tests under OffsetValidationTest are indicating a potential 
> consumer duplication issue when autocommit is enabled.  I believe this 
> affects *3.2* and onward.  Below is the failure message:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that the data consumed between the 
> start of a rebalance event and the async commit was lost for those failing 
> tests.  In the example below, the rebalance event kicks in at around 
> 1662054846995 (first record), and the async commit of the offset 3739 is 
> completed at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
> The above attempts seem to suggest that the contract between poll() and 
> asyncCommit() is broken.  AFAIK, we implicitly use poll() to ack the 
> previously fetched data, and the consumer would (try to) commit these offsets 
> in the current poll() loop.  However, it seems that as the poll continues to 
> loop, the "acked" data isn't being committed.
>  
> I believe this could have been introduced by KAFKA-14024, which originated 
> from KAFKA-13310.
> More specifically (see the comments below), the ConsumerCoordinator will 
> always return before the async commit, due to the previous incomplete commit.  
> However, this is a bit contradictory, because:
>  # I think we want to commit asynchronously while the poll continues, and if 
> we do that, we are back to KAFKA-14024: the consumer will hit the rebalance 
> timeout and get kicked out of the group.
>  # But we also need to commit all the "acked" offsets before revoking the 
> partition, and that commit has to block.
> *Steps to Reproduce the Issue:*
>  # Check out AK 3.2
>  # Run this several times (recommended: run only the cases with autocommit 
> enabled in consumer_test.py, to save time):
> {code:java}
> _DUCKTAPE_OPTIONS="--debug" 
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_consumer_failure"
>  bash tests/docker/run_tests.sh {code}
>  
> *Steps to Diagnose the Issue:*
>  # Open the test results in *results/*
>  # Go to the consumer log.  It might look like this
>  
> {code:java}
> results/2022-09-03--005/OffsetValidationTest/test_consumer_failure/clean_shutdown=True.enable_autocommit=True.metadata_quorum=ZK/2/VerifiableConsumer-0-xx/dockerYY
>  {code}
> 3. Find the docker instance that has a partition getting revoked and rejoined.  
> Observe the offsets before and after.
> *Propose Fixes:*
>  TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14196) Duplicated consumption during rebalance, causing OffsetValidationTest to act flaky

2022-09-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14196:

Priority: Blocker  (was: Major)

> Duplicated consumption during rebalance, causing OffsetValidationTest to act 
> flaky
> --
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.2.1
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Blocker
>  Labels: new-consumer-threading-should-fix
> Fix For: 3.3.0, 3.2.2
>
>
> Several flaky tests under OffsetValidationTest are indicating a potential 
> consumer duplication issue when autocommit is enabled.  I believe this 
> affects *3.2* and onward.  Below is the failure message:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that the data consumed between the 
> start of a rebalance event and the async commit was lost for those failing 
> tests.  In the example below, the rebalance event kicks in at around 
> 1662054846995 (first record), and the async commit of the offset 3739 is 
> completed at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
> The above attempts seem to suggest that the contract between poll() and 
> asyncCommit() is broken.  AFAIK, we implicitly use poll() to ack the 
> previously fetched data, and the consumer would (try to) commit these offsets 
> in the current poll() loop.  However, it seems that as the poll continues to 
> loop, the "acked" data isn't being committed.
>  
> I believe this could have been introduced by KAFKA-14024, which originated 
> from KAFKA-13310.
> More specifically (see the comments below), the ConsumerCoordinator will 
> always return before the async commit, due to the previous incomplete commit.  
> However, this is a bit contradictory, because:
>  # I think we want to commit asynchronously while the poll continues, and if 
> we do that, we are back to KAFKA-14024: the consumer will hit the rebalance 
> timeout and get kicked out of the group.
>  # But we also need to commit all the "acked" offsets before revoking the 
> partition, and that commit has to block.
> *Steps to Reproduce the Issue:*
>  # Check out AK 3.2
>  # Run this several times (recommended: run only the cases with autocommit 
> enabled in consumer_test.py, to save time):
> {code:java}
> _DUCKTAPE_OPTIONS="--debug" 
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_consumer_failure"
>  bash tests/docker/run_tests.sh {code}
>  
> *Steps to Diagnose the Issue:*
>  # Open the test results in *results/*
>  # Go to the consumer log.  It might look like this
>  
> {code:java}
> results/2022-09-03--005/OffsetValidationTest/test_consumer_failure/clean_shutdown=True.enable_autocommit=True.metadata_quorum=ZK/2/VerifiableConsumer-0-xx/dockerYY
>  {code}
> 3. Find the docker instance that has a partition getting revoked and rejoined.  
> Observe the offsets before and after.
> *Propose Fixes:*
>  TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14196) Duplicated consumption during rebalance, causing OffsetValidationTest to act flaky

2022-09-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14196:

Fix Version/s: 3.3.0
   3.2.2

> Duplicated consumption during rebalance, causing OffsetValidationTest to act 
> flaky
> --
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.2.1
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: new-consumer-threading-should-fix
> Fix For: 3.3.0, 3.2.2
>
>
> Several flaky tests under OffsetValidationTest are indicating a potential 
> consumer duplication issue when autocommit is enabled.  I believe this 
> affects *3.2* and onward.  Below is the failure message:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that the data consumed between the 
> start of a rebalance event and the async commit was lost for those failing 
> tests.  In the example below, the rebalance event kicks in at around 
> 1662054846995 (first record), and the async commit of the offset 3739 is 
> completed at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
> The above attempts seem to suggest that the contract between poll() and 
> asyncCommit() is broken.  AFAIK, we implicitly use poll() to ack the 
> previously fetched data, and the consumer would (try to) commit these offsets 
> in the current poll() loop.  However, it seems that as the poll continues to 
> loop, the "acked" data isn't being committed.
>  
> I believe this could have been introduced by KAFKA-14024, which originated 
> from KAFKA-13310.
> More specifically (see the comments below), the ConsumerCoordinator will 
> always return before the async commit, due to the previous incomplete commit.  
> However, this is a bit contradictory, because:
>  # I think we want to commit asynchronously while the poll continues, and if 
> we do that, we are back to KAFKA-14024: the consumer will hit the rebalance 
> timeout and get kicked out of the group.
>  # But we also need to commit all the "acked" offsets before revoking the 
> partition, and that commit has to block.
> *Steps to Reproduce the Issue:*
>  # Check out AK 3.2
>  # Run this several times (recommended: run only the cases with autocommit 
> enabled in consumer_test.py, to save time):
> {code:java}
> _DUCKTAPE_OPTIONS="--debug" 
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_consumer_failure"
>  bash tests/docker/run_tests.sh {code}
>  
> *Steps to Diagnose the Issue:*
>  # Open the test results in *results/*
>  # Go to the consumer log.  It might look like this
>  
> {code:java}
> results/2022-09-03--005/OffsetValidationTest/test_consumer_failure/clean_shutdown=True.enable_autocommit=True.metadata_quorum=ZK/2/VerifiableConsumer-0-xx/dockerYY
>  {code}
> 3. Find the docker instance that has a partition getting revoked and rejoined.  
> Observe the offsets before and after.
> *Propose Fixes:*
>  TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14196) Duplicated consumption during rebalance, causing OffsetValidationTest to act flaky

2022-09-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14196:

Affects Version/s: (was: 3.2.1)

> Duplicated consumption during rebalance, causing OffsetValidationTest to act 
> flaky
> --
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.3.0
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: new-consumer-threading-should-fix
>
> Several flaky tests under OffsetValidationTest are indicating a potential 
> consumer duplication issue when autocommit is enabled.  I believe this 
> affects *3.2* and onward.  Below is the failure message:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that the data consumed between the 
> start of a rebalance event and the async commit was lost for those failing 
> tests.  In the example below, the rebalance event kicks in at around 
> 1662054846995 (first record), and the async commit of the offset 3739 is 
> completed at around 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
> The above attempts seem to suggest that the contract between poll() and 
> asyncCommit() is broken.  AFAIK, we implicitly use poll() to ack the 
> previously fetched data, and the consumer would (try to) commit these offsets 
> in the current poll() loop.  However, it seems that as the poll continues to 
> loop, the "acked" data isn't being committed.
>  
> I believe this could have been introduced by KAFKA-14024, which originated 
> from KAFKA-13310.
> More specifically (see the comments below), the ConsumerCoordinator will 
> always return before the async commit, due to the previous incomplete commit.  
> However, this is a bit contradictory, because:
>  # I think we want to commit asynchronously while the poll continues, and if 
> we do that, we are back to KAFKA-14024: the consumer will hit the rebalance 
> timeout and get kicked out of the group.
>  # But we also need to commit all the "acked" offsets before revoking the 
> partition, and that commit has to block.
> *Steps to Reproduce the Issue:*
>  # Check out AK 3.2
>  # Run this several times (recommended: run only the cases with autocommit 
> enabled in consumer_test.py, to save time):
> {code:java}
> _DUCKTAPE_OPTIONS="--debug" 
> TC_PATHS="tests/kafkatest/tests/client/consumer_test.py::OffsetValidationTest.test_consumer_failure"
>  bash tests/docker/run_tests.sh {code}
>  
> *Steps to Diagnose the Issue:*
>  # Open the test results in *results/*
>  # Go to the consumer log.  It might look like this
>  
> {code:java}
> results/2022-09-03--005/OffsetValidationTest/test_consumer_failure/clean_shutdown=True.enable_autocommit=True.metadata_quorum=ZK/2/VerifiableConsumer-0-xx/dockerYY
>  {code}
> 3. Find the docker instance that has a partition getting revoked and rejoined.  
> Observe the offsets before and after.
> *Propose Fixes:*
>  TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14201) Consumer should not send group instance ID if committing with empty member ID

2022-09-02 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599744#comment-17599744
 ] 

Jason Gustafson commented on KAFKA-14201:
-

Worth mentioning that the alternative is to make the server more permissive and 
just ignore instance ID if it is set in OffsetCommit while memberId is empty.

> Consumer should not send group instance ID if committing with empty member ID
> -
>
> Key: KAFKA-14201
> URL: https://issues.apache.org/jira/browse/KAFKA-14201
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> The consumer group instance ID is used to support a notion of "static" 
> consumer groups. The idea is to be able to identify the same group instance 
> across restarts so that a rebalance is not needed. However, if the user sets 
> `group.instance.id` in the consumer configuration, but uses "simple" 
> assignment with `assign()`, then the instance ID nevertheless is sent in the 
> OffsetCommit request to the coordinator. This may result in a surprising 
> UNKNOWN_MEMBER_ID error. The consumer should probably be smart enough to only 
> send the instance ID when committing as part of a consumer group.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14201) Consumer should not send group instance ID if committing with empty member ID

2022-09-02 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14201:
---

 Summary: Consumer should not send group instance ID if committing 
with empty member ID
 Key: KAFKA-14201
 URL: https://issues.apache.org/jira/browse/KAFKA-14201
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


The consumer group instance ID is used to support a notion of "static" consumer 
groups. The idea is to be able to identify the same group instance across 
restarts so that a rebalance is not needed. However, if the user sets 
`group.instance.id` in the consumer configuration, but uses "simple" assignment 
with `assign()`, then the instance ID nevertheless is sent in the OffsetCommit 
request to the coordinator. This may result in a surprising UNKNOWN_MEMBER_ID 
error. The consumer should probably be smart enough to only send the instance 
ID when committing as part of a consumer group.
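
A minimal sketch of the configuration/usage combination described above (the bootstrap address, group and instance ids, and topic are placeholders). With this setup, the commit carries the group instance ID even though the member ID is empty, which the coordinator may reject with UNKNOWN_MEMBER_ID.
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class StaticIdWithAssignExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "my-group");                   // placeholder
        props.put("group.instance.id", "instance-1");        // static membership id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "Simple" assignment: no group membership, so the member id stays empty.
            TopicPartition tp = new TopicPartition("test_topic", 0);
            consumer.assign(Collections.singletonList(tp));
            // The instance id above is still attached to this OffsetCommit request.
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(0L)));
        }
    }
}
{code}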



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14177) Correctly support older kraft versions without FeatureLevelRecord

2022-08-25 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14177.
-
Resolution: Fixed

> Correctly support older kraft versions without FeatureLevelRecord
> -
>
> Key: KAFKA-14177
> URL: https://issues.apache.org/jira/browse/KAFKA-14177
> Project: Kafka
>  Issue Type: Bug
>Reporter: Colin McCabe
>Assignee: Colin McCabe
>Priority: Blocker
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14183) Kraft bootstrap metadata file should use snapshot header/footer

2022-08-25 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-14183:
---

Assignee: Jason Gustafson

> Kraft bootstrap metadata file should use snapshot header/footer
> ---
>
> Key: KAFKA-14183
> URL: https://issues.apache.org/jira/browse/KAFKA-14183
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> The bootstrap checkpoint file that we use in kraft is intended to follow the 
> usual snapshot format, but currently it does not include the header/footer 
> control records. The main purpose of these at the moment is to set a version 
> for the checkpoint file itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14183) Kraft bootstrap metadata file should use snapshot header/footer

2022-08-25 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14183:

Description: The bootstrap checkpoint file that we use in kraft is intended 
to follow the usual snapshot format, but currently it does not include the 
header/footer control records. The main purpose of these at the moment is to 
set a version for the checkpoint file itself.

> Kraft bootstrap metadata file should use snapshot header/footer
> ---
>
> Key: KAFKA-14183
> URL: https://issues.apache.org/jira/browse/KAFKA-14183
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> The bootstrap checkpoint file that we use in kraft is intended to follow the 
> usual snapshot format, but currently it does not include the header/footer 
> control records. The main purpose of these at the moment is to set a version 
> for the checkpoint file itself.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14183) Kraft bootstrap metadata file should use snapshot header/footer

2022-08-25 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14183:
---

 Summary: Kraft bootstrap metadata file should use snapshot 
header/footer
 Key: KAFKA-14183
 URL: https://issues.apache.org/jira/browse/KAFKA-14183
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
 Fix For: 3.3.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-13166) EOFException when Controller handles unknown API

2022-08-22 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583117#comment-17583117
 ] 

Jason Gustafson edited comment on KAFKA-13166 at 8/22/22 5:42 PM:
--

I am going to resolve this jira. There were two issues that we have fixed:
 * [https://github.com/apache/kafka/pull/12403]: This patch fixes error 
handling in `ControllerApis` so that we catch errors which occur during 
serialization and sending of the response. Prior to this, any error that 
occurred during this phase would be silently ignored and the connection would 
be left hanging.
 * [https://github.com/apache/kafka/pull/12538]: This patch fixes a bug during 
kraft shutdown which can cause a response to try to be sent after the 
controller's socket server has been shutdown. When this occurs, we get an 
exception trace which matches that in the description above.


was (Author: hachikuji):
I am going to resolve this jira. There were two issues that we have fixed:
 * [https://github.com/apache/kafka/pull/12403]: This patch fixes error 
handling in `ControllerApis` so that we catch errors which occur during 
serialization and sending of the response. Prior to this, any error that 
occurred during this phase would be silently ignored and the connection would 
be left hanging.
 * [https://github.com/apache/kafka/pull/12538]: This patch fixes a bug during 
kraft shutdown which can cause a response to be sent after the controller's 
socket server has been shutdown. When this occurs, we get an exception trace 
which matches that in the description above.

> EOFException when Controller handles unknown API
> 
>
> Key: KAFKA-13166
> URL: https://issues.apache.org/jira/browse/KAFKA-13166
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Minor
> Fix For: 3.3.0
>
>
> When ControllerApis handles an unsupported RPC, it silently drops the request 
> due to an unhandled exception. 
> The following stack trace was manually printed since this exception was 
> suppressed on the controller. 
> {code}
> java.util.NoSuchElementException: key not found: UpdateFeatures
>   at scala.collection.MapOps.default(Map.scala:274)
>   at scala.collection.MapOps.default$(Map.scala:273)
>   at scala.collection.AbstractMap.default(Map.scala:405)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:425)
>   at kafka.network.RequestChannel$Metrics.apply(RequestChannel.scala:74)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1(RequestChannel.scala:458)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1$adapted(RequestChannel.scala:457)
>   at 
> kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
>   at 
> scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
>   at 
> kafka.network.RequestChannel.updateErrorMetrics(RequestChannel.scala:457)
>   at kafka.network.RequestChannel.sendResponse(RequestChannel.scala:388)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorOrCloseConnection(RequestHandlerHelper.scala:93)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorResponseMaybeThrottle(RequestHandlerHelper.scala:121)
>   at 
> kafka.server.RequestHandlerHelper.handleError(RequestHandlerHelper.scala:78)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:116)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1(ControllerApis.scala:125)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1$adapted(ControllerApis.scala:125)
>   at 
> kafka.server.EnvelopeUtils$.handleEnvelopeRequest(EnvelopeUtils.scala:65)
>   at 
> kafka.server.ControllerApis.handleEnvelopeRequest(ControllerApis.scala:125)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:103)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is due to a bug in the metrics code in RequestChannel.
> The result is that the request fails, but no indication is given that it was 
> due to an unsupported API on either the broker, controller, or client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13166) EOFException when Controller handles unknown API

2022-08-22 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583117#comment-17583117
 ] 

Jason Gustafson commented on KAFKA-13166:
-

I am going to resolve this jira. There were two issues that we have fixed:
 * [https://github.com/apache/kafka/pull/12403]: This patch fixes error 
handling in `ControllerApis` so that we catch errors which occur during 
serialization and sending of the response. Prior to this, any error that 
occurred during this phase would be silently ignored and the connection would 
be left hanging.
 * [https://github.com/apache/kafka/pull/12538]: This patch fixes a bug during 
kraft shutdown which can cause a response to be sent after the controller's 
socket server has been shutdown. When this occurs, we get an exception trace 
which matches that in the description above.

> EOFException when Controller handles unknown API
> 
>
> Key: KAFKA-13166
> URL: https://issues.apache.org/jira/browse/KAFKA-13166
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
>
> When ControllerApis handles an unsupported RPC, it silently drops the request 
> due to an unhandled exception. 
> The following stack trace was manually printed since this exception was 
> suppressed on the controller. 
> {code}
> java.util.NoSuchElementException: key not found: UpdateFeatures
>   at scala.collection.MapOps.default(Map.scala:274)
>   at scala.collection.MapOps.default$(Map.scala:273)
>   at scala.collection.AbstractMap.default(Map.scala:405)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:425)
>   at kafka.network.RequestChannel$Metrics.apply(RequestChannel.scala:74)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1(RequestChannel.scala:458)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1$adapted(RequestChannel.scala:457)
>   at 
> kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
>   at 
> scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
>   at 
> kafka.network.RequestChannel.updateErrorMetrics(RequestChannel.scala:457)
>   at kafka.network.RequestChannel.sendResponse(RequestChannel.scala:388)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorOrCloseConnection(RequestHandlerHelper.scala:93)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorResponseMaybeThrottle(RequestHandlerHelper.scala:121)
>   at 
> kafka.server.RequestHandlerHelper.handleError(RequestHandlerHelper.scala:78)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:116)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1(ControllerApis.scala:125)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1$adapted(ControllerApis.scala:125)
>   at 
> kafka.server.EnvelopeUtils$.handleEnvelopeRequest(EnvelopeUtils.scala:65)
>   at 
> kafka.server.ControllerApis.handleEnvelopeRequest(ControllerApis.scala:125)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:103)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is due to a bug in the metrics code in RequestChannel.
> The result is that the request fails, but no indication is given that it was 
> due to an unsupported API on either the broker, controller, or client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13166) EOFException when Controller handles unknown API

2022-08-22 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-13166:

Fix Version/s: (was: 3.4.0)

> EOFException when Controller handles unknown API
> 
>
> Key: KAFKA-13166
> URL: https://issues.apache.org/jira/browse/KAFKA-13166
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Minor
> Fix For: 3.3.0
>
>
> When ControllerApis handles an unsupported RPC, it silently drops the request 
> due to an unhandled exception. 
> The following stack trace was manually printed since this exception was 
> suppressed on the controller. 
> {code}
> java.util.NoSuchElementException: key not found: UpdateFeatures
>   at scala.collection.MapOps.default(Map.scala:274)
>   at scala.collection.MapOps.default$(Map.scala:273)
>   at scala.collection.AbstractMap.default(Map.scala:405)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:425)
>   at kafka.network.RequestChannel$Metrics.apply(RequestChannel.scala:74)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1(RequestChannel.scala:458)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1$adapted(RequestChannel.scala:457)
>   at 
> kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
>   at 
> scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
>   at 
> kafka.network.RequestChannel.updateErrorMetrics(RequestChannel.scala:457)
>   at kafka.network.RequestChannel.sendResponse(RequestChannel.scala:388)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorOrCloseConnection(RequestHandlerHelper.scala:93)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorResponseMaybeThrottle(RequestHandlerHelper.scala:121)
>   at 
> kafka.server.RequestHandlerHelper.handleError(RequestHandlerHelper.scala:78)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:116)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1(ControllerApis.scala:125)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1$adapted(ControllerApis.scala:125)
>   at 
> kafka.server.EnvelopeUtils$.handleEnvelopeRequest(EnvelopeUtils.scala:65)
>   at 
> kafka.server.ControllerApis.handleEnvelopeRequest(ControllerApis.scala:125)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:103)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is due to a bug in the metrics code in RequestChannel.
> The result is that the request fails, but no indication is given that it was 
> due to an unsupported API on either the broker, controller, or client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13166) EOFException when Controller handles unknown API

2022-08-22 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13166.
-
Resolution: Fixed

> EOFException when Controller handles unknown API
> 
>
> Key: KAFKA-13166
> URL: https://issues.apache.org/jira/browse/KAFKA-13166
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
>
> When ControllerApis handles an unsupported RPC, it silently drops the request 
> due to an unhandled exception. 
> The following stack trace was manually printed since this exception was 
> suppressed on the controller. 
> {code}
> java.util.NoSuchElementException: key not found: UpdateFeatures
>   at scala.collection.MapOps.default(Map.scala:274)
>   at scala.collection.MapOps.default$(Map.scala:273)
>   at scala.collection.AbstractMap.default(Map.scala:405)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:425)
>   at kafka.network.RequestChannel$Metrics.apply(RequestChannel.scala:74)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1(RequestChannel.scala:458)
>   at 
> kafka.network.RequestChannel.$anonfun$updateErrorMetrics$1$adapted(RequestChannel.scala:457)
>   at 
> kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:359)
>   at 
> scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355)
>   at 
> scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309)
>   at 
> kafka.network.RequestChannel.updateErrorMetrics(RequestChannel.scala:457)
>   at kafka.network.RequestChannel.sendResponse(RequestChannel.scala:388)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorOrCloseConnection(RequestHandlerHelper.scala:93)
>   at 
> kafka.server.RequestHandlerHelper.sendErrorResponseMaybeThrottle(RequestHandlerHelper.scala:121)
>   at 
> kafka.server.RequestHandlerHelper.handleError(RequestHandlerHelper.scala:78)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:116)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1(ControllerApis.scala:125)
>   at 
> kafka.server.ControllerApis.$anonfun$handleEnvelopeRequest$1$adapted(ControllerApis.scala:125)
>   at 
> kafka.server.EnvelopeUtils$.handleEnvelopeRequest(EnvelopeUtils.scala:65)
>   at 
> kafka.server.ControllerApis.handleEnvelopeRequest(ControllerApis.scala:125)
>   at kafka.server.ControllerApis.handle(ControllerApis.scala:103)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is due to a bug in the metrics code in RequestChannel.
> The result is that the request fails, but no indication is given that it was 
> due to an unsupported API on either the broker, controller, or client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13914) Implement kafka-metadata-quorum.sh

2022-08-20 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13914.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

> Implement kafka-metadata-quorum.sh
> --
>
> Key: KAFKA-13914
> URL: https://issues.apache.org/jira/browse/KAFKA-13914
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: dengziming
>Priority: Major
> Fix For: 3.3.0
>
>
> KIP-595 documents a tool for describing quorum status 
> `kafka-metadata-quorum.sh`: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum#KIP595:ARaftProtocolfortheMetadataQuorum-ToolingSupport.]
>   We need to implement this.
> Note that this depends on the Admin API for `DescribeQuorum`, which is 
> proposed in KIP-836: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-836%3A+Addition+of+Information+in+DescribeQuorumResponse+about+Voter+Lag.]
>  
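
For reference, a short sketch of querying quorum status through the Admin API that backs this tool (the describeMetadataQuorum call introduced by KIP-836; the bootstrap address is a placeholder and error handling is omitted):
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

public class DescribeQuorumExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("leader id: " + quorum.leaderId());
            quorum.voters().forEach(v ->
                System.out.println("voter " + v.replicaId() + " LEO=" + v.logEndOffset()));
            quorum.observers().forEach(o ->
                System.out.println("observer " + o.replicaId() + " LEO=" + o.logEndOffset()));
        }
    }
}
{code}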



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14167) Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller

2022-08-17 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14167.
-
Resolution: Fixed

> Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller
> 
>
> Key: KAFKA-14167
> URL: https://issues.apache.org/jira/browse/KAFKA-14167
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 3.3.0
>
>
> In `ControllerApis`, we have callbacks such as the following after completion:
> {code:java}
>     controller.allocateProducerIds(context, allocatedProducerIdsRequest.data)
>       .handle[Unit] { (results, exception) =>
>         if (exception != null) {
>           requestHelper.handleError(request, exception)
>         } else {
>           requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs 
> => {
>             results.setThrottleTimeMs(requestThrottleMs)
>             new AllocateProducerIdsResponse(results)
>           })
>         }
>       } {code}
> What I see locally is that the underlying exception that gets passed to 
> `handle` always gets wrapped in a `CompletionException`. When passed to 
> `getErrorResponse`, this error will get converted to `UNKNOWN_SERVER_ERROR`. 
> For example, in this case, a `NOT_CONTROLLER` error returned from the 
> controller would be returned as `UNKNOWN_SERVER_ERROR`. It looks like there 
> are a few APIs that are potentially affected by this bug, such as 
> `DeleteTopics` and `UpdateFeatures`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13940) DescribeQuorum returns INVALID_REQUEST if not handled by leader

2022-08-17 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13940.
-
Resolution: Fixed

> DescribeQuorum returns INVALID_REQUEST if not handled by leader
> ---
>
> Key: KAFKA-13940
> URL: https://issues.apache.org/jira/browse/KAFKA-13940
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> In `KafkaRaftClient.handleDescribeQuorum`, we currently return 
> INVALID_REQUEST if the node is not the current raft leader. This is 
> surprising and doesn't work with our general approach for retrying forwarded 
> APIs. In `BrokerToControllerChannelManager`, we only retry after 
> `NOT_CONTROLLER` errors. It would be more consistent with the other Raft APIs 
> if we returned NOT_LEADER_OR_FOLLOWER, but that also means we need additional 
> logic in `BrokerToControllerChannelManager` to handle that error and retry 
> correctly. 
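
As an illustration of the retry behavior described above, a pseudocode sketch only; the names below are placeholders, not the actual BrokerToControllerChannelManager code.
{code:java}
// Illustrative only: how a forwarding channel decides whether to retry.
switch (error) {
    case NOT_CONTROLLER:
    case NOT_LEADER_OR_FOLLOWER:
        // NOT_LEADER_OR_FOLLOWER handling is the extra logic this ticket calls out:
        // treat the raft "wrong node" error like NOT_CONTROLLER instead of failing.
        controllerProvider.invalidate();   // placeholder: rediscover the active controller
        scheduleRetry(request);
        break;
    default:
        completeExceptionally(error.exception());
}
{code}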



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14167) Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller

2022-08-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14167:

Priority: Blocker  (was: Major)

> Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller
> 
>
> Key: KAFKA-14167
> URL: https://issues.apache.org/jira/browse/KAFKA-14167
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 3.3.0
>
>
> In `ControllerApis`, we have callbacks such as the following after completion:
> {code:java}
>     controller.allocateProducerIds(context, allocatedProducerIdsRequest.data)
>       .handle[Unit] { (results, exception) =>
>         if (exception != null) {
>           requestHelper.handleError(request, exception)
>         } else {
>           requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs 
> => {
>             results.setThrottleTimeMs(requestThrottleMs)
>             new AllocateProducerIdsResponse(results)
>           })
>         }
>       } {code}
> What I see locally is that the underlying exception that gets passed to 
> `handle` always gets wrapped in a `CompletionException`. When passed to 
> `getErrorResponse`, this error will get converted to `UNKNOWN_SERVER_ERROR`. 
> For example, in this case, a `NOT_CONTROLLER` error returned from the 
> controller would be returned as `UNKNOWN_SERVER_ERROR`. It looks like there 
> are a few APIs that are potentially affected by this bug, such as 
> `DeleteTopics` and `UpdateFeatures`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14167) Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller

2022-08-15 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14167:
---

 Summary: Unexpected UNKNOWN_SERVER_ERROR raised from kraft 
controller
 Key: KAFKA-14167
 URL: https://issues.apache.org/jira/browse/KAFKA-14167
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


In `ControllerApis`, we have callbacks such as the following after completion:
{code:java}
    controller.allocateProducerIds(context, allocatedProducerIdsRequest.data)
      .handle[Unit] { (results, exception) =>
        if (exception != null) {
          requestHelper.handleError(request, exception)
        } else {
          requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs => 
{
            results.setThrottleTimeMs(requestThrottleMs)
            new AllocateProducerIdsResponse(results)
          })
        }
      } {code}
What I see locally is that the underlying exception that gets passed to 
`handle` always gets wrapped in a `CompletionException`. When passed to 
`getErrorResponse`, this error will get converted to `UNKNOWN_SERVER_ERROR`. 
For example, in this case, a `NOT_CONTROLLER` error returned from the 
controller would be returned as `UNKNOWN_SERVER_ERROR`. It looks like there are 
a few APIs that are potentially affected by this bug, such as `DeleteTopics` 
and `UpdateFeatures`.
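
One way to see the shape of a fix (a sketch only, not the committed patch): unwrap the CompletionException before mapping the cause to an error code, so that, for example, NOT_CONTROLLER is returned as NOT_CONTROLLER rather than UNKNOWN_SERVER_ERROR.
{code:java}
import java.util.concurrent.CompletionException;

final class ErrorUnwrapExample {
    // As described above, the exception handed to handle() is wrapped in a
    // CompletionException; map the error code from the underlying cause instead.
    static Throwable maybeUnwrap(Throwable t) {
        if (t instanceof CompletionException && t.getCause() != null) {
            return t.getCause();
        }
        return t;
    }
}
{code}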



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14167) Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller

2022-08-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14167:

Fix Version/s: 3.3.0

> Unexpected UNKNOWN_SERVER_ERROR raised from kraft controller
> 
>
> Key: KAFKA-14167
> URL: https://issues.apache.org/jira/browse/KAFKA-14167
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> In `ControllerApis`, we have callbacks such as the following after completion:
> {code:java}
>     controller.allocateProducerIds(context, allocatedProducerIdsRequest.data)
>       .handle[Unit] { (results, exception) =>
>         if (exception != null) {
>           requestHelper.handleError(request, exception)
>         } else {
>           requestHelper.sendResponseMaybeThrottle(request, requestThrottleMs 
> => {
>             results.setThrottleTimeMs(requestThrottleMs)
>             new AllocateProducerIdsResponse(results)
>           })
>         }
>       } {code}
> What I see locally is that the underlying exception that gets passed to 
> `handle` always gets wrapped in a `CompletionException`. When passed to 
> `getErrorResponse`, this error will get converted to `UNKNOWN_SERVER_ERROR`. 
> For example, in this case, a `NOT_CONTROLLER` error returned from the 
> controller would be returned as `UNKNOWN_SERVER_ERROR`. It looks like there 
> are a few APIs that are potentially affected by this bug, such as 
> `DeleteTopics` and `UpdateFeatures`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-13940) DescribeQuorum returns INVALID_REQUEST if not handled by leader

2022-08-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-13940:
---

Assignee: Jason Gustafson

> DescribeQuorum returns INVALID_REQUEST if not handled by leader
> ---
>
> Key: KAFKA-13940
> URL: https://issues.apache.org/jira/browse/KAFKA-13940
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>
> In `KafkaRaftClient.handleDescribeQuorum`, we currently return 
> INVALID_REQUEST if the node is not the current raft leader. This is 
> surprising and doesn't work with our general approach for retrying forwarded 
> APIs. In `BrokerToControllerChannelManager`, we only retry after 
> `NOT_CONTROLLER` errors. It would be more consistent with the other Raft APIs 
> if we returned NOT_LEADER_OR_FOLLOWER, but that also means we need additional 
> logic in `BrokerToControllerChannelManager` to handle that error and retry 
> correctly. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14154) Persistent URP after controller soft failure

2022-08-15 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14154.
-
Resolution: Fixed

> Persistent URP after controller soft failure
> 
>
> Key: KAFKA-14154
> URL: https://issues.apache.org/jira/browse/KAFKA-14154
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 3.3.0
>
>
> We ran into a scenario where a partition leader was unable to expand the ISR 
> after a soft controller failover. Here is what happened:
> Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as 
> the current controller.
> 1. Broker 1 loses its session in Zookeeper.  
> 2. Broker 2 becomes the new controller.
> 3. During initialization, controller 2 removes 1 from the ISR. So state is 
> updated: leader=2, isr=[2], leader epoch=11.
> 4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
> epoch=11.
> 5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
> still fetching and is caught up. However, the 
> `BrokerToControllerChannelManager` is still pointed at controller 1, so that 
> is where the `AlterPartition` is sent.
> 6. Controller 1 does not yet realize that it is not the controller, so it 
> processes the `AlterPartition` request. It sees the leader epoch of 11, which 
> is higher than what it has in its own context. Following changes to the 
> `AlterPartition` validation in 
> [https://github.com/apache/kafka/pull/12032/files], the controller returns 
> FENCED_LEADER_EPOCH.
> 7. After receiving the FENCED_LEADER_EPOCH from the old controller, the 
> leader is stuck because it assumes that the error implies that another 
> LeaderAndIsr request should be sent.
> Prior to 
> [https://github.com/apache/kafka/pull/12032/files],
>  the way we handled this case was a little different. We only verified that 
> the leader epoch in the request was at least as large as the current epoch in 
> the controller context. Anything higher was accepted. The controller would 
> have attempted to write the updated state to Zookeeper. This update would 
> have failed because of the controller epoch check, however, we would have 
> returned NOT_CONTROLLER in this case, which is handled in 
> `AlterPartitionManager`.
> It is tempting to revert the logic, but the risk is in the idempotency check: 
> [https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369.]
>  If the AlterPartition request happened to match the state inside the old 
> controller, the controller would consider the update successful and return no 
> error. But if its state was already stale at that point, then that might 
> cause the leader to incorrectly assume that the state had been updated.
> One way to fix this problem without weakening the validation is to rely on 
> the controller epoch in `AlterPartitionManager`. When we discover a new 
> controller, we also discover its epoch, so we can pass that through. The 
> `LeaderAndIsr` request already includes the controller epoch of the 
> controller that sent it and we already propagate this through to 
> `AlterPartition.submit`. Hence all we need to do is verify that the epoch of 
> the current controller target is at least as large as the one discovered 
> through the `LeaderAndIsr`.
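
A pseudocode sketch of the epoch check proposed in the last paragraph; the names below are placeholders, not the actual AlterPartitionManager code.
{code:java}
// Illustrative only: send AlterPartition only to a controller whose epoch is at
// least as new as the one learned from the LeaderAndIsr request behind this item.
if (discoveredControllerEpoch >= item.controllerEpochFromLeaderAndIsr()) {
    sendAlterPartition(discoveredControllerNode, item);
} else {
    // The cached controller is stale (e.g. the old controller in step 6 above);
    // wait until a newer controller is discovered before sending.
    requeue(item);
}
{code}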



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14166) Consistent toString implementations for byte arrays in generated messages

2022-08-15 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14166:
---

 Summary: Consistent toString implementations for byte arrays in 
generated messages
 Key: KAFKA-14166
 URL: https://issues.apache.org/jira/browse/KAFKA-14166
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson


In the generated `toString()` implementations for message objects (such as 
protocol RPCs), we are a little inconsistent in how we display types with raw 
bytes. If the type is `Array[Byte]`, then we use Arrays.toString. If the type 
is `ByteBuffer` (i.e. when `zeroCopy` is set), then we use the corresponding 
`ByteBuffer.toString`, which is not often useful. We should try to be 
consistent. By default, it is probably not useful to print the full array 
contents, but we might print summary information (e.g. size, checksum).
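
As a rough illustration of what a consistent summary-style rendering could look
like (this is only a sketch; the helper below is hypothetical and not part of
the message generator), both representations could be reduced to a size plus a
checksum:
{code:java}
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper, not generator output: render raw bytes as a summary
// ("bytes(size=..., crc32=...)") instead of dumping the full contents.
public class ByteSummary {
    public static String summarize(byte[] data) {
        if (data == null) return "null";
        CRC32 crc = new CRC32();
        crc.update(data);
        return "bytes(size=" + data.length + ", crc32=" + crc.getValue() + ")";
    }

    // Same summary for ByteBuffer fields (the zeroCopy case); duplicate() keeps
    // the caller's buffer position untouched.
    public static String summarize(ByteBuffer data) {
        if (data == null) return "null";
        CRC32 crc = new CRC32();
        crc.update(data.duplicate());
        return "bytes(size=" + data.remaining() + ", crc32=" + crc.getValue() + ")";
    }
}
{code}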



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13986) DescribeQuorum does not return the observers (brokers) for the Metadata log

2022-08-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-13986.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

> DescribeQuorum does not return the observers (brokers) for the Metadata log
> ---
>
> Key: KAFKA-13986
> URL: https://issues.apache.org/jira/browse/KAFKA-13986
> Project: Kafka
>  Issue Type: Improvement
>  Components: kraft
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Niket Goel
>Assignee: Niket Goel
>Priority: Major
> Fix For: 3.3.0
>
>
> h2. Background
> While working on the [PR|https://github.com/apache/kafka/pull/12206] for 
> KIP-836, we realized that the `DescribeQuorum` API does not return the 
> brokers as observers for the metadata log.
> As noted by [~dengziming] :
> _We set nodeId=-1 if it's a broker so observers.size==0_
> The related code is:
> [https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/core/src/main/scala/kafka/raft/RaftManager.scala#L185-L189]
> {code:java}
> val nodeId = if (config.processRoles.contains(ControllerRole)) {
>   OptionalInt.of(config.nodeId)
> } else {
>   OptionalInt.empty()
> }
> {code}
> h2. ToDo
> We should fix this and have the DescribeQuorum API return the brokers as 
> observers for the metadata log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14163) Build failing in streams-scala:compileScala due to zinc compiler cache

2022-08-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14163.
-
Resolution: Workaround

> Build failing in streams-scala:compileScala due to zinc compiler cache
> --
>
> Key: KAFKA-14163
> URL: https://issues.apache.org/jira/browse/KAFKA-14163
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> I have been seeing builds failing recently with the following error:
> {code:java}
> [2022-08-11T17:08:22.279Z] * What went wrong:
> [2022-08-11T17:08:22.279Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-11T17:08:22.279Z]   Owner PID: 25008
> [2022-08-11T17:08:22.279Z]   Our PID: 25524
> [2022-08-11T17:08:22.279Z]   Owner Operation: 
> [2022-08-11T17:08:22.279Z]   Our operation: 
> [2022-08-11T17:08:22.279Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>    {code}
> And another:
> {code:java}
> [2022-08-10T21:30:41.779Z] * What went wrong:
> [2022-08-10T21:30:41.779Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-10T21:30:41.779Z]   Owner PID: 11022
> [2022-08-10T21:30:41.779Z]   Our PID: 11766
> [2022-08-10T21:30:41.779Z]   Owner Operation: 
> [2022-08-10T21:30:41.779Z]   Our operation: 
> [2022-08-10T21:30:41.779Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14163) Build failing in streams-scala:compileScala due to zinc compiler cache

2022-08-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14163:

Summary: Build failing in streams-scala:compileScala due to zinc compiler 
cache  (was: Build failing in scala:compileScala due to zinc compiler cache)

> Build failing in streams-scala:compileScala due to zinc compiler cache
> --
>
> Key: KAFKA-14163
> URL: https://issues.apache.org/jira/browse/KAFKA-14163
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> I have been seeing builds failing recently with the following error:
> {code:java}
> [2022-08-11T17:08:22.279Z] * What went wrong:
> [2022-08-11T17:08:22.279Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-11T17:08:22.279Z]   Owner PID: 25008
> [2022-08-11T17:08:22.279Z]   Our PID: 25524
> [2022-08-11T17:08:22.279Z]   Owner Operation: 
> [2022-08-11T17:08:22.279Z]   Our operation: 
> [2022-08-11T17:08:22.279Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>    {code}
> And another:
> {code:java}
> [2022-08-10T21:30:41.779Z] * What went wrong:
> [2022-08-10T21:30:41.779Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-10T21:30:41.779Z]   Owner PID: 11022
> [2022-08-10T21:30:41.779Z]   Our PID: 11766
> [2022-08-10T21:30:41.779Z]   Owner Operation: 
> [2022-08-10T21:30:41.779Z]   Our operation: 
> [2022-08-10T21:30:41.779Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14163) Build failing in scala:compileScala due to zinc compiler cache

2022-08-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14163:

Description: 
I have been seeing builds failing recently with the following error:

 
{code:java}
[2022-08-11T17:08:22.279Z] * What went wrong:
[2022-08-11T17:08:22.279Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-11T17:08:22.279Z]   Owner PID: 25008
[2022-08-11T17:08:22.279Z]   Our PID: 25524
[2022-08-11T17:08:22.279Z]   Owner Operation: 
[2022-08-11T17:08:22.279Z]   Our operation: 
[2022-08-11T17:08:22.279Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
   {code}
And another:
{code:java}
[2022-08-10T21:30:41.779Z] * What went wrong:
[2022-08-10T21:30:41.779Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-10T21:30:41.779Z]   Owner PID: 11022
[2022-08-10T21:30:41.779Z]   Our PID: 11766
[2022-08-10T21:30:41.779Z]   Owner Operation: 
[2022-08-10T21:30:41.779Z]   Our operation: 
[2022-08-10T21:30:41.779Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
 {code}

  was:
I have been seeing builds failing recently with the following error:

 
{code:java}
[2022-08-10T21:30:41.779Z] * What went wrong: 

[2022-08-10T21:30:41.779Z] Execution failed for task 
':streams:streams-scala:compileScala'. 

[2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
  {code}


 


> Build failing in scala:compileScala due to zinc compiler cache
> --
>
> Key: KAFKA-14163
> URL: https://issues.apache.org/jira/browse/KAFKA-14163
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> I have been seeing builds failing recently with the following error:
>  
> {code:java}
> [2022-08-11T17:08:22.279Z] * What went wrong:
> [2022-08-11T17:08:22.279Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-11T17:08:22.279Z]   Owner PID: 25008
> [2022-08-11T17:08:22.279Z]   Our PID: 25524
> [2022-08-11T17:08:22.279Z]   Owner Operation: 
> [2022-08-11T17:08:22.279Z]   Our operation: 
> [2022-08-11T17:08:22.279Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>    {code}
> And another:
> {code:java}
> [2022-08-10T21:30:41.779Z] * What went wrong:
> [2022-08-10T21:30:41.779Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-10T21:30:41.779Z]   Owner PID: 11022
> [2022-08-10T21:30:41.779Z]   Our PID: 11766
> [2022-08-10T21:30:41.779Z]   Owner Operation: 
> [2022-08-10T21:30:41.779Z]   Our operation: 
> [2022-08-10T21:30:41.779Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14163) Build failing in scala:compileScala due to zinc compiler cache

2022-08-11 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14163:

Description: 
I have been seeing builds failing recently with the following error:
{code:java}
[2022-08-11T17:08:22.279Z] * What went wrong:
[2022-08-11T17:08:22.279Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-11T17:08:22.279Z]   Owner PID: 25008
[2022-08-11T17:08:22.279Z]   Our PID: 25524
[2022-08-11T17:08:22.279Z]   Owner Operation: 
[2022-08-11T17:08:22.279Z]   Our operation: 
[2022-08-11T17:08:22.279Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
   {code}
And another:
{code:java}
[2022-08-10T21:30:41.779Z] * What went wrong:
[2022-08-10T21:30:41.779Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-10T21:30:41.779Z]   Owner PID: 11022
[2022-08-10T21:30:41.779Z]   Our PID: 11766
[2022-08-10T21:30:41.779Z]   Owner Operation: 
[2022-08-10T21:30:41.779Z]   Our operation: 
[2022-08-10T21:30:41.779Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
 {code}

  was:
I have been seeing builds failing recently with the following error:

 
{code:java}
[2022-08-11T17:08:22.279Z] * What went wrong:
[2022-08-11T17:08:22.279Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-11T17:08:22.279Z]   Owner PID: 25008
[2022-08-11T17:08:22.279Z]   Our PID: 25524
[2022-08-11T17:08:22.279Z]   Owner Operation: 
[2022-08-11T17:08:22.279Z]   Our operation: 
[2022-08-11T17:08:22.279Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
   {code}
And another:
{code:java}
[2022-08-10T21:30:41.779Z] * What went wrong:
[2022-08-10T21:30:41.779Z] Execution failed for task 
':streams:streams-scala:compileScala'.
[2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
[2022-08-10T21:30:41.779Z]   Owner PID: 11022
[2022-08-10T21:30:41.779Z]   Our PID: 11766
[2022-08-10T21:30:41.779Z]   Owner Operation: 
[2022-08-10T21:30:41.779Z]   Our operation: 
[2022-08-10T21:30:41.779Z]   Lock file: 
/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
 {code}


> Build failing in scala:compileScala due to zinc compiler cache
> --
>
> Key: KAFKA-14163
> URL: https://issues.apache.org/jira/browse/KAFKA-14163
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
>
> I have been seeing builds failing recently with the following error:
> {code:java}
> [2022-08-11T17:08:22.279Z] * What went wrong:
> [2022-08-11T17:08:22.279Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-11T17:08:22.279Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-11T17:08:22.279Z]   Owner PID: 25008
> [2022-08-11T17:08:22.279Z]   Our PID: 25524
> [2022-08-11T17:08:22.279Z]   Owner Operation: 
> [2022-08-11T17:08:22.279Z]   Our operation: 
> [2022-08-11T17:08:22.279Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>    {code}
> And another:
> {code:java}
> [2022-08-10T21:30:41.779Z] * What went wrong:
> [2022-08-10T21:30:41.779Z] Execution failed for task 
> ':streams:streams-scala:compileScala'.
> [2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
> compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It 
> is currently in use by another Gradle instance.
> [2022-08-10T21:30:41.779Z]   Owner PID: 11022
> [2022-08-10T21:30:41.779Z]   Our PID: 11766
> [2022-08-10T21:30:41.779Z]   Owner Operation: 
> [2022-08-10T21:30:41.779Z]   Our operation: 
> [2022-08-10T21:30:41.779Z]   Lock file: 
> /home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17/zinc-1.6.1_2.13.8_17.lock
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14163) Build failing in scala:compileScala due to zinc compiler cache

2022-08-11 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14163:
---

 Summary: Build failing in scala:compileScala due to zinc compiler 
cache
 Key: KAFKA-14163
 URL: https://issues.apache.org/jira/browse/KAFKA-14163
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


I have been seeing builds failing recently with the following error:

 
{code:java}
[2022-08-10T21:30:41.779Z] * What went wrong: 

[2022-08-10T21:30:41.779Z] Execution failed for task 
':streams:streams-scala:compileScala'. 

[2022-08-10T21:30:41.779Z] > Timeout waiting to lock zinc-1.6.1_2.13.8_17 
compiler cache (/home/jenkins/.gradle/caches/7.5.1/zinc-1.6.1_2.13.8_17). It is 
currently in use by another Gradle instance.
  {code}


 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14154) Persistent URP after controller soft failure

2022-08-09 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14154:

Description: 
We ran into a scenario where a partition leader was unable to expand the ISR 
after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the 
current controller.

1. Broker 1 loses its session in Zookeeper.  

2. Broker 2 becomes the new controller.

3. During initialization, controller 2 removes 1 from the ISR. So state is 
updated: leader=2, isr=[2], leader epoch=11.

4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
epoch=11.

5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
still fetching and is caught up. However, the 
`BrokerToControllerChannelManager` is still pointed at controller 1, so that is 
where the `AlterPartition` is sent.

6. Controller 1 does not yet realize that it is not the controller, so it 
processes the `AlterPartition` request. It sees the leader epoch of 11, which 
is higher than what it has in its own context. Following changes to the 
`AlterPartition` validation in 
[https://github.com/apache/kafka/pull/12032/files,] the controller returns 
FENCED_LEADER_EPOCH.

7. After receiving the FENCED_LEADER_EPOCH from the old controller, the leader 
is stuck because it assumes that the error implies that another LeaderAndIsr 
request should be sent.

Prior to 
[https://github.com/apache/kafka/pull/12032/files|https://github.com/apache/kafka/pull/12032/files,],
 the way we handled this case was a little different. We only verified that the 
leader epoch in the request was at least as large as the current epoch in the 
controller context. Anything higher was accepted. The controller would have 
attempted to write the updated state to Zookeeper. This update would have 
failed because of the controller epoch check, however, we would have returned 
NOT_CONTROLLER in this case, which is handled in `AlterPartitionManager`.

It is tempting to revert the logic, but the risk is in the idempotency check: 
[https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369.]
 If the AlterPartition request happened to match the state inside the old 
controller, the controller would consider the update successful and return no 
error. But if its state was already stale at that point, then that might cause 
the leader to incorrectly assume that the state had been updated.

One way to fix this problem without weakening the validation is to rely on the 
controller epoch in `AlterPartitionManager`. When we discover a new controller, 
we also discover its epoch, so we can pass that through. The `LeaderAndIsr` 
request already includes the controller epoch of the controller that sent it 
and we already propagate this through to `AlterPartition.submit`. Hence all we 
need to do is verify that the epoch of the current controller target is at 
least as large as the one discovered through the `LeaderAndIsr`.

  was:
We ran into a scenario where a partition leader was unable to expand the ISR 
after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the 
current controller.

1. Broker 1 loses its session in Zookeeper.  

2. Broker 2 becomes the new controller.

3. During initialization, controller 2 removes 1 from the ISR. So state is 
updated: leader=2, isr=[1, 2], leader epoch=11.

4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
epoch=11.

5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
still fetching and is caught up. However, the 
`BrokerToControllerChannelManager` is still pointed at controller 1, so that is 
where the `AlterPartition` is sent.

6. Controller 1 does not yet realize that it is not the controller, so it 
processes the `AlterPartition` request. It sees the leader epoch of 11, which 
is higher than what it has in its own context. Following changes to the 
`AlterPartition` validation in 
[https://github.com/apache/kafka/pull/12032/files,] the controller returns 
FENCED_LEADER_EPOCH.

7. After receiving the FENCED_LEADER_EPOCH from the old controller, the leader 
is stuck because it assumes that the error implies that another LeaderAndIsr 
request should be sent.

Prior to 
[https://github.com/apache/kafka/pull/12032/files|https://github.com/apache/kafka/pull/12032/files,],
 the way we handled this case was a little different. We only verified that the 
leader epoch in the request was at least as large as the current epoch in the 
controller context. Anything higher was accepted. The controller would have 
attempted to write the updated state to Zookeeper. This update would have 
failed because of the controller epoch check, however, we would have returned 
NOT_CONTROLLER 

[jira] [Updated] (KAFKA-14154) Persistent URP after controller soft failure

2022-08-09 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14154:

Fix Version/s: 3.3.0

> Persistent URP after controller soft failure
> 
>
> Key: KAFKA-14154
> URL: https://issues.apache.org/jira/browse/KAFKA-14154
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 3.3.0
>
>
> We ran into a scenario where a partition leader was unable to expand the ISR 
> after a soft controller failover. Here is what happened:
> Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as 
> the current controller.
> 1. Broker 1 loses its session in Zookeeper.  
> 2. Broker 2 becomes the new controller.
> 3. During initialization, controller 2 removes 1 from the ISR. So state is 
> updated: leader=2, isr=[1, 2], leader epoch=11.
> 4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
> epoch=11.
> 5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
> still fetching and is caught up. However, the 
> `BrokerToControllerChannelManager` is still pointed at controller 1, so that 
> is where the `AlterPartition` is sent.
> 6. Controller 1 does not yet realize that it is not the controller, so it 
> processes the `AlterPartition` request. It sees the leader epoch of 11, which 
> is higher than what it has in its own context. Following changes to the 
> `AlterPartition` validation in 
> [https://github.com/apache/kafka/pull/12032/files,] the controller returns 
> FENCED_LEADER_EPOCH.
> 7. After receiving the FENCED_LEADER_EPOCH from the old controller, the 
> leader is stuck because it assumes that the error implies that another 
> LeaderAndIsr request should be sent.
> Prior to 
> [https://github.com/apache/kafka/pull/12032/files|https://github.com/apache/kafka/pull/12032/files,],
>  the way we handled this case was a little different. We only verified that 
> the leader epoch in the request was at least as large as the current epoch in 
> the controller context. Anything higher was accepted. The controller would 
> have attempted to write the updated state to Zookeeper. This update would 
> have failed because of the controller epoch check, however, we would have 
> returned NOT_CONTROLLER in this case, which is handled in 
> `AlterPartitionManager`.
> It is tempting to revert the logic, but the risk is in the idempotency check: 
> [https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369.]
>  If the AlterPartition request happened to match the state inside the old 
> controller, the controller would consider the update successful and return no 
> error. But if its state was already stale at that point, then that might 
> cause the leader to incorrectly assume that the state had been updated.
> One way to fix this problem without weakening the validation is to rely on 
> the controller epoch in `AlterPartitionManager`. When we discover a new 
> controller, we also discover its epoch, so we can pass that through. The 
> `LeaderAndIsr` request already includes the controller epoch of the 
> controller that sent it and we already propagate this through to 
> `AlterPartition.submit`. Hence all we need to do is verify that the epoch of 
> the current controller target is at least as large as the one discovered 
> through the `LeaderAndIsr`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14154) Persistent URP after controller soft failure

2022-08-09 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14154:
---

 Summary: Persistent URP after controller soft failure
 Key: KAFKA-14154
 URL: https://issues.apache.org/jira/browse/KAFKA-14154
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson


We ran into a scenario where a partition leader was unable to expand the ISR 
after a soft controller failover. Here is what happened:

Initial state: leader=1, isr=[1,2], leader epoch=10. Broker 1 is acting as the 
current controller.

1. Broker 1 loses its session in Zookeeper.  

2. Broker 2 becomes the new controller.

3. During initialization, controller 2 removes 1 from the ISR. So state is 
updated: leader=2, isr=[1, 2], leader epoch=11.

4. Broker 2 receives `LeaderAndIsr` from the new controller with leader 
epoch=11.

5. Broker 2 immediately tries to add replica 1 back to the ISR since it is 
still fetching and is caught up. However, the 
`BrokerToControllerChannelManager` is still pointed at controller 1, so that is 
where the `AlterPartition` is sent.

6. Controller 1 does not yet realize that it is not the controller, so it 
processes the `AlterPartition` request. It sees the leader epoch of 11, which 
is higher than what it has in its own context. Following changes to the 
`AlterPartition` validation in 
[https://github.com/apache/kafka/pull/12032/files,] the controller returns 
FENCED_LEADER_EPOCH.

7. After receiving the FENCED_LEADER_EPOCH from the old controller, the leader 
is stuck because it assumes that the error implies that another LeaderAndIsr 
request should be sent.

Prior to 
[https://github.com/apache/kafka/pull/12032/files|https://github.com/apache/kafka/pull/12032/files,],
 the way we handled this case was a little different. We only verified that the 
leader epoch in the request was at least as large as the current epoch in the 
controller context. Anything higher was accepted. The controller would have 
attempted to write the updated state to Zookeeper. This update would have 
failed because of the controller epoch check, however, we would have returned 
NOT_CONTROLLER in this case, which is handled in `AlterPartitionManager`.

It is tempting to revert the logic, but the risk is in the idempotency check: 
[https://github.com/apache/kafka/pull/12032/files#diff-3e042c962e80577a4cc9bbcccf0950651c6b312097a86164af50003c00c50d37L2369.]
 If the AlterPartition request happened to match the state inside the old 
controller, the controller would consider the update successful and return no 
error. But if its state was already stale at that point, then that might cause 
the leader to incorrectly assume that the state had been updated.

One way to fix this problem without weakening the validation is to rely on the 
controller epoch in `AlterPartitionManager`. When we discover a new controller, 
we also discover its epoch, so we can pass that through. The `LeaderAndIsr` 
request already includes the controller epoch of the controller that sent it 
and we already propagate this through to `AlterPartition.submit`. Hence all we 
need to do is verify that the epoch of the current controller target is at 
least as large as the one discovered through the `LeaderAndIsr`.
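
A minimal sketch of that last idea, assuming a hypothetical guard inside
`AlterPartitionManager` (the names below are made up for illustration and are
not the actual code):
{code:java}
// Hypothetical sketch: remember the epoch of the controller we discovered and
// refuse to submit an AlterPartition that was triggered by a LeaderAndIsr from
// a newer controller than the one we are currently talking to.
public class ControllerEpochGuard {
    private volatile int discoveredControllerEpoch = -1;

    // Called whenever a new controller (and its epoch) is discovered.
    public void onControllerDiscovered(int controllerEpoch) {
        discoveredControllerEpoch = Math.max(discoveredControllerEpoch, controllerEpoch);
    }

    // Called before sending an AlterPartition item; leaderAndIsrControllerEpoch
    // is the controller epoch carried by the LeaderAndIsr request that
    // triggered the ISR change.
    public boolean canSubmit(int leaderAndIsrControllerEpoch) {
        return discoveredControllerEpoch >= leaderAndIsrControllerEpoch;
    }
}
{code}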



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14152) Add logic to fence kraft brokers which have fallen behind in replication

2022-08-09 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14152:
---

 Summary: Add logic to fence kraft brokers which have fallen behind 
in replication
 Key: KAFKA-14152
 URL: https://issues.apache.org/jira/browse/KAFKA-14152
 Project: Kafka
  Issue Type: Improvement
Reporter: Jason Gustafson


When a kraft broker registers with the controller, it must catch up to the 
current metadata before it is unfenced. However, once it has been unfenced, it 
only needs to continue sending heartbeats to remain unfenced. It can fall 
arbitrarily behind in the replication of the metadata log and remain unfenced. 
We should consider whether there is an inverse condition that we can use to 
fence a broker that has fallen behind.
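
One possible shape for such an inverse condition, purely as a sketch (the names
and the lag bound below are hypothetical; the ticket only proposes considering
such a check), would be to compare the broker's last applied metadata offset
against the committed offset on each heartbeat:
{code:java}
// Hypothetical check: fence an unfenced broker whose metadata replication lag
// has exceeded a configured bound.
public class MetadataLagFencer {
    private final long maxAllowedLagRecords;  // hypothetical config value

    public MetadataLagFencer(long maxAllowedLagRecords) {
        this.maxAllowedLagRecords = maxAllowedLagRecords;
    }

    // committedMetadataOffset: the high watermark of the metadata log.
    // lastAppliedMetadataOffset: the offset the broker reported in its heartbeat.
    public boolean shouldFence(long committedMetadataOffset, long lastAppliedMetadataOffset) {
        return committedMetadataOffset - lastAppliedMetadataOffset > maxAllowedLagRecords;
    }
}
{code}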



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14144) AlterPartition is not idempotent when requests time out

2022-08-09 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14144.
-
Resolution: Fixed

> AlterPartition is not idempotent when requests time out
> ---
>
> Key: KAFKA-14144
> URL: https://issues.apache.org/jira/browse/KAFKA-14144
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: David Mao
>Assignee: David Mao
>Priority: Blocker
> Fix For: 3.3.0
>
>
> [https://github.com/apache/kafka/pull/12032] changed the validation order of 
> AlterPartition requests to fence requests with a stale partition epoch before 
> we compare the leader and ISR contents.
> This results in a loss of idempotency if a leader does not receive an 
> AlterPartition response because retries will receive an 
> INVALID_UPDATE_VERSION error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14104) Perform CRC validation on KRaft Batch Records and Snapshots

2022-08-08 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14104.
-
Resolution: Fixed

> Perform CRC validation on KRaft Batch Records and Snapshots
> ---
>
> Key: KAFKA-14104
> URL: https://issues.apache.org/jira/browse/KAFKA-14104
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Niket Goel
>Assignee: Niket Goel
>Priority: Blocker
> Fix For: 3.3.0
>
>
> Today we stamp the record batch header with a CRC [1] and verify that CRC 
> before the log is written to disk [2]. The CRC should also be verified when 
> the records are read back from disk. The same procedure should be followed 
> for KRaft snapshots as well.
> [1] 
> [https://github.com/apache/kafka/blob/6b76c01cf895db0651e2cdcc07c2c392f00a8ceb/clients/src/main/java/org/apache/kafka/common/record/DefaultRecordBatch.java#L501=]
>  
> [2] 
> [https://github.com/apache/kafka/blob/679e9e0cee67e7d3d2ece204a421ea7da31d73e9/core/src/main/scala/kafka/log/UnifiedLog.scala#L1143]
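
For the read path, the public record API already exposes a per-batch validity
check that verifies the stored CRC; a minimal sketch of validating batches on
read (illustrative only, not the actual change) could look like:
{code:java}
import org.apache.kafka.common.record.MemoryRecords;
import org.apache.kafka.common.record.MutableRecordBatch;

// Sketch: walk every batch in a set of records read back from disk and let
// ensureValid() throw if the stored CRC does not match the batch contents.
public class BatchCrcValidator {
    public static void validate(MemoryRecords records) {
        for (MutableRecordBatch batch : records.batches()) {
            batch.ensureValid();
        }
    }
}
{code}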



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2022-08-03 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14139:

Description: 
We have been thinking about disk failure cases recently. Suppose that a disk 
has failed and the user needs to restart the disk from an empty state. The 
concern is whether this can lead to the unnecessary loss of committed data.

For normal topic partitions, removal from the ISR during controlled shutdown 
buys us some protection. After the replica is restarted, it must prove its 
state to the leader before it can be added back to the ISR. And it cannot 
become a leader until it does so.

An obvious exception to this is when the replica is the last member in the ISR. 
In this case, the disk failure itself has compromised the committed data, so 
some amount of loss must be expected.

We have been considering other scenarios in which the loss of one disk can lead 
to data loss even when there are replicas remaining which have all of the 
committed entries. One such scenario is this:

Suppose we have a partition with two replicas: A and B. Initially A is the 
leader and it is the only member of the ISR.
 # Broker B catches up to A, so A attempts to send an AlterPartition request to 
the controller to add B into the ISR.
 # Before the AlterPartition request is received, replica B has a hard failure.
 # The current controller successfully fences broker B. It takes no action on 
this partition since B is already out of the ISR.
 # Before the controller receives the AlterPartition request to add B, it also 
fails.
 # While the new controller is initializing, suppose that replica B finishes 
startup, but the disk has been replaced (all of the previous state has been 
lost).
 # The new controller sees the registration from broker B first.
 # Finally, the AlterPartition from A arrives which adds B back into the ISR 
even though it has an empty log.

(Credit for coming up with this scenario goes to [~junrao] .)

I tested this in KRaft and confirmed that this sequence is possible (even if 
perhaps unlikely). There are a few ways we could have potentially detected the 
issue. First, perhaps the leader should have bumped the leader epoch on all 
partitions when B was fenced. Then the inflight AlterPartition would be doomed 
no matter when it arrived.

Alternatively, we could have relied on the broker epoch to distinguish the dead 
broker's state from that of the restarted broker. This could be done by 
including the broker epoch in both the `Fetch` request and in `AlterPartition`.

Finally, perhaps even normal kafka replication should be using a unique 
identifier for each disk so that we can reliably detect when it has changed. 
For example, something like what was proposed for the metadata quorum here: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes.]
 

  was:
We have been thinking about disk failure cases recently. Suppose that a disk 
has failed and the user needs to restart the disk from an empty state. The 
concern is whether this can lead to the unnecessary loss of committed data.

For normal topic partitions, removal from the ISR during controlled shutdown 
buys us some protection. After the replica is restarted, it must prove its 
state to the leader before it can be added back to the ISR. And it cannot 
become a leader until it does so.

An obvious exception to this is when the replica is the last member in the ISR. 
In this case, the disk failure itself has compromised the committed data, so 
some amount of loss must be expected.

We have been considering other scenarios in which the loss of one disk can lead 
to data loss even when there are replicas remaining which have all of the 
committed entries. One such scenario is this:

Suppose we have a partition with two replicas: A and B. Initially A is the 
leader and it is the only member of the ISR.
 # Broker B catches up to A, so A attempts to send an AlterPartition request to 
the controller to add B into the ISR.
 # Before the AlterPartition request is received, replica B has a hard failure.
 # The current controller successfully fences broker B. It takes no action on 
this partition since B is already out of the ISR.
 # Before the controller receives the AlterPartition request to add B, it also 
fails.
 # While the new controller is initializing, suppose that replica B finishes 
startup, but the disk has been replaced (all of the previous state has been 
lost).
 # The new controller sees the registration from broker B first.
 # Finally, the AlterPartition from A arrives which adds B back into the ISR.

(Credit for coming up with this scenario goes to [~junrao] .)

I tested this in KRaft and confirmed that this sequence is possible (even if 
perhaps unlikely). There are a few ways we could have potentially detected the 
issue. First, perhaps the leader should have bumped the leader epoch on all 

[jira] [Created] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2022-08-03 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14139:
---

 Summary: Replaced disk can lead to loss of committed data even 
with non-empty ISR
 Key: KAFKA-14139
 URL: https://issues.apache.org/jira/browse/KAFKA-14139
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson


We have been thinking about disk failure cases recently. Suppose that a disk 
has failed and the user needs to restart the disk from an empty state. The 
concern is whether this can lead to the unnecessary loss of committed data.

For normal topic partitions, removal from the ISR during controlled shutdown 
buys us some protection. After the replica is restarted, it must prove its 
state to the leader before it can be added back to the ISR. And it cannot 
become a leader until it does so.

An obvious exception to this is when the replica is the last member in the ISR. 
In this case, the disk failure itself has compromised the committed data, so 
some amount of loss must be expected.

We have been considering other scenarios in which the loss of one disk can lead 
to data loss even when there are replicas remaining which have all of the 
committed entries. One such scenario is this:

Suppose we have a partition with two replicas: A and B. Initially A is the 
leader and it is the only member of the ISR.
 # Broker B catches up to A, so A attempts to send an AlterPartition request to 
the controller to add B into the ISR.
 # Before the AlterPartition request is received, replica B has a hard failure.
 # The current controller successfully fences broker B. It takes no action on 
this partition since B is already out of the ISR.
 # Before the controller receives the AlterPartition request to add B, it also 
fails.
 # While the new controller is initializing, suppose that replica B finishes 
startup, but the disk has been replaced (all of the previous state has been 
lost).
 # The new controller sees the registration from broker B first.
 # Finally, the AlterPartition from A arrives which adds B back into the ISR.

(Credit for coming up with this scenario goes to [~junrao] .)

I tested this in KRaft and confirmed that this sequence is possible (even if 
perhaps unlikely). There are a few ways we could have potentially detected the 
issue. First, perhaps the leader should have bumped the leader epoch on all 
partitions when B was fenced. Then the inflight AlterPartition would be doomed 
no matter when it arrived.

Alternatively, we could have relied on the broker epoch to distinguish the dead 
broker's state from that of the restarted broker. This could be done by 
including the broker epoch in both the `Fetch` request and in `AlterPartition`.

Finally, perhaps even normal kafka replication should be using a unique 
identifier for each disk so that we can reliably detect when it has changed. 
For example, something like what was proposed for the metadata quorum here: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes.]
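
A rough sketch of the broker-epoch alternative mentioned above (the names below
are hypothetical and this is not the actual controller code): the controller
would reject an ISR addition whose broker epoch no longer matches the broker's
current registration, since a mismatch means the broker restarted in between.
{code:java}
import java.util.Map;

// Hypothetical controller-side check. brokerEpochFromLeader is the broker epoch
// the leader observed for the follower via Fetch and forwarded in AlterPartition.
public class IsrAdditionValidator {
    private final Map<Integer, Long> liveBrokerEpochs;  // brokerId -> current broker epoch

    public IsrAdditionValidator(Map<Integer, Long> liveBrokerEpochs) {
        this.liveBrokerEpochs = liveBrokerEpochs;
    }

    // Returns true if the proposed ISR member's epoch is stale and the
    // AlterPartition request should be rejected.
    public boolean isStale(int brokerId, long brokerEpochFromLeader) {
        Long currentEpoch = liveBrokerEpochs.get(brokerId);
        return currentEpoch == null || currentEpoch != brokerEpochFromLeader;
    }
}
{code}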
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14078) Replica fetches to follower should return NOT_LEADER error

2022-07-25 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14078.
-
Resolution: Fixed

> Replica fetches to follower should return NOT_LEADER error
> --
>
> Key: KAFKA-14078
> URL: https://issues.apache.org/jira/browse/KAFKA-14078
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 3.3.0
>
>
> After the fix for KAFKA-13837, if a follower receives a request from another 
> replica, it will return UNKNOWN_LEADER_EPOCH even if the leader epoch 
> matches. We need to do the leader/epoch validation first before we check 
> whether we have a valid replica.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14078) Replica fetches to follower should return NOT_LEADER error

2022-07-14 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14078:
---

 Summary: Replica fetches to follower should return NOT_LEADER error
 Key: KAFKA-14078
 URL: https://issues.apache.org/jira/browse/KAFKA-14078
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson
 Fix For: 3.3.0


After the fix for KAFKA-13837, if a follower receives a request from another 
replica, it will return UNKNOWN_LEADER_EPOCH even if the leader epoch matches. 
We need to do the leader/epoch validation first before we check whether we 
have a valid replica.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-13888) KIP-836: Addition of Information in DescribeQuorumResponse about Voter Lag

2022-07-14 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-13888:

Priority: Blocker  (was: Major)

> KIP-836: Addition of Information in DescribeQuorumResponse about Voter Lag
> --
>
> Key: KAFKA-13888
> URL: https://issues.apache.org/jira/browse/KAFKA-13888
> Project: Kafka
>  Issue Type: Improvement
>  Components: kraft
>Reporter: Niket Goel
>Assignee: lqjacklee
>Priority: Blocker
> Fix For: 3.3.0
>
>
> Tracking issue for the implementation of KIP-836



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14077) KRaft should support recovery from failed disk

2022-07-14 Thread Jason Gustafson (Jira)
Jason Gustafson created KAFKA-14077:
---

 Summary: KRaft should support recovery from failed disk
 Key: KAFKA-14077
 URL: https://issues.apache.org/jira/browse/KAFKA-14077
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
 Fix For: 3.3.0


If one of the nodes in the metadata quorum has a disk failure, there is no way 
currently to safely bring the node back into the quorum. When we lose disk 
state, we are at risk of losing committed data even if the failure only affects 
a minority of the cluster.

Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and 
v3. Initially, v1 is the leader and writes a record at offset 1. After v2 
acknowledges replication of the record, it becomes committed. Suppose that v1 
fails before v3 has a chance to replicate this record. As long as v1 remains 
down, the raft protocol guarantees that only v2 can become leader, so the 
record cannot be lost. The raft protocol expects that when v1 returns, it will 
still have that record. But what if the disk fails, the state cannot be 
recovered, and v1 participates in leader election? Then we would have 
committed data on a minority of the voters. The main problem here concerns how 
we recover from this impaired state without risking the loss of this data.

Consider a naive solution which brings v1 back with an empty disk. Since the 
node has lost its prior knowledge of the state of the quorum, it will vote for 
any candidate that comes along. If v3 becomes a candidate, then it will vote 
for itself and it just needs the vote from v1 to become leader. If that 
happens, then the committed data on v2 will be lost.

This is just one scenario. In general, the invariants that the raft protocol is 
designed to preserve go out the window when disk state is lost. For example, it 
is also possible to contrive a scenario where the loss of disk state leads to 
multiple leaders. There is a good reason why raft requires that any vote cast 
by a voter is written to disk since otherwise the voter may vote for different 
candidates in the same epoch.

Many systems solve this problem with a unique identifier which is generated 
automatically and stored on disk. This identifier is then committed to the raft 
log. If a disk changes, we would see a new identifier and we can prevent the 
node from breaking raft invariants. Then recovery from a failed disk requires a 
quorum reconfiguration. We need something like this in KRaft to make disk 
recovery possible.
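
A minimal sketch of the on-disk identifier piece, assuming a made-up file name
and using the Kafka `Uuid` type purely for convenience (this is not the actual
mechanism, just an illustration of the idea):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.kafka.common.Uuid;

// Hypothetical sketch: generate a directory id on first startup and persist it,
// so that a replaced (empty) disk shows up later with a *different* id than the
// one committed to the raft log.
public class DirectoryId {
    public static Uuid loadOrCreate(Path logDir) throws IOException {
        Path idFile = logDir.resolve("directory.id");  // hypothetical file name
        if (Files.exists(idFile)) {
            String content = new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8);
            return Uuid.fromString(content.trim());
        }
        Uuid id = Uuid.randomUuid();
        Files.write(idFile, id.toString().getBytes(StandardCharsets.UTF_8));
        return id;
    }
}
{code}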

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14055) Transaction markers may be lost during cleaning if data keys conflict with marker keys

2022-07-10 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14055:

Affects Version/s: 2.8.1
   2.7.2
   2.6.3
   2.5.1
   2.4.1

> Transaction markers may be lost during cleaning if data keys conflict with 
> marker keys
> --
>
> Key: KAFKA-14055
> URL: https://issues.apache.org/jira/browse/KAFKA-14055
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.1, 2.5.1, 2.6.3, 2.7.2, 2.8.1, 3.0.1, 3.2.0, 3.1.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
> Fix For: 3.3.0, 3.0.2, 3.1.2, 3.2.1
>
>
> We have recently been seeing hanging transactions occur quite frequently on 
> streams changelog topics. After investigation, we found that the keys used in 
> the changelog topic conflict with the keys used in the transaction markers 
> (the key schema used in control records is 4 bytes, which happens to be the 
> same size as the keys in the changelog topics that we investigated). When we 
> build the offset map prior to cleaning, we properly exclude the transaction 
> marker keys, but the bug is that we do not exclude them during the cleaning 
> phase. This can result in the marker being removed from the cleaned log 
> before the corresponding data is removed when there is a user record with a 
> conflicting key at a higher offset. A side effect of this is a so-called 
> "hanging" transaction, but the bigger problem is that we lose the atomicity 
> of the transaction.
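
As a sketch of the intended behavior during the cleaning phase (illustrative
only; the real logic lives in the LogCleaner and is more involved than this),
records that belong to a control batch should never be dropped just because a
data record with a colliding key exists at a higher offset:
{code:java}
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.record.RecordBatch;

// Hypothetical retention check used while copying records during cleaning.
public class CleaningRetentionCheck {
    // latestOffsetForKey: the latest offset recorded for this key in the offset
    // map built before cleaning, or -1 if the key is absent.
    public static boolean shouldRetain(RecordBatch batch, Record record, long latestOffsetForKey) {
        // Control batches carry the transaction markers; their fixed 4-byte keys
        // must never be matched against data keys from the offset map.
        if (batch.isControlBatch()) {
            return true;
        }
        // A data record survives only if it is still the latest for its key.
        return record.offset() >= latestOffsetForKey;
    }
}
{code}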



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14055) Transaction markers may be lost during cleaning if data keys conflict with marker keys

2022-07-10 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-14055.
-
Resolution: Fixed

> Transaction markers may be lost during cleaning if data keys conflict with 
> marker keys
> --
>
> Key: KAFKA-14055
> URL: https://issues.apache.org/jira/browse/KAFKA-14055
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
> Fix For: 3.3.0, 3.0.2, 3.1.2, 3.2.1
>
>
> We have recently been seeing hanging transactions occur quite frequently on 
> streams changelog topics. After investigation, we found that the keys used in 
> the changelog topic conflict with the keys used in the transaction markers 
> (the key schema used in control records is 4 bytes, which happens to be the 
> same size as the keys in the changelog topics that we investigated). When we 
> build the offset map prior to cleaning, we properly exclude the transaction 
> marker keys, but the bug is that we do not exclude them during the cleaning 
> phase. This can result in the marker being removed from the cleaned log 
> before the corresponding data is removed when there is a user record with a 
> conflicting key at a higher offset. A side effect of this is a so-called 
> "hanging" transaction, but the bigger problem is that we lose the atomicity 
> of the transaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

