Ivan Yurchenko created KAFKA-15931: -------------------------------------- Summary: Cached transaction index gets closed if tiered storage read is interrupted Key: KAFKA-15931 URL: https://issues.apache.org/jira/browse/KAFKA-15931 Project: Kafka Issue Type: Bug Components: Tiered-Storage Affects Versions: 3.6.0 Reporter: Ivan Yurchenko
This reproduces when reading from remote storage with the default {{fetch.max.wait.ms}} (500) or lower. This error is logged {noformat} [2023-11-29 14:01:01,166] ERROR Error occurred while reading the remote data for topic1-0 (kafka.log.remote.RemoteLogReader) org.apache.kafka.common.KafkaException: Failed read position from the transaction index <index_file_name> at org.apache.kafka.storage.internals.log.TransactionIndex$1.hasNext(TransactionIndex.java:235) at org.apache.kafka.storage.internals.log.TransactionIndex.collectAbortedTxns(TransactionIndex.java:171) at kafka.log.remote.RemoteLogManager.collectAbortedTransactions(RemoteLogManager.java:1359) at kafka.log.remote.RemoteLogManager.addAbortedTransactions(RemoteLogManager.java:1341) at kafka.log.remote.RemoteLogManager.read(RemoteLogManager.java:1310) at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:62) at kafka.log.remote.RemoteLogReader.call(RemoteLogReader.java:31) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.nio.channels.ClosedChannelException at java.base/sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:150) at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:325) at org.apache.kafka.storage.internals.log.TransactionIndex$1.hasNext(TransactionIndex.java:233) ... 10 more {noformat} and after that this txn index becomes unusable until the process is restarted. I suspect, it's caused by the reading thread being interrupted due to the fetch timeout. At least [this code|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c59dfd791f62d41f332db9e306bc1422/src/java.base/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L159-L160] in {{AbstractInterruptibleChannel}} is called. -- This message was sent by Atlassian Jira (v8.20.10#820010)