Hi All, I am revisiting the ongoing issue of getting a multi-instance, multi-threaded Kafka Streams cluster to work.
Scenario: we have a 12-partition source topic (note: our broker cluster has a replication factor of 3). We have a 3-machine client cluster with one instance on each machine, and each instance uses 4 threads. The Streams version is 0.10.2 plus the latest deadlock fix and RocksDB optimization from trunk.

We also have an identical single-partition topic and another single-threaded instance doing identical processing to the above, running version 0.10.1.1. That streams application never goes down.

The application above used to go down frequently with high CPU wait time, and we also used to get frequent deadlock issues. Since including the fixes we see very little CPU wait time and the application no longer enters a deadlock. Instead, the threads get uncaught exceptions thrown from the streams application and die one by one, eventually shutting down the entire client cluster. So we now need to understand what could be causing these exceptions and how we can fix or handle them. Here is the summary:

instance 84
All four threads die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Is this something we can handle at the streams level rather than having it thrown all the way up to the thread?

instance 85
Two threads again die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The other two die due to
Caused by: org.rocksdb.RocksDBException: ~
I know this is some known RocksDB issue. Is there a way we can handle it on the streams side? What do you suggest to avoid it, or what could be causing it?

instance 87
Two threads again die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
One dies due to
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for new-part-advice-key-table-changelog-11: 30015 ms has passed since last append
I have really not understood what this means; any idea what the issue could be here?
The last one dies due to
Caused by: java.lang.IllegalStateException: task [0_9] Log end offset of new-part-advice-key-table-changelog-9 should not change while restoring: old end offset 647352, current offset 647632
I feel this too should not be thrown to the stream thread and should be handled at the streams level.

The complete logs can be found at:
https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0

So basically I feel the streams application should be more resilient: it should not fail outright on such exceptions but should have a way to handle them, or it should provide programmers a hook so that even when a stream thread is shut down there is a way to start a new one and keep the streams application running (I have put a rough sketch of what I mean in the PS below). The most common cause seems to be org.apache.kafka.common.errors.NotLeaderForPartitionException, and this, along with a few of the others, should get handled.

Let us know what your thoughts are.

Thanks
Sachin
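PS: To make the question about hooks concrete, below is roughly what we are experimenting with; please correct me where my assumptions are wrong.

For NotLeaderForPartitionException (which should be a retriable error) and the "Expiring 1 record(s)" timeout on the changelog topic, my understanding is that config keys which StreamsConfig does not recognise are forwarded to the embedded producer, so raising the producer's retries and request timeout might ride out broker leader movements. For the thread deaths, the only hook I am aware of today is KafkaStreams#setUncaughtExceptionHandler, which as far as I can tell only lets us observe the failure; the StreamThread itself is already gone. Our crude workaround is to tear down the whole instance and build a fresh one. The sketch below shows both ideas; the topology, application id, broker address and config values are placeholders, not our real ones, and the restart is done from a separate thread because close() joins the stream threads and would block if called from the dying thread itself:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class ResilientStreamsRunner {

    public static void main(String[] args) {
        startStreams();
    }

    static void startStreams() {
        final KafkaStreams streams = createStreams();
        streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                System.err.println("Stream thread " + t.getName() + " died: " + e);
                // close() waits for the stream threads, so restart from a
                // separate thread rather than from the dying thread itself
                new Thread(new Runnable() {
                    public void run() {
                        streams.close();
                        startStreams();   // build a brand new KafkaStreams instance
                    }
                }).start();
            }
        });
        streams.start();
    }

    static KafkaStreams createStreams() {
        // placeholder topology; our real one builds the key table
        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("new-part-advice").to("new-part-advice-out");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");   // example id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        // my assumption: these are passed through to the internal producer
        props.put(ProducerConfig.RETRIES_CONFIG, 10);                // retry NotLeaderForPartition
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);  // avoid expiring records after ~30s
        return new KafkaStreams(builder, props);
    }
}

This "restart the world" approach obviously re-triggers a rebalance and state restoration on every failure, which is exactly why I would prefer streams to handle the retriable cases internally.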