Hi,

The broker side is a three-machine cluster. The replication factor for the input topics and also the internal topics is 3. The brokers don't seem to fail; I always see their instances running.
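In case it helps, this is roughly how we configure the internal-topic replication factor on the Streams side (a minimal sketch; the application id and broker list below are placeholders, not our exact values):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsProps {
        static Properties baseConfig() {
            Properties props = new Properties();
            // Placeholder application id and broker list.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            // Internal topics (changelogs, repartition topics) are created by Streams itself,
            // so their replication factor is set here rather than on the broker.
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
            // Four stream threads per instance, as described below.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
            return props;
        }
    }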
Also note that an identical streams application, running single-threaded on a single instance and pulling data from another, non-partitioned but otherwise identical topic, never fails. There too the replication factor is 3 for the input and internal topics.

Please let us know if you have anything on the other errors as well, and what ways there are to make the streams application resilient. I do feel we either need hooks to start new stream threads in case a thread shuts down due to an unhandled exception, or the streams application itself should do a better job of handling such exceptions and not shut the threads down (see the rough sketch at the bottom of this mail for the kind of hook I have in mind).

Thanks
Sachin

On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> Hi Sachin,
>
> See my previous email on the NotLeaderForPartitionException error.
>
> What is your Kafka configuration, how many brokers are you using? Also
> could you share the replication level (if different from 1) of your
> streams topics? Are there brokers failing while Streams is running?
>
> Thanks
> Eno
>
> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>
> Hi All,
> I am revisiting the ongoing issue of getting a multi-instance,
> multi-threaded Kafka Streams cluster to work.
>
> The scenario is that we have a 12-partition source topic (note that our
> server cluster replication factor is 3).
> We have a 3-machine client cluster with one instance on each machine.
> Each instance uses 4 threads.
> The Streams version is 0.10.2 with the latest deadlock fix and RocksDB
> optimization from trunk.
>
> We also have an identical single-partition topic and another
> single-threaded instance doing identical processing to the above one.
> This uses version 0.10.1.1.
> This streams application never goes down.
>
> The above application used to go down frequently with high CPU wait time,
> and we also used to get frequent deadlock issues. However, since including
> the fixes we see very little CPU wait time and the application no longer
> enters a deadlock. The threads simply get uncaught exceptions thrown from
> the streams application and they die one by one, eventually shutting down
> the entire client cluster.
> So we now need to understand what could be causing these exceptions and
> how we can fix them.
>
> Here is the summary:
>
> instance 84
> All four threads die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> So is this something we can handle at the streams level rather than having
> it thrown all the way up to the thread?
>
> instance 85
> Two threads again die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> The other two die due to
> Caused by: org.rocksdb.RocksDBException: ~
> I know this is some known RocksDB issue. Is there a way we can handle it
> on the streams side? What do you suggest to avoid it, or what could be
> causing it?
>
> instance 87
> Two threads again die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> One dies due to
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for
> new-part-advice-key-table-changelog-11: 30015 ms has passed since last
> append
>
> I have really not understood what this means. Any idea what could be the
> issue here?
>
> The last one dies due to
> Caused by: java.lang.IllegalStateException: task [0_9] Log end offset of
> new-part-advice-key-table-changelog-9 should not change while restoring:
> old end offset 647352, current offset 647632
>
> I feel this one too should not be thrown to the stream thread and should
> be handled at the streams level.
>
> The complete logs can be found at:
> https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
>
> So basically I feel the streams application should be more resilient: it
> should not fail on such exceptions but should have a way to handle them,
> or it should provide programmers hooks so that even when a stream thread
> is shut down, a new thread can be started and we still have a running
> streams application.
>
> The most common cause seems to be
> org.apache.kafka.common.errors.NotLeaderForPartitionException, and this
> along with a few others should get handled.
>
> Let us know what your thoughts are.
>
> Thanks
> Sachin
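To make concrete what I mean by a hook, here is a rough sketch of how we supervise the instance today. The topic names, broker list and sleep interval are placeholders; as far as I can see, setUncaughtExceptionHandler in 0.10.2 only notifies us that a thread died, it cannot restart the thread, so the only option we have is to recycle the whole KafkaStreams instance:

    import java.util.Properties;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class ResilientStreamsRunner {

        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

            final AtomicBoolean threadDied = new AtomicBoolean(false);

            while (true) {
                // "input-topic" / "output-topic" stand in for the real topology.
                KStreamBuilder builder = new KStreamBuilder();
                builder.stream("input-topic").to("output-topic");

                final KafkaStreams streams = new KafkaStreams(builder, props);

                // There is no per-thread restart API here; all we can do is record
                // that a thread died and recycle the whole instance below.
                streams.setUncaughtExceptionHandler((t, e) -> {
                    System.err.println("Stream thread " + t.getName() + " died: " + e);
                    threadDied.set(true);
                });

                streams.start();

                // Naive supervision loop: once any thread has died, close and
                // recreate the instance to get back to the configured thread count.
                while (!threadDied.get()) {
                    Thread.sleep(10_000L);
                }
                threadDied.set(false);
                streams.close();
            }
        }
    }

Restarting the whole instance like this forces a rebalance and state restoration across all tasks, which is exactly why a way to replace just the failed thread (or better internal handling of these recoverable exceptions) would help.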