Hi All, I am revisiting the ongoing issue of getting a multi-instance, multi-threaded Kafka Streams cluster to work.
Scenario: we have a 12-partition source topic (note: our broker cluster has a replication factor of 3). We have a 3-machine client cluster with one instance on each machine, and each instance uses 4 threads. The Streams version is 0.10.2 plus the latest deadlock fix and RocksDB optimization from trunk.

We also have an identical single-partition topic and another single-threaded instance doing identical processing to the above, running version 0.10.1.1. That streams application never goes down.

The application above used to go down frequently with high CPU wait time, and we also used to get frequent deadlock issues. Since including the fixes we see very little CPU wait time and the application no longer enters a deadlock. Instead, the threads get uncaught exceptions thrown from the streams application and die one by one, eventually shutting down the entire client cluster. So we now need to understand what could be causing these exceptions and how we can fix or handle them. Here is the summary:

instance 84
All four threads die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Is this something we can handle at the streams level rather than having it thrown all the way up to the thread?

instance 85
Two threads again die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The other two die due to
Caused by: org.rocksdb.RocksDBException: ~
I know this is some known RocksDB issue. Is there a way we can handle it on the streams side? What do you suggest to avoid it, or what could be causing it?

instance 87
Two threads again die due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
One dies due to
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for new-part-advice-key-table-changelog-11: 30015 ms has passed since last append
I have really not understood what this means; any idea what the issue could be here?
The last one dies due to
Caused by: java.lang.IllegalStateException: task [0_9] Log end offset of new-part-advice-key-table-changelog-9 should not change while restoring: old end offset 647352, current offset 647632
I feel this too should not be thrown to the stream thread and should be handled at the streams level.

The complete logs can be found at:
https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0

So basically I feel the streams application should be more resilient: it should not fail outright on such exceptions but should have a way to handle them, or it should provide programmers a hook so that even when a stream thread is shut down there is a way to start a new one and keep the streams application running (I have put a rough sketch of what I mean in the PS below). The most common cause seems to be org.apache.kafka.common.errors.NotLeaderForPartitionException, and this, along with a few of the others, should get handled.

Let us know what your thoughts are.

Thanks
Sachin
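PS: To make the question about hooks concrete, below is roughly what we are experimenting with; please correct me where my assumptions are wrong.

For NotLeaderForPartitionException (which should be a retriable error) and the "Expiring 1 record(s)" timeout on the changelog topic, my understanding is that config keys which StreamsConfig does not recognise are forwarded to the embedded producer, so raising the producer's retries and request timeout might ride out broker leader movements. For the thread deaths, the only hook I am aware of today is KafkaStreams#setUncaughtExceptionHandler, which as far as I can tell only lets us observe the failure; the StreamThread itself is already gone. Our crude workaround is to tear down the whole instance and build a fresh one. The sketch below shows both ideas; the topology, application id, broker address and config values are placeholders, not our real ones, and the restart is done from a separate thread because close() joins the stream threads and would block if called from the dying thread itself:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class ResilientStreamsRunner {

    public static void main(String[] args) {
        startStreams();
    }

    static void startStreams() {
        final KafkaStreams streams = createStreams();
        streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                System.err.println("Stream thread " + t.getName() + " died: " + e);
                // close() waits for the stream threads, so restart from a
                // separate thread rather than from the dying thread itself
                new Thread(new Runnable() {
                    public void run() {
                        streams.close();
                        startStreams();   // build a brand new KafkaStreams instance
                    }
                }).start();
            }
        });
        streams.start();
    }

    static KafkaStreams createStreams() {
        // placeholder topology; our real one builds the key table
        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("new-part-advice").to("new-part-advice-out");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");   // example id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        // my assumption: these are passed through to the internal producer
        props.put(ProducerConfig.RETRIES_CONFIG, 10);                // retry NotLeaderForPartition
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);  // avoid expiring records after ~30s
        return new KafkaStreams(builder, props);
    }
}

This "restart the world" approach obviously re-triggers a rebalance and state restoration on every failure, which is exactly why I would prefer streams to handle the retriable cases internally.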