Hi,

The broker side is a three-machine cluster. The replication factor for the input topics and also the internal topics is 3. The brokers don't seem to fail; I always see their instances running.
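In case it helps, this is roughly how we configure the internal-topic replication factor on the Streams side (a minimal sketch; the application id and broker list below are placeholders, not our exact values):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsProps {
        static Properties baseConfig() {
            Properties props = new Properties();
            // Placeholder application id and broker list.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            // Internal topics (changelogs, repartition topics) are created by Streams itself,
            // so their replication factor is set here rather than on the broker.
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
            // Four stream threads per instance, as described below.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
            return props;
        }
    }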
Also note that an identical streams application, running single-threaded on a single instance and pulling data from another, non-partitioned but otherwise identical topic, never fails. There too the replication factor is 3 for the input and internal topics.

Please let us know if you have anything on the other errors as well, and what ways there are to make the streams application resilient. I do feel we either need hooks to start new stream threads in case a thread shuts down due to an unhandled exception, or the streams application itself should do a better job of handling such exceptions and not shut the threads down (see the rough sketch at the bottom of this mail for the kind of hook I have in mind).

Thanks
Sachin

On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> Hi Sachin,
>
> See my previous email on the NotLeaderForPartitionException error.
>
> What is your Kafka configuration, how many brokers are you using? Also
> could you share the replication level (if different from 1) of your
> streams topics? Are there brokers failing while Streams is running?
>
> Thanks
> Eno
>
> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>
> Hi All,
> I am revisiting the ongoing issue of getting a multi-instance,
> multi-threaded Kafka Streams cluster to work.
>
> The scenario is that we have a 12-partition source topic (note that our
> server cluster replication factor is 3).
> We have a 3-machine client cluster with one instance on each machine.
> Each instance uses 4 threads.
> The Streams version is 0.10.2 with the latest deadlock fix and RocksDB
> optimization from trunk.
>
> We also have an identical single-partition topic and another
> single-threaded instance doing identical processing to the above one.
> This uses version 0.10.1.1.
> This streams application never goes down.
>
> The above application used to go down frequently with high CPU wait time,
> and we also used to get frequent deadlock issues. However, since including
> the fixes we see very little CPU wait time and the application no longer
> enters a deadlock. The threads simply get uncaught exceptions thrown from
> the streams application and they die one by one, eventually shutting down
> the entire client cluster.
> So we now need to understand what could be causing these exceptions and
> how we can fix them.
>
> Here is the summary:
>
> instance 84
> All four threads die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> So is this something we can handle at the streams level rather than having
> it thrown all the way up to the thread?
>
> instance 85
> Two threads again die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> The other two die due to
> Caused by: org.rocksdb.RocksDBException: ~
> I know this is some known RocksDB issue. Is there a way we can handle it
> on the streams side? What do you suggest to avoid it, or what could be
> causing it?
>
> instance 87
> Two threads again die due to
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
>
> One dies due to
> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for
> new-part-advice-key-table-changelog-11: 30015 ms has passed since last
> append
>
> I have really not understood what this means. Any idea what could be the
> issue here?
>
> The last one dies due to
> Caused by: java.lang.IllegalStateException: task [0_9] Log end offset of
> new-part-advice-key-table-changelog-9 should not change while restoring:
> old end offset 647352, current offset 647632
>
> I feel this one too should not be thrown to the stream thread and should
> be handled at the streams level.
>
> The complete logs can be found at:
> https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
>
> So basically I feel the streams application should be more resilient: it
> should not fail on such exceptions but should have a way to handle them,
> or it should provide programmers hooks so that even when a stream thread
> is shut down, a new thread can be started and we still have a running
> streams application.
>
> The most common cause seems to be
> org.apache.kafka.common.errors.NotLeaderForPartitionException, and this
> along with a few others should get handled.
>
> Let us know what your thoughts are.
>
> Thanks
> Sachin
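To make concrete what I mean by a hook, here is a rough sketch of how we supervise the instance today. The topic names, broker list and sleep interval are placeholders; as far as I can see, setUncaughtExceptionHandler in 0.10.2 only notifies us that a thread died, it cannot restart the thread, so the only option we have is to recycle the whole KafkaStreams instance:

    import java.util.Properties;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class ResilientStreamsRunner {

        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
            props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

            final AtomicBoolean threadDied = new AtomicBoolean(false);

            while (true) {
                // "input-topic" / "output-topic" stand in for the real topology.
                KStreamBuilder builder = new KStreamBuilder();
                builder.stream("input-topic").to("output-topic");

                final KafkaStreams streams = new KafkaStreams(builder, props);

                // There is no per-thread restart API here; all we can do is record
                // that a thread died and recycle the whole instance below.
                streams.setUncaughtExceptionHandler((t, e) -> {
                    System.err.println("Stream thread " + t.getName() + " died: " + e);
                    threadDied.set(true);
                });

                streams.start();

                // Naive supervision loop: once any thread has died, close and
                // recreate the instance to get back to the configured thread count.
                while (!threadDied.get()) {
                    Thread.sleep(10_000L);
                }
                threadDied.set(false);
                streams.close();
            }
        }
    }

Restarting the whole instance like this forces a rebalance and state restoration across all tasks, which is exactly why a way to replace just the failed thread (or better internal handling of these recoverable exceptions) would help.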