Thanks for this update! Really appreciate it! This allows us to improve Kafka further!
We hope to do a bug-fix release including these findings soon! Also happy
that your application is running now! Keep us posted if possible!

-Matthias

On 3/27/17 9:44 PM, Sachin Mittal wrote:
> - single threaded, multiple instances
> This option we could not try. However, what we observed is that running
> multiple instances on the same machine with a single thread each would
> still create multiple RocksDB instances, and somehow the VM is not able
> to handle many RocksDB instances running. Here the bottleneck used to be
> RocksDB. However, this was with the earlier RocksDB config.
>
> - single thread, single instance, but multiple partitions
> Here again we would have had to restrict the partitions so as to limit
> the number of RocksDB instances.
>
> However, given that we need to partition our input source by at least 12
> to take care of the peak-time load, neither of the two options was
> feasible.
>
> The recent state is as follows:
>
> 1. We have picked up the latest RocksDB config from trunk.
> 2. Fixed the deadlock issue.
> 3. Increased the RAM on the machine.
>
> This worked fine, but we ran into other issues which I posted in this
> thread.
>
> 4. Now we have added ProducerConfig.RETRIES_CONFIG = Integer.MAX_VALUE.
> This setting is working fine. It has been a day now and all 12 of the
> application's threads are up and running fine. Since this is month end
> (and also financial year end) and we are implementing this for a banking
> application, we are getting somewhat peak load and it is working fine
> under pressure. Also, we are not experiencing any lag like we do for the
> single instance, so our number-of-partitions logic is also OK.
>
> To conclude:
>
> 1. RocksDB is indeed a bottleneck here, but with the latest settings
> (and perhaps the increased RAM) it is working fine. So far we have not
> been able to figure out how much of this is a hardware issue and how
> much is a software issue.
>
> 2. The streams application is not resilient.
> It kills the thread in cases where it should handle the exception, or,
> even if it wants to throw the exception all the way up, it should give
> devs the flexibility to spawn a new thread if some thread dies. Examples
> are:
> "Log end offset of changelog should not change while restoring"
> or "Expiring 1 record(s) for changelog"
> or org.rocksdb.RocksDBException: ~
>
> Let's hope that with the PR https://github.com/apache/kafka/pull/2719
> many of these errors are resolved.
>
> Thanks
> Sachin
>
>
> On Tue, Mar 28, 2017 at 1:02 AM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
>> Sachin,
>>
>> about this statement:
>>
>>>> Also note that when an identical streams application with a single
>>>> thread on a single instance is pulling data from some other
>>>> non-partitioned identical topic, the application never fails.
>>
>> What about some "combinations":
>> - single threaded, multiple instances
>> - single thread, single instance, but multiple partitions
>> etc.
>>
>> It would be helpful to understand what scenario works and what does
>> not. Right now you go from
>> single-threaded-single-instance-with-no-partitioning to
>> multi-threaded-multiple-instances-and-partitioned -- that's a big step
>> to reason about the situation.
>>
>>
>> -Matthias
>>
>>
>> On 3/25/17 11:14 AM, Sachin Mittal wrote:
>>> Hi,
>>> The broker is a three-machine cluster. The replication factor for
>>> input and also internal topics is 3.
>>> Brokers don't seem to fail. I always see their instances running.
>>>
>>> Also note that when an identical streams application with a single
>>> thread on a single instance is pulling data from some other
>>> non-partitioned identical topic, the application never fails. Note
>>> that there too the replication factor is 3 for input and internal
>>> topics.
>>>
>>> Please let us know if you have something for the other errors. Also,
>>> in what ways can we make the streams resilient?
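The RETRIES_CONFIG change described earlier in the thread could be sketched
roughly as below. This assumes Streams 0.10.x behaviour, where producer-level
keys placed in the streams Properties are forwarded to the internal producer;
the application id and broker address here are hypothetical placeholders, and
the keys are shown as string literals (in real code one would use
ProducerConfig.RETRIES_CONFIG, which resolves to "retries", and the
StreamsConfig constants):

```java
import java.util.Properties;

public class RetriesConfigSketch {

    // Builds a streams-style config with the producer-level "retries" key
    // set very high, so retriable send errors (such as
    // NotLeaderForPartitionException during a leader election) are retried
    // internally instead of killing the stream thread.
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put("application.id", "new-part-advice");  // hypothetical app id
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
        props.put("retries", Integer.MAX_VALUE);         // producer retries
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig());
    }
}
```

The trade-off is that a very high retries value can block sends for a long
time while a broker is unavailable, which is usually still preferable to a
dead thread.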
>>> I do feel we need hooks to start new stream threads just in case some
>>> thread shuts down due to an unhandled exception, or the streams
>>> application itself should do a better job of handling such exceptions
>>> and not shut down the threads.
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>> On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sachin,
>>>>
>>>> See my previous email on the NotLeaderForPartitionException error.
>>>>
>>>> What is your Kafka configuration, and how many brokers are you using?
>>>> Also, could you share the replication level (if different from 1) of
>>>> your streams topics? Are there brokers failing while Streams is
>>>> running?
>>>>
>>>> Thanks
>>>> Eno
>>>>
>>>> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>> I am revisiting the ongoing issue of getting a multi-instance,
>>>> multi-threaded Kafka Streams cluster to work.
>>>>
>>>> The scenario is that we have a 12-partition source topic. (Note: our
>>>> server cluster replication factor is 3.)
>>>> We have a 3-machine client cluster with one instance on each. Each
>>>> instance uses 4 threads.
>>>> The Streams version is 0.10.2 with the latest deadlock fix and
>>>> RocksDB optimization from trunk.
>>>>
>>>> We also have an identical single-partition topic and another
>>>> single-threaded instance doing identical processing as the above one.
>>>> This uses version 0.10.1.1.
>>>> This streams application never goes down.
>>>>
>>>> The above application used to go down frequently with high CPU wait
>>>> time, and we also used to get frequent deadlock issues. However,
>>>> since including the fixes we see very little CPU wait time and the
>>>> application no longer enters a deadlock. The threads simply get an
>>>> uncaught exception thrown from the streams application and they die
>>>> one by one, eventually shutting down the entire client cluster.
>>>> So we now need to understand what could be causing these exceptions
>>>> and how we can fix them.
>>>>
>>>> Here is the summary:
>>>>
>>>> instance 84
>>>> All four threads die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> So is this something we can handle at the streams level and not get
>>>> it thrown all the way to the thread?
>>>>
>>>>
>>>> instance 85
>>>> Two again die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> The other two die due to
>>>> Caused by: org.rocksdb.RocksDBException: ~
>>>> I know this is some known RocksDB issue. Is there a way we can handle
>>>> it on the streams side? What do you suggest to avoid this, or what
>>>> could be causing it?
>>>>
>>>>
>>>> instance 87
>>>> Two again die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> One dies due to
>>>> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s)
>>>> for new-part-advice-key-table-changelog-11: 30015 ms has passed since
>>>> last append
>>>>
>>>> I have really not understood what this means. Any idea what could be
>>>> the issue here?
>>>>
>>>> The last one dies due to
>>>> Caused by: java.lang.IllegalStateException: task [0_9] Log end offset
>>>> of new-part-advice-key-table-changelog-9 should not change while
>>>> restoring: old end offset 647352, current offset 647632
>>>>
>>>> I feel this should not be thrown to the stream thread either, and
>>>> should be handled at the streams level.
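For context on the most common error above: NotLeaderForPartitionException
is a retriable error that typically appears briefly while a partition leader
is being re-elected, which is why raising the producer's retries helps. The
snippet below only illustrates the general retry-with-backoff idea in plain
Java; it is not Kafka's internal mechanism, and every name in it is made up:

```java
import java.util.function.Supplier;

public class RetrySketch {

    // Retry a failing action up to maxAttempts times with linear backoff.
    // A real Kafka client retries only RetriableException subclasses; this
    // demo retries any RuntimeException for simplicity.
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long backoffMs) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(backoffMs * attempt);  // back off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;  // exhausted all attempts: surface the last error
    }

    public static void main(String[] args) {
        // Demo: an action that fails twice (simulating a brief leader
        // election), then succeeds on the third attempt.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("leader moved");
            return "ok";
        }, 5, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```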
>>>>
>>>> The complete logs can be found at:
>>>> https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
>>>>
>>>> So I feel basically the streams application should be more resilient
>>>> and should not fail due to exceptions but should have a way to handle
>>>> them, or provide programmers hooks so that even if a stream thread is
>>>> shut down there is a way to start a new thread, so that we have a
>>>> running streams application.
>>>>
>>>> The most common reason seems to me to be
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException, and
>>>> this along with a few others should get handled.
>>>>
>>>> Let us know what your thoughts are.
>>>>
>>>>
>>>> Thanks
>>>> Sachin
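On the "hooks" request: KafkaStreams does let you register a callback via
setUncaughtExceptionHandler, though in 0.10.x it only notifies you that a
thread died; restarting is up to the application (typically by closing and
recreating the whole KafkaStreams instance). Below is a plain-Java sketch of
such a supervisor pattern, with all names hypothetical and a simulated worker
standing in for a stream thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SupervisorSketch {

    static final AtomicInteger restarts = new AtomicInteger();

    // Start a worker and, if it dies with an uncaught exception, spawn a
    // replacement (up to restartsLeft times). With Kafka Streams, the
    // analogous notification point is KafkaStreams#setUncaughtExceptionHandler.
    static void startWorker(Runnable work, CountDownLatch done, int restartsLeft) {
        Thread t = new Thread(work);
        t.setUncaughtExceptionHandler((thread, error) -> {
            if (restartsLeft > 0) {
                restarts.incrementAndGet();
                startWorker(work, done, restartsLeft - 1);  // replacement thread
            } else {
                done.countDown();  // out of restarts: give up
            }
        });
        t.start();
    }

    public static void main(String[] args) {
        restarts.set(0);  // reset so the demo is repeatable
        CountDownLatch done = new CountDownLatch(1);
        AtomicInteger attempts = new AtomicInteger();
        // Simulated stream thread: dies twice, then "runs fine".
        startWorker(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated stream-thread death");
            }
            done.countDown();
        }, done, 5);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("worker finished after " + restarts.get() + " restarts");
    }
}
```

A production supervisor would also want a restart budget and backoff, since
blindly respawning on a deterministic failure just loops forever.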