Thanks for this update! Really appreciate it! This allows us to improve Kafka further!
We hope to do a bug-fix release including these findings soon! Also happy
that your application is running now! Keep us posted if possible!

-Matthias

On 3/27/17 9:44 PM, Sachin Mittal wrote:
> - single threaded, multiple instances
> This option we could not try. However, what we observed is that running
> multiple instances on the same machine with a single thread each would
> still create multiple RocksDB instances, and somehow the VM is not able
> to handle many RocksDB instances running. Here the bottleneck used to be
> RocksDB. However, this was with the earlier RocksDB config.
>
> - single thread, single instance, but multiple partitions
> Here again we would have had to restrict the partitions so as to limit
> the number of RocksDB instances.
>
> However, given that we need to partition our input source by at least 12
> to take care of the peak-time load, neither of the two options was
> feasible.
>
> The recent state is as follows:
>
> 1. We have picked up the latest RocksDB config from trunk.
> 2. Fixed the deadlock issue.
> 3. Increased the RAM on the machine.
>
> This worked fine, but we ran into other issues which I posted in this
> thread.
>
> 4. Now we have added ProducerConfig.RETRIES_CONFIG = Integer.MAX_VALUE.
> This setting is working fine. It has been a day now and all 12 of the
> application's threads are up and running fine. Since this is month end
> (and also financial year end) and we are implementing this for a banking
> application, we are getting somewhat peak load and it is working fine
> under pressure. Also, we are not experiencing any lag like we do for the
> single instance, so our number-of-partitions logic is also OK.
>
> To conclude:
>
> 1. RocksDB is indeed a bottleneck here, but with the latest settings
> (and perhaps the increased RAM) it is working fine. So far we have not
> been able to figure out how much of this is a hardware issue and how
> much is a software issue.
>
> 2. The streams application is not resilient.
> It kills the thread in cases where it should handle the exception, or,
> even if it wants to throw the exception all the way up, it should give
> devs the flexibility to spawn a new thread if some thread dies. Examples
> are:
> "Log end offset of changelog should not change while restoring"
> or "Expiring 1 record(s) for changelog"
> or org.rocksdb.RocksDBException: ~
>
> Let's hope that with the PR https://github.com/apache/kafka/pull/2719
> many of these errors are resolved.
>
> Thanks
> Sachin
>
>
> On Tue, Mar 28, 2017 at 1:02 AM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
>> Sachin,
>>
>> about this statement:
>>
>>>> Also note that when an identical streams application with a single
>>>> thread on a single instance is pulling data from some other
>>>> non-partitioned identical topic, the application never fails.
>>
>> What about some "combinations":
>> - single threaded, multiple instances
>> - single thread, single instance, but multiple partitions
>> etc.
>>
>> It would be helpful to understand what scenario works and what does
>> not. Right now you go from
>> single-threaded-single-instance-with-no-partitioning to
>> multi-threaded-multiple-instances-and-partitioned -- that's a big step
>> to reason about the situation.
>>
>>
>> -Matthias
>>
>>
>> On 3/25/17 11:14 AM, Sachin Mittal wrote:
>>> Hi,
>>> The broker is a three-machine cluster. The replication factor for
>>> input and also internal topics is 3.
>>> Brokers don't seem to fail. I always see their instances running.
>>>
>>> Also note that when an identical streams application with a single
>>> thread on a single instance is pulling data from some other
>>> non-partitioned identical topic, the application never fails. Note
>>> that there too the replication factor is 3 for input and internal
>>> topics.
>>>
>>> Please let us know if you have something for the other errors. Also,
>>> in what ways can we make the streams resilient?
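The RETRIES_CONFIG change described earlier in the thread could be sketched
roughly as below. This assumes Streams 0.10.x behaviour, where producer-level
keys placed in the streams Properties are forwarded to the internal producer;
the application id and broker address here are hypothetical placeholders, and
the keys are shown as string literals (in real code one would use
ProducerConfig.RETRIES_CONFIG, which resolves to "retries", and the
StreamsConfig constants):

```java
import java.util.Properties;

public class RetriesConfigSketch {

    // Builds a streams-style config with the producer-level "retries" key
    // set very high, so retriable send errors (such as
    // NotLeaderForPartitionException during a leader election) are retried
    // internally instead of killing the stream thread.
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put("application.id", "new-part-advice");  // hypothetical app id
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
        props.put("retries", Integer.MAX_VALUE);         // producer retries
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig());
    }
}
```

The trade-off is that a very high retries value can block sends for a long
time while a broker is unavailable, which is usually still preferable to a
dead thread.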
>>> I do feel we need hooks to start new stream threads just in case some
>>> thread shuts down due to an unhandled exception, or the streams
>>> application itself should do a better job of handling such exceptions
>>> and not shut down the threads.
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>> On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sachin,
>>>>
>>>> See my previous email on the NotLeaderForPartitionException error.
>>>>
>>>> What is your Kafka configuration, and how many brokers are you using?
>>>> Also, could you share the replication level (if different from 1) of
>>>> your streams topics? Are there brokers failing while Streams is
>>>> running?
>>>>
>>>> Thanks
>>>> Eno
>>>>
>>>> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>> I am revisiting the ongoing issue of getting a multi-instance,
>>>> multi-threaded Kafka Streams cluster to work.
>>>>
>>>> The scenario is that we have a 12-partition source topic. (Note: our
>>>> server cluster replication factor is 3.)
>>>> We have a 3-machine client cluster with one instance on each. Each
>>>> instance uses 4 threads.
>>>> The Streams version is 0.10.2 with the latest deadlock fix and
>>>> RocksDB optimization from trunk.
>>>>
>>>> We also have an identical single-partition topic and another
>>>> single-threaded instance doing identical processing as the above one.
>>>> This uses version 0.10.1.1.
>>>> This streams application never goes down.
>>>>
>>>> The above application used to go down frequently with high CPU wait
>>>> time, and we also used to get frequent deadlock issues. However,
>>>> since including the fixes we see very little CPU wait time and the
>>>> application no longer enters a deadlock. The threads simply get an
>>>> uncaught exception thrown from the streams application and they die
>>>> one by one, eventually shutting down the entire client cluster.
>>>> So we now need to understand what could be causing these exceptions
>>>> and how we can fix them.
>>>>
>>>> Here is the summary:
>>>>
>>>> instance 84
>>>> All four threads die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> So is this something we can handle at the streams level and not get
>>>> it thrown all the way to the thread?
>>>>
>>>>
>>>> instance 85
>>>> Two again die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> The other two die due to
>>>> Caused by: org.rocksdb.RocksDBException: ~
>>>> I know this is some known RocksDB issue. Is there a way we can handle
>>>> it on the streams side? What do you suggest to avoid this, or what
>>>> could be causing it?
>>>>
>>>>
>>>> instance 87
>>>> Two again die due to
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>>> server is not the leader for that topic-partition.
>>>>
>>>> One dies due to
>>>> org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s)
>>>> for new-part-advice-key-table-changelog-11: 30015 ms has passed since
>>>> last append
>>>>
>>>> I have really not understood what this means. Any idea what could be
>>>> the issue here?
>>>>
>>>> The last one dies due to
>>>> Caused by: java.lang.IllegalStateException: task [0_9] Log end offset
>>>> of new-part-advice-key-table-changelog-9 should not change while
>>>> restoring: old end offset 647352, current offset 647632
>>>>
>>>> I feel this should not be thrown to the stream thread either, and
>>>> should be handled at the streams level.
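For context on the most common error above: NotLeaderForPartitionException
is a retriable error that typically appears briefly while a partition leader
is being re-elected, which is why raising the producer's retries helps. The
snippet below only illustrates the general retry-with-backoff idea in plain
Java; it is not Kafka's internal mechanism, and every name in it is made up:

```java
import java.util.function.Supplier;

public class RetrySketch {

    // Retry a failing action up to maxAttempts times with linear backoff.
    // A real Kafka client retries only RetriableException subclasses; this
    // demo retries any RuntimeException for simplicity.
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long backoffMs) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(backoffMs * attempt);  // back off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;  // exhausted all attempts: surface the last error
    }

    public static void main(String[] args) {
        // Demo: an action that fails twice (simulating a brief leader
        // election), then succeeds on the third attempt.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("leader moved");
            return "ok";
        }, 5, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```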
>>>>
>>>> The complete logs can be found at:
>>>> https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
>>>>
>>>> So I feel basically the streams application should be more resilient
>>>> and should not fail due to exceptions but should have a way to handle
>>>> them, or provide programmers hooks so that even if a stream thread is
>>>> shut down there is a way to start a new thread, so that we have a
>>>> running streams application.
>>>>
>>>> The most common reason seems to me to be
>>>> org.apache.kafka.common.errors.NotLeaderForPartitionException, and
>>>> this along with a few others should get handled.
>>>>
>>>> Let us know what your thoughts are.
>>>>
>>>>
>>>> Thanks
>>>> Sachin
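On the "hooks" request: KafkaStreams does let you register a callback via
setUncaughtExceptionHandler, though in 0.10.x it only notifies you that a
thread died; restarting is up to the application (typically by closing and
recreating the whole KafkaStreams instance). Below is a plain-Java sketch of
such a supervisor pattern, with all names hypothetical and a simulated worker
standing in for a stream thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SupervisorSketch {

    static final AtomicInteger restarts = new AtomicInteger();

    // Start a worker and, if it dies with an uncaught exception, spawn a
    // replacement (up to restartsLeft times). With Kafka Streams, the
    // analogous notification point is KafkaStreams#setUncaughtExceptionHandler.
    static void startWorker(Runnable work, CountDownLatch done, int restartsLeft) {
        Thread t = new Thread(work);
        t.setUncaughtExceptionHandler((thread, error) -> {
            if (restartsLeft > 0) {
                restarts.incrementAndGet();
                startWorker(work, done, restartsLeft - 1);  // replacement thread
            } else {
                done.countDown();  // out of restarts: give up
            }
        });
        t.start();
    }

    public static void main(String[] args) {
        restarts.set(0);  // reset so the demo is repeatable
        CountDownLatch done = new CountDownLatch(1);
        AtomicInteger attempts = new AtomicInteger();
        // Simulated stream thread: dies twice, then "runs fine".
        startWorker(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated stream-thread death");
            }
            done.countDown();
        }, done, 5);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("worker finished after " + restarts.get() + " restarts");
    }
}
```

A production supervisor would also want a restart budget and backoff, since
blindly respawning on a deterministic failure just loops forever.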