- single threaded, multiple instances
We could not try this option. However, we observed that running multiple
instances on the same machine, each with a single thread, would still
create multiple RocksDB instances, and somehow the VM is not able to
handle that many RocksDB instances running. The bottleneck here used to
be RocksDB, though this was with our earlier RocksDB config.

- single thread, single instance, but multiple partitions
Here again we would have had to restrict the number of partitions so as
to limit the RocksDB instances.

However, given that we need to partition our input source into at least 12
partitions to handle the peak-time load, neither of the two options was
feasible.

The current state is as follows:

1. We have picked up the latest RocksDB config from trunk.
2. Fixed the deadlock issue.
3. Increased the RAM on the machine.

This worked fine, but we ran into other issues, which I posted in this thread.
4. Now we have set ProducerConfig.RETRIES_CONFIG = Integer.MAX_VALUE.
This setting is working fine. It has been a day now and all 12 of the
application's threads are up and running. Since this is the month end (and
also the financial year end) and we are implementing this for a banking
application, we are getting close to peak load, and it is holding up fine
under pressure. Also, we are not experiencing any lag like we do for the
single instance, so our partition-count logic is also OK.
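
For reference, here is a minimal sketch of how we pass that setting in.
The application id, broker list, and thread count below are illustrative
placeholders, not our exact production values:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsSettings {

    public static Properties buildConfig() {
        Properties props = new Properties();
        // Placeholder values; our real application id and brokers differ.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        // Step 4 above: retry internal producer sends indefinitely instead
        // of letting a transient broker error kill the stream thread.
        // Streams forwards this property to its internal producers.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}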

To conclude:
1. RocksDB is indeed a bottleneck here, but with the latest settings (and
perhaps the increased RAM) it is working fine. So far we have not been
able to figure out how much of this is a hardware issue and how much is a
software issue.

2. The streams application is not resilient. It kills the thread in cases
where it should handle the exception, or, even if it wants to throw the
exception all the way up, it should give devs the flexibility to spawn a
new thread if some thread dies (see the sketch below). Examples are:
Log end offset of changelog should not change while restoring
or Expiring 1 record(s) for changelog
or org.rocksdb.RocksDBException: ~
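
Until streams gives us that flexibility, the rough workaround we have in
mind looks like the sketch below: register an uncaught exception handler
and recycle the whole KafkaStreams instance when a thread dies. This is
our own sketch against the 0.10.2 API (KStreamBuilder), not an official
recipe, and the restart policy is an assumption:

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class ResilientStreamsRunner {

    private final KStreamBuilder builder;  // topology built elsewhere
    private final Properties config;
    private volatile KafkaStreams streams;

    public ResilientStreamsRunner(KStreamBuilder builder, Properties config) {
        this.builder = builder;
        this.config = config;
    }

    public synchronized void start() {
        streams = new KafkaStreams(builder, config);
        // There is no hook today to revive a single dead StreamThread, so
        // we react to the uncaught exception by recycling the whole
        // instance. The restart runs on a fresh thread because closing the
        // instance from within the dying stream thread would block.
        streams.setUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Stream thread " + thread.getName()
                    + " died: " + throwable);
            new Thread(this::restart).start();
        });
        streams.start();
    }

    private synchronized void restart() {
        streams.close();
        start();
    }
}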

Let's hope that with PR https://github.com/apache/kafka/pull/2719 many of
these errors will be resolved.

Thanks
Sachin



On Tue, Mar 28, 2017 at 1:02 AM, Matthias J. Sax <matth...@confluent.io>
wrote:

> Sachin,
>
> about this statement:
>
> >> Also note that when an identical streams application with single thread
> >> on a single instance is pulling data from some other non partitioned
> >> identical topic, the application never fails.
>
> What about some "combinations":
>  - single threaded multiple instances
>  - single thread single instance but multiple partitions
> etc.
>
> It would be helpful to understand what scenario works and what does not.
> Right now you go from
> single-threaded-single-instance-with-no-partitioning to
> multi-threaded-multiple-instances-and-partitioned -- that's a big step
> to reason about the situation.
>
>
> -Matthias
>
>
> On 3/25/17 11:14 AM, Sachin Mittal wrote:
> > Hi,
> > The broker is a three-machine cluster. The replication factor for the
> > input and also the internal topics is 3.
> > Brokers don't seem to fail. I always see their instances running.
> >
> > Also note that when an identical streams application with a single
> > thread on a single instance is pulling data from some other
> > non-partitioned identical topic, the application never fails. Note that
> > there too the replication factor is 3 for the input and internal topics.
> >
> > Please let us know if you have something for the other errors. Also, in
> > what ways can we make the streams resilient? I do feel we need hooks to
> > start new stream threads in case some thread shuts down due to an
> > unhandled exception, or the streams application itself should do a
> > better job of handling such exceptions and not shut down the threads.
> >
> > Thanks
> > Sachin
> >
> >
> > On Sat, Mar 25, 2017 at 11:03 PM, Eno Thereska <eno.there...@gmail.com>
> > wrote:
> >
> >> Hi Sachin,
> >>
> >> See my previous email on the NotLeaderForPartitionException error.
> >>
> >> What is your Kafka configuration, and how many brokers are you using?
> >> Also, could you share the replication level (if different from 1) of
> >> your streams topics? Are there brokers failing while Streams is running?
> >>
> >> Thanks
> >> Eno
> >>
> >> On 25/03/2017, 11:00, "Sachin Mittal" <sjmit...@gmail.com> wrote:
> >>
> >>     Hi All,
> >>     I am revisiting the ongoing issue of getting a multi-instance,
> >>     multi-threaded Kafka Streams cluster to work.
> >>
> >>     The scenario is that we have a 12-partition source topic (note that
> >>     our server cluster replication factor is 3).
> >>     We have a 3-machine client cluster with one instance on each. Each
> >>     instance uses 4 threads.
> >>     The Streams version is 0.10.2 with the latest deadlock fix and
> >>     RocksDB optimization from trunk.
> >>
> >>     We also have an identical single-partition topic and another
> >>     single-threaded instance doing processing identical to the above.
> >>     This uses version 0.10.1.1.
> >>     This streams application never goes down.
> >>
> >>     The above application used to go down frequently with high CPU wait
> >>     time, and we also used to get frequent deadlock issues. However,
> >>     since including the fixes, we see very little CPU wait time and the
> >>     application no longer enters a deadlock. The threads simply get an
> >>     uncaught exception thrown from the streams application and they die
> >>     one by one, eventually shutting down the entire client cluster.
> >>     So we now need to understand what could be causing these exceptions
> >>     and how we can fix those.
> >>
> >>     Here is the summary
> >>     instance 84
> >>     All four threads die due to
> >>     org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> >>     server is not the leader for that topic-partition.
> >>
> >>     So is this something we can handle at the streams level, so that it
> >>     does not get thrown all the way to the thread?
> >>
> >>
> >>     instance 85
> >>     two threads again die due to
> >>     org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> >>     server is not the leader for that topic-partition.
> >>
> >>     the other two die due to
> >>     Caused by: org.rocksdb.RocksDBException: ~
> >>     I know this is some known RocksDB issue. Is there a way we can
> >>     handle it on the streams side? What do you suggest to avoid this, or
> >>     what could be causing it?
> >>
> >>
> >>     instance 87
> >>     two threads again die due to
> >>     org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> >>     server is not the leader for that topic-partition.
> >>
> >>     one dies due to
> >>     org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s)
> >>     for new-part-advice-key-table-changelog-11: 30015 ms has passed since
> >>     last append
> >>
> >>     I have not really understood what this means; any idea what the
> >>     issue could be here?
> >>
> >>     the last one dies due to
> >>     Caused by: java.lang.IllegalStateException: task [0_9] Log end offset
> >>     of new-part-advice-key-table-changelog-9 should not change while
> >>     restoring: old end offset 647352, current offset 647632
> >>
> >>     I feel this too should not be thrown to the stream thread, and
> >>     should be handled at the streams level.
> >>
> >>     The complete logs can be found at:
> >>     https://www.dropbox.com/s/2t4ysfdqbtmcusq/complete_84_85_87_log.zip?dl=0
> >>
> >>     So I feel that basically the streams application should be more
> >>     resilient: it should not fail due to exceptions but should have a way
> >>     to handle them, or it should provide programmers with hooks so that
> >>     even if a stream thread is shut down there is a way to start a new
> >>     thread, keeping the streams application running.
> >>
> >>     The most common cause seems to be
> >>     org.apache.kafka.common.errors.NotLeaderForPartitionException, and
> >>     this, along with a few others, should get handled.
> >>
> >>     Let us know what are your thoughts.
> >>
> >>
> >>     Thanks
> >>     Sachin
> >>
> >>
> >>
> >>
> >
>
>
