Hello Ara,

The issue you encountered looks like KAFKA-3812 <https://issues.apache.org/jira/browse/KAFKA-3812> and KAFKA-3938 <https://issues.apache.org/jira/browse/KAFKA-3938>. Could you try upgrading to the newly released 0.10.1.0 version and see if the issue goes away? If not, I would love to investigate it further with you.
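In the meantime, if you want to rule out contention on the shared default state directory (the /tmp/kafka-streams path shown in the stack traces below), here is a minimal sketch of setting state.dir explicitly. The broker address, directory, and class name are placeholders and the topology is elided; this is only something to experiment with, not a fix for the JIRAs above:

    import java.util.Properties;

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class StateDirExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // "argyle-streams" is taken from the log paths below; replace with your application id.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "argyle-streams");
            // Placeholder broker list.
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            // Point the state directory at a dedicated path instead of the
            // /tmp/kafka-streams default.
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

            KStreamBuilder builder = new KStreamBuilder();
            // ... topology definition elided ...

            KafkaStreams streams = new KafkaStreams(builder, props);
            streams.start();
        }
    }
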
Guozhang

On Sun, Oct 23, 2016 at 1:45 PM, Ara Ebrahimi <ara.ebrah...@argyledata.com> wrote:

> And then this on a different node:
>
> 2016-10-23 13:43:57 INFO StreamThread:286 - stream-thread [StreamThread-3] Stream thread shutdown complete
> 2016-10-23 13:43:57 ERROR StreamPipeline:169 - An exception has occurred
> org.apache.kafka.streams.errors.StreamsException: stream-thread [StreamThread-3] Failed to rebalance
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:401)
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:235)
> Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while creating the state manager
>     at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:72)
>     at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:90)
>     at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:622)
>     at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:649)
>     at org.apache.kafka.streams.processor.internals.StreamThread.access$000(StreamThread.java:69)
>     at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:120)
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
>     at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
>     at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
>     at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:398)
>     ... 1 more
> Caused by: java.io.IOException: task [7_1] Failed to lock the state directory: /tmp/kafka-streams/argyle-streams/7_1
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:98)
>     at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:69)
>     ... 13 more
>
> Ara.
>
> On Oct 23, 2016, at 1:24 PM, Ara Ebrahimi <ara.ebrah...@argyledata.com> wrote:
>
> Hi,
>
> This happens when I hammer our 5 Kafka Streams nodes (each with 4 streaming threads) hard enough for an hour or so:
>
> 2016-10-23 13:04:17 ERROR StreamThread:324 - stream-thread [StreamThread-2] Failed to flush state for StreamTask 3_8:
> org.apache.kafka.streams.errors.ProcessorStateException: task [3_8] Failed to flush state store streams-data-record-stats-avro-br-store
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:322)
>     at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:181)
>     at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:360)
>     at org.apache.kafka.streams.processor.internals.StreamThread.performOnAllTasks(StreamThread.java:322)
>     at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:357)
>     at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:295)
>     at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:262)
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:245)
> Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error opening store streams-data-record-stats-avro-br-store-201505160000 at location /tmp/kafka-streams/argyle-streams/3_8/streams-data-record-stats-avro-br-store/streams-data-record-stats-avro-br-store-201505160000
>     at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:196)
>     at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:158)
>     at org.apache.kafka.streams.state.internals.RocksDBWindowStore$Segment.openDB(RocksDBWindowStore.java:72)
>     at org.apache.kafka.streams.state.internals.RocksDBWindowStore.getOrCreateSegment(RocksDBWindowStore.java:402)
>     at org.apache.kafka.streams.state.internals.RocksDBWindowStore.putAndReturnInternalKey(RocksDBWindowStore.java:310)
>     at org.apache.kafka.streams.state.internals.RocksDBWindowStore.put(RocksDBWindowStore.java:292)
>     at org.apache.kafka.streams.state.internals.MeteredWindowStore.put(MeteredWindowStore.java:101)
>     at org.apache.kafka.streams.state.internals.CachingWindowStore$1.apply(CachingWindowStore.java:87)
>     at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:117)
>     at org.apache.kafka.streams.state.internals.ThreadCache.flush(ThreadCache.java:100)
>     at org.apache.kafka.streams.state.internals.CachingWindowStore.flush(CachingWindowStore.java:118)
>     at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:320)
>     ... 7 more
> Caused by: org.rocksdb.RocksDBException: IO error: lock /tmp/kafka-streams/argyle-streams/3_8/streams-data-record-stats-avro-br-store/streams-data-record-stats-avro-br-store-201505160000/LOCK: No locks available
>     at org.rocksdb.RocksDB.open(Native Method)
>     at org.rocksdb.RocksDB.open(RocksDB.java:184)
>     at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:189)
>     ... 18 more
>
> Some sort of a locking bug?
>
> Note that when this happens, this node stops processing anything and the other nodes seem to want to pick up the load, which brings the whole streaming cluster to a standstill. That’s very worrying. Is there a document somewhere describing *in detail* how failover for streaming works?
>
> Ara.
--
-- Guozhang