Re: RocksDB flushing issue on 0.10.2 streams

2017-07-06 Thread Greg Fodor
Also, sorry, to clarify the job context:
- This is a job running across 5 nodes on AWS Linux.
- It is under load with a large number of partitions: approximately 700-800 topic-partition assignments in total for the entire job. Topics involved have a large # of partitions, 128 each.
- 32 stream

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-06 Thread Greg Fodor
That's great news, thanks! On Thu, Jul 6, 2017 at 6:18 AM, Damian Guy wrote: > Hi Greg, > I've been able to reproduce it by running multiple instances with standby > tasks and many threads. If I force some rebalances, then I see the failure. > Now to see if I can repro in

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-06 Thread Damian Guy
Hi Greg, I've been able to reproduce it by running multiple instances with standby tasks and many threads. If I force some rebalances, then I see the failure. Now to see if I can repro in a test. I think it is probably the same issue as: https://issues.apache.org/jira/browse/KAFKA-5070 On Thu, 6
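
For reference, a minimal configuration sketch of the kind of setup Damian describes for reproducing the failure (standby tasks plus many threads). The application id, bootstrap servers, and exact counts are illustrative placeholders, not values from this thread:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ReproConfigSketch {
        public static Properties config() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rocksdb-flush-repro");  // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 16);                 // "many threads" per instance
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);                // enables standby tasks
            return props;
        }
    }

Running several instances configured like this and forcing rebalances (e.g. by bouncing instances) matches the scenario described above.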

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-06 Thread Damian Guy
Greg, what OS are you running on? Are you able to reproduce this in a test at all? For instance, based on what you described, it would seem that I should be able to start a streams app, wait for it to be up and running, run the state dir cleanup, and see it fail. However, I can't reproduce it. On Wed,

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Damian Guy
Thanks, Greg. I'll look into it more tomorrow. I'm just finding it difficult to reproduce in a test. Thanks for providing the sequence; it gives me something to try and repro. Appreciated. Thanks, Damian On Wed, 5 Jul 2017 at 19:57, Greg Fodor wrote: > Also, the sequence of events is:

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Greg Fodor
Also, the sequence of events is:
- Job starts, rebalance happens, things run along smoothly.
- After 10 minutes (retrospectively) the cleanup task kicks in and removes some directories.
- Tasks immediately start failing when trying to flush their state stores.
On Wed, Jul 5, 2017 at 11:55 AM,

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Greg Fodor
The issue I am hitting is not one of the directory locking issues we've seen in the past. The issue seems to be, as you mentioned, that the state dir is getting deleted by the store cleanup process while there are still tasks running that are trying to flush the state store. It seems more than a little

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Damian Guy
BTW - I'm trying to reproduce it, but not having much luck so far... On Wed, 5 Jul 2017 at 09:27 Damian Guy wrote: > Thanks for the updates, Greg. There were some minor changes around this in > 0.11.0 to make it less likely to happen, but we've only ever seen the > locking

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Damian Guy
Thanks for the updates, Greg. There were some minor changes around this in 0.11.0 to make it less likely to happen, but we've only ever seen the locking fail in the event of a rebalance. When everything is running, state dirs shouldn't be deleted if they are being used, as the lock will fail. On

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-05 Thread Greg Fodor
I can report that setting state.cleanup.delay.ms to a very large value (effectively disabling it) works around the issue. It seems that the state store cleanup process can somehow get out ahead of another task that still thinks it should be writing to the state store/flushing it. In my test runs,
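
A minimal sketch of the workaround described here, with placeholder identifiers; setting state.cleanup.delay.ms to a very large value keeps the cleanup thread from deleting state directories:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class CleanupDelayWorkaroundSketch {
        public static Properties config() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-job");        // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            // Default is 600000 ms (10 minutes); a very large value effectively disables
            // the periodic state-directory cleanup, as described in the message above.
            props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, Long.MAX_VALUE);
            return props;
        }
    }

The 10-minute default also lines up with the timing Greg reports earlier in the thread, where cleanup kicks in roughly 10 minutes after the job starts.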

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-04 Thread Greg Fodor
Upon another run, I see the same error occur during a rebalance, so either my log was showing a rebalance or there is a shared underlying issue with state stores. On Tue, Jul 4, 2017 at 11:35 AM, Greg Fodor wrote: > Also, I am on 0.10.2.1, so poll interval was already set to

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-04 Thread Greg Fodor
Also, I am on 0.10.2.1, so poll interval was already set to MAX_VALUE. On Tue, Jul 4, 2017 at 11:28 AM, Greg Fodor wrote: > I've nuked the nodes this happened on, but the job had been running for > about 5-10 minutes across 5 nodes before this happened. Does the log show a >
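
For context, a sketch (placeholder identifiers) of how the embedded consumer's max.poll.interval.ms can be pinned explicitly through the Streams config; on 0.10.2.x, Streams already defaults this to Integer.MAX_VALUE, which is what Greg is referring to:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class PollIntervalSketch {
        public static Properties config() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-job");        // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            // Forward the setting to the consumer embedded in the streams app.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG),
                      Integer.MAX_VALUE);
            return props;
        }
    }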

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-04 Thread Greg Fodor
I've nuked the nodes this happened on, but the job had been running for about 5-10 minutes across 5 nodes before this happened. Does the log show a rebalance was happening? It looks to me like the standby task was just committing as part of normal operations. On Tue, Jul 4, 2017 at 7:40 AM,

Re: RocksDB flushing issue on 0.10.2 streams

2017-07-04 Thread Damian Guy
Hi Greg, it's obviously a bit difficult to read the RocksDBException, but my guess is that it is because the state directory gets deleted right before the flush happens:
2017-07-04 10:54:46,829 [myid:] - INFO [StreamThread-21:StateDirectory@213] - Deleting obsolete state directory 0_10 for task 0_10
Yes
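
For readers following the log line: task-level state lives under the directory configured by state.dir, laid out as {state.dir}/{application.id}/{taskId}, so a directory named 0_10 corresponds to task 0_10. A minimal sketch with placeholder values:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StateDirSketch {
        public static Properties config() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-job");        // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            // With these placeholder values, state for task 0_10 would live under
            // /var/lib/kafka-streams/example-job/0_10
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");  // placeholder path
            return props;
        }
    }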