Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-24 Thread Guozhang Wang
Henry, Thanks for the great feedback. I'm drafting a proposal for adding a control mechanism for latency vs. data-volume tradeoffs, which I will put up on the wiki once it is done: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Discussions We can continue the discussion from th

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Henry Cai
My use case is actually myTable.aggregate().to("output_topic"), so I need a way to suppress the number of outputs. I don't think correlating the internal cache flush with the output window emit frequency is ideal. It's too hard for application developers to see when the cache will be flushed, we w

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Jay Kreps
To summarize a chat session Guozhang and I just had on Slack: We currently do dedupe the output for stateful operations (e.g. join, aggregate). They hit an in-memory cache and only produce output to RocksDB/Kafka when either that cache fills up or the commit period occurs. So the changelog for the
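The flush cadence Jay describes maps onto Streams configuration. A hedged sketch: `commit.interval.ms` is a real Streams config of this era, while the `cache.max.bytes.buffering` name comes from the later KIP-63 caching work, so treat it as an assumption rather than something available at the time of this thread:

```properties
# Sketch only: how the "cache fills up or commit period occurs" behavior
# is (or later became) tunable in Kafka Streams.

# Flush/commit cadence: state and cached outputs are flushed at each commit.
commit.interval.ms=30000

# Total bytes of record cache per instance (ASSUMPTION: this config name is
# from the later KIP-63 work, not necessarily present in 0.10.0).
cache.max.bytes.buffering=10485760
```

With a larger cache and a longer commit interval, more updates to the same key collapse into a single downstream record, at the cost of higher output latency.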

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Henry Cai
0.10.0.1 is fine for me, I am actually building from trunk head for the streams package. On Wed, Apr 20, 2016 at 5:06 PM, Guozhang Wang wrote: > I saw that note, thanks for commenting. > > I am cutting the next 0.10.0.0 RC next week, so I am not certain if it > will make it for 0.10.0.0. But we can

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Guozhang Wang
I saw that note, thanks for commenting. I am cutting the next 0.10.0.0 RC next week, so I am not certain if it will make it for 0.10.0.0. But we can push it to be in 0.10.0.1. Guozhang On Wed, Apr 20, 2016 at 4:57 PM, Henry Cai wrote: > Thanks. > > Do you know when KAFKA-3101 will be implemen

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Henry Cai
Thanks. Do you know when KAFKA-3101 will be implemented? I also added a note to that JIRA for a left outer join use case which also needs buffering support. On Wed, Apr 20, 2016 at 4:42 PM, Guozhang Wang wrote: > Henry, > > I thought you were concerned about consumer memory contention. That's a > v

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Guozhang Wang
Henry, I thought you were concerned about consumer memory contention. That's a valid point, and yes, you need to keep those buffered records in a persistent store. As I mentioned, we are trying to optimize the aggregation outputs as in https://issues.apache.org/jira/browse/KAFKA-3101 Its idea

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Henry Cai
I think this scheme still has problems. If during 'holding' I literally hold (don't return from the method call), I will starve the thread. If I write the output to an in-memory buffer and let the method return, Kafka Streams will acknowledge the record to the upstream queue as processed, so I wou

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Guozhang Wang
By "holding the stream", I assume you are still consuming data, but just that you only write data every 10 minutes instead of upon each received record, right? Anyway, in either case, the consumer should not have severe memory issues, as Kafka Streams will pause its consuming when enough data is buffere

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Henry Cai
So holding the stream for 15 minutes wouldn't cause too many performance problems? On Wed, Apr 20, 2016 at 3:16 PM, Guozhang Wang wrote: > The consumer's buffer does not depend on offset committing; once a record is given > out from the poll() call it is out of the buffer. If offsets are not committed, > then upo

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-20 Thread Guozhang Wang
The consumer's buffer does not depend on offset committing; once a record is given out from the poll() call it is out of the buffer. If offsets are not committed, then upon failover it will simply re-consume these records again from Kafka. Guozhang On Tue, Apr 19, 2016 at 11:34 PM, Henry Cai wrote: > For the

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-19 Thread Henry Cai
For the technique of a custom Processor holding the call to context.forward(): if I hold it for 10 minutes, what does that mean for the consumer acknowledgement on the source node? I guess if I hold it for 10 minutes, the consumer is not going to ack to the upstream queue; will that impact the consumer p
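The pattern under discussion here can be sketched without any Kafka APIs. This is a minimal, illustrative model only: in a real Streams Processor, `buffer` would be a persistent state store, `punctuate()` would be driven by `context.schedule(...)`, and emission would call `context.forward(...)`; the class and method names below are assumptions for illustration, not Kafka APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Kafka-free sketch of "buffer updates, emit only on punctuate".
// process() never blocks (so the stream thread is not starved); it just
// upserts the latest value per key. punctuate() periodically forwards one
// record per key and clears the buffer.
public class BufferingProcessor {
    private final Map<String, Long> buffer = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    // Called once per incoming record: record the latest value and return.
    public void process(String key, long value) {
        buffer.put(key, value);
    }

    // Called on a timer: emit a single record per key seen since last flush.
    public void punctuate() {
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            emitted.add(e.getKey() + "=" + e.getValue());
        }
        buffer.clear();
    }

    public List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        BufferingProcessor p = new BufferingProcessor();
        // 1000 rapid updates to the same key during the window...
        for (long i = 0; i < 1000; i++) {
            p.process("user-42", i);
        }
        p.punctuate();
        // ...collapse to a single downstream record with the final value.
        System.out.println(p.emitted()); // prints [user-42=999]
    }
}
```

The open question in the thread remains accurate for this sketch: if the buffer is only in memory, records acked upstream but not yet punctuated are lost on failure, which is why the discussion moves toward a persistent store.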

Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-19 Thread Guozhang Wang
Yes, we are aware of this behavior and are working on optimizing it: https://issues.apache.org/jira/browse/KAFKA-3101 More generally, we are considering adding a "trigger" interface similar to the MillWheel model, where users can customize when they want to emit outputs to the downstream operators.

How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-19 Thread Jeff Klukas
Is it true that the aggregation and reduction methods of KStream will emit a new output message for each incoming message? I have an application that's copying a Postgres replication stream to a Kafka topic, and activity tends to be clustered, with many updates to a given primary key happening in
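The per-record emission behavior Jeff asks about can be shown with a small, Kafka-free model of a count-style aggregation. This is a sketch of the semantics only; the method name and structure are illustrative, not KStream APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Models the behavior asked about above: an aggregation that emits one
// updated result per incoming record, rather than one result per window.
public class PerRecordEmission {
    public static List<String> countWithPerRecordOutput(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> output = new ArrayList<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
            // One downstream record per input record, even for the same key.
            output.add(k + "=" + counts.get(k));
        }
        return output;
    }

    public static void main(String[] args) {
        // Three clustered updates to the same primary key yield three
        // output records, not one final record.
        System.out.println(countWithPerRecordOutput(Arrays.asList("pk1", "pk1", "pk1")));
        // prints [pk1=1, pk1=2, pk1=3]
    }
}
```

This is exactly the churn problem the thread addresses: with clustered updates to one primary key, downstream sees every intermediate count unless some buffering or caching collapses them.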