Hi Paolo,

Yeah, so if you want fewer records, you should actually "not" disable the cache. 
If you disable the cache you'll get all the records, as you described.
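
For example, keeping the cache on and tuning how often it flushes could look 
like this (a minimal sketch; the size and interval values are just illustrative):

  import java.util.Properties;
  import org.apache.kafka.streams.StreamsConfig;

  Properties props = new Properties();
  // Keep the record cache enabled (e.g. 10 MB) so that consecutive updates
  // to the same key are coalesced before being forwarded downstream.
  props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
  // The cache is flushed on every commit, so a larger commit interval
  // also means fewer downstream records.
  props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5000);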

About closing windows: if you close a window and a late record arrives that 
should have been in that window, you basically lose the ability to process that 
record. In Kafka Streams we are robust to that: we handle late-arriving records 
by keeping a window's result updatable. There is a comparison here, for example, 
with approaches that depend on watermarks or triggers: 
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
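
The flip side is that a window must be retained for as long as you want late 
records to be able to update it. In the DSL that retention can be set per 
window, e.g. (a minimal sketch; the 1-minute retention is just an illustrative 
value):

  import org.apache.kafka.streams.kstream.TimeWindows;

  // 5-second windows that are retained (updatable and queryable) for 1 minute,
  // so a record arriving up to ~1 minute late still updates its window.
  TimeWindows windows = TimeWindows.of(5000).until(60000);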

Eno


> On 15 Jun 2017, at 14:57, Paolo Patierno <ppatie...@live.com> wrote:
> 
> Hi Eno,
> 
> 
> thanks for the reply !
> 
> Regarding the cache, I'm already using CACHE_MAX_BYTES_BUFFERING_CONFIG = 0 
> (so the cache is disabled).
> 
> Regarding the Interactive Query APIs (I'll take a look), it means that it's up 
> to the application to do something that we get out of the box with Spark.
> 
> May I ask what you mean by "We don’t believe in closing windows"? Isn't 
> that much more code the user has to write to get the same result?
> 
> I'm exploring Kafka Streams and it's very powerful IMHO, partly because its 
> usage is pretty simple, but in this scenario it could fall short compared to Spark.
> 
> 
> Thanks,
> 
> Paolo.
> 
> 
> Paolo Patierno
> Senior Software Engineer (IoT) @ Red Hat
> Microsoft MVP on Windows Embedded & IoT
> Microsoft Azure Advisor
> 
> Twitter : @ppatierno<http://twitter.com/ppatierno>
> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
> Blog : DevExperience<http://paolopatierno.wordpress.com/>
> 
> 
> ________________________________
> From: Eno Thereska <eno.there...@gmail.com>
> Sent: Thursday, June 15, 2017 1:45 PM
> To: users@kafka.apache.org
> Subject: Re: Kafka Streams vs Spark Streaming : reduce by window
> 
> Hi Paolo,
> 
> That is indeed correct. We don’t believe in closing windows in Kafka Streams.
> You could reduce the number of downstream records by using record caches: 
> http://docs.confluent.io/current/streams/developer-guide.html#record-caches-in-the-dsl
> 
> Alternatively, you can just query the KTable whenever you want using the 
> Interactive Query APIs (so when you query dictates what data you receive); see: 
> https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
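> 
> For a windowed aggregate, such a query could look roughly like this (a minimal 
> sketch; the store name "max-store" and the String/Long types are assumptions 
> for illustration):
> 
>   import org.apache.kafka.streams.KeyValue;
>   import org.apache.kafka.streams.state.QueryableStoreTypes;
>   import org.apache.kafka.streams.state.ReadOnlyWindowStore;
>   import org.apache.kafka.streams.state.WindowStoreIterator;
> 
>   // "streams" is the running KafkaStreams instance.
>   ReadOnlyWindowStore<String, Long> store =
>       streams.store("max-store", QueryableStoreTypes.<String, Long>windowStore());
>   long now = System.currentTimeMillis();
>   // Fetch the per-window values for "key" over the last 5 seconds.
>   try (WindowStoreIterator<Long> it = store.fetch("key", now - 5000, now)) {
>       while (it.hasNext()) {
>           KeyValue<Long, Long> entry = it.next(); // window start timestamp -> value
>           System.out.println(entry.key + " -> " + entry.value);
>       }
>   }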
> 
> Thanks
> Eno
>> On Jun 15, 2017, at 2:38 PM, Paolo Patierno <ppatie...@live.com> wrote:
>> 
>> Hi,
>> 
>> 
>> using the Streams library I noticed a difference (or there is a lack of 
>> knowledge on my side) with Apache Spark.
>> 
>> Imagine the following scenario ...
>> 
>> 
>> I have a source topic where numeric values come in and I want to check the 
>> maximum value in the latest 5 seconds but ... putting the max value into a 
>> destination topic only every 5 seconds.
>> 
>> This is what happens with the reduceByWindow method in Spark.
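>> 
>> In Spark Streaming (Java API) that is roughly (a sketch from memory; the 
>> stream name is just a placeholder):
>> 
>>   import org.apache.spark.streaming.Durations;
>>   import org.apache.spark.streaming.api.java.JavaDStream;
>> 
>>   // Max over a 5-second window, emitted once per 5-second slide,
>>   // i.e. one output value when the window closes.
>>   JavaDStream<Long> max = values.reduceByWindow(
>>       (a, b) -> Math.max(a, b), Durations.seconds(5), Durations.seconds(5));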
>> 
>> I'm using reduce on a KStream here that computes the max value, taking into 
>> account previous values in the latest 5 seconds, but the final value is put 
>> into the destination topic for each incoming value.
>> 
>> 
>> For example ...
>> 
>> 
>> An application sends numeric values every 1 second.
>> 
>> With Spark ... the source gets values every 1 second, computes the max in a 
>> window of 5 seconds, and puts the max into the destination every 5 seconds (so 
>> when the window ends). If the sequence is 21, 25, 22, 20, 26 the output will 
>> be just 26.
>> 
>> With Kafka Streams ... the source gets values every 1 second, computes the max 
>> in a window of 5 seconds, and puts the max into the destination every 1 second 
>> (so every time an incoming value arrives). So if, for example, the 
>> sequence is 21, 25, 22, 20, 26 ... the output will be 21, 25, 25, 25, 26.
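>> 
>> In code, what I'm doing is roughly this (a minimal sketch; topic and store 
>> names are just placeholders, API as of the 0.10.x releases):
>> 
>>   import org.apache.kafka.common.serialization.Serdes;
>>   import org.apache.kafka.streams.kstream.*;
>> 
>>   KStreamBuilder builder = new KStreamBuilder();
>>   KStream<String, Long> values = builder.stream(Serdes.String(), Serdes.Long(), "numbers");
>>   // Max per key over 5-second windows; every incoming record triggers an
>>   // updated max being sent downstream (the behaviour described above).
>>   KTable<Windowed<String>, Long> max = values
>>       .groupByKey(Serdes.String(), Serdes.Long())
>>       .reduce((a, b) -> Math.max(a, b), TimeWindows.of(5000), "max-store");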
>> 
>> 
>> Is this possible with Kafka Streams? Or is it something to do at the 
>> application level?
>> 
>> 
>> Thanks,
>> 
>> Paolo
>> 
>> 
>> Paolo Patierno
>> Senior Software Engineer (IoT) @ Red Hat
>> Microsoft MVP on Windows Embedded & IoT
>> Microsoft Azure Advisor
>> 
>> Twitter : @ppatierno<http://twitter.com/ppatierno>
>> Linkedin : paolopatierno<http://it.linkedin.com/in/paolopatierno>
>> Blog : DevExperience<http://paolopatierno.wordpress.com/>
> 
