Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
Yeah. If you're looking to reduce over more than one microbatch/rdd, there's also reduceByKeyAndWindow.

On Thu, Sep 15, 2016 at 4:27 AM, Daan Debie wrote:
> I have another (semi-related) question: I see in the documentation that
> DStream has a transformation reduceByKey.
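To make the difference concrete, here is a minimal plain-Python model (not Spark code) of the two behaviours discussed above: a reduce applied to each micro-batch on its own, and a reduce over a sliding window of several micro-batches. The function names and the sample data are made up for illustration, and the window here is counted in batches rather than in time, which is a simplification.

```python
from collections import Counter, deque


def reduce_by_key(batch):
    """Sum values per key within a single micro-batch (one RDD)."""
    totals = Counter()
    for key, value in batch:
        totals[key] += value
    return dict(totals)


def reduce_by_key_and_window(batches, window_length):
    """Sum values per key over the last `window_length` micro-batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    results = []
    for batch in batches:
        window.append(batch)
        merged = [pair for b in window for pair in b]
        results.append(reduce_by_key(merged))
    return results


# Three consecutive micro-batches of (key, value) pairs (hypothetical data).
batches = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("b", 1), ("b", 1)],
]

# Per-batch reduce: each result only sees its own micro-batch.
per_batch = [reduce_by_key(b) for b in batches]

# Windowed reduce: each result sees the current and the previous micro-batch.
windowed = reduce_by_key_and_window(batches, window_length=2)
```

In the per-batch case the counts never carry over from one batch to the next; in the windowed case each result combines the batches that currently fall inside the window.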

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Daan Debie
I have another (semi-related) question: I see in the documentation that DStream has a transformation reduceByKey. Does this work on _all_ elements in the stream, as they're coming in, or is this a transformation per RDD/micro batch? I assume the latter, otherwise it would be more akin to

Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
Thanks for the awesome explanation! It's super clear to me now :)

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger wrote:
> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream etc.) decides?

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
The DStream implementation decides how to produce an RDD for a time (this is the compute method).

The RDD implementation decides how to partition things (this is the getPartitions method).

You can look at those methods in DirectKafkaInputDStream and KafkaRDD respectively if you want to see an
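As a rough illustration of that split of responsibilities, here is a plain-Python toy model (not the real Spark classes). ToyDirectStream, ToyKafkaRDD and OffsetRange are hypothetical stand-ins, and the rule "one RDD partition per Kafka partition offset range" is a simplified assumption about how the direct Kafka stream behaves, not something copied from the Spark source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """A slice of one Kafka partition: offsets [from_offset, until_offset)."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int


class ToyKafkaRDD:
    """Stand-in for an RDD: it alone decides what its partitions are."""

    def __init__(self, offset_ranges):
        self.offset_ranges = list(offset_ranges)

    def get_partitions(self):
        # Assumption: one RDD partition per offset range.
        return list(range(len(self.offset_ranges)))


class ToyDirectStream:
    """Stand-in for a DStream: it decides which RDD belongs to a batch time."""

    def __init__(self, topic, num_kafka_partitions, records_per_batch):
        self.topic = topic
        self.records_per_batch = records_per_batch
        self.next_offsets = {p: 0 for p in range(num_kafka_partitions)}

    def compute(self, batch_time):
        # Build one offset range per Kafka partition for this batch,
        # then advance the stored offsets for the next batch.
        ranges = []
        for p, start in self.next_offsets.items():
            until = start + self.records_per_batch
            ranges.append(OffsetRange(self.topic, p, start, until))
            self.next_offsets[p] = until
        return ToyKafkaRDD(ranges)


stream = ToyDirectStream("events", num_kafka_partitions=3, records_per_batch=10)
first_rdd = stream.compute(batch_time=0)
second_rdd = stream.compute(batch_time=1)
```

The stream object only answers "which data belongs to this batch time", while the RDD object answers "how is that data split into partitions".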

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
A micro batch is an RDD. An RDD has partitions, so different executors can work on different partitions concurrently. Don't think of that as multiple micro-batches within a time slot. It's one RDD within a time slot, with multiple partitions.

On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie
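A small plain-Python sketch of that picture, with a thread pool standing in for executors: one micro-batch, split into several partitions that are processed concurrently and then combined. This is only an analogy; the helper names, the round-robin split and the sample records are assumptions for illustration, not how Spark schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Round-robin split of one micro-batch into partitions (illustrative only)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Work done on one partition; here simply a partial sum."""
    return sum(partition)


# One micro-batch for one time slot (hypothetical records).
micro_batch = list(range(1, 11))

# Still a single batch, just divided into three partitions.
partitions = split_into_partitions(micro_batch, 3)

# Each worker thread plays the role of an executor handling one partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)
```

There is still exactly one batch per time slot; the concurrency comes from its partitions, not from extra batches.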

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Thanks, but that thread does not answer my questions, which are about the distributed nature of RDDs vs the small nature of "micro batches" and on how Spark Streaming distributes work.

On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh wrote:
> Hi Daan,
>
> You may find

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Mich Talebzadeh
Hi Daan,

You may find this thread helpful: Re: Is "spark streaming" streaming or mini-batch?

This was a thread in this forum not long ago.

HTH

Dr Mich Talebzadeh