I have another (semi-related) question: I see in the documentation that DStream has a reduceByKey transformation. Does this work on _all_ elements in the stream as they come in, or is it a transformation per RDD/micro batch? I assume the latter; otherwise it would be more akin to updateStateByKey, right?
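Concretely, here's the difference I have in mind, as a plain-Python sketch (not Spark code; the batch data and helper names are made up for illustration) of per-batch reduceByKey semantics vs. cross-batch updateStateByKey semantics:

```python
from collections import defaultdict

# Two micro batches of (key, value) pairs -- made-up data.
batches = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 10), ("b", 20)],
]

def reduce_by_key(batch):
    """Aggregate within a single micro batch only."""
    out = defaultdict(int)
    for k, v in batch:
        out[k] += v
    return dict(out)

# reduceByKey-like: each micro batch is reduced independently.
per_batch_results = [reduce_by_key(b) for b in batches]

# updateStateByKey-like: state is carried forward across micro batches.
state = defaultdict(int)
for b in batches:
    for k, v in reduce_by_key(b).items():
        state[k] += v

print(per_batch_results)  # [{'a': 4, 'b': 2}, {'a': 10, 'b': 20}]
print(dict(state))        # {'a': 14, 'b': 22}
```

So my assumption is that reduceByKey gives the first result (one aggregate per batch), while updateStateByKey gives the second (a running aggregate across batches).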
On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <c...@koeninger.org> wrote:
> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method).
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method).
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example.
>
> On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote:
> > Ah, that makes it much clearer, thanks!
> >
> > It also brings up an additional question: who/what decides on the
> > partitioning? Does Spark Streaming decide to divide a micro batch/RDD
> > into more than one partition based on size? Or is it something that
> > the "source" (SocketStream, KafkaStream, etc.) decides?
> >
> > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >> A micro batch is an RDD.
> >>
> >> An RDD has partitions, so different executors can work on different
> >> partitions concurrently.
> >>
> >> Don't think of that as multiple micro batches within a time slot.
> >> It's one RDD within a time slot, with multiple partitions.
> >>
> >> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> >> > Thanks, but that thread does not answer my questions, which are
> >> > about the distributed nature of RDDs vs. the small nature of
> >> > "micro batches", and about how Spark Streaming distributes work.
> >> >
> >> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
> >> > <mich.talebza...@gmail.com> wrote:
> >> >> Hi Daan,
> >> >>
> >> >> You may find this link helpful: Re: Is "spark streaming" streaming
> >> >> or mini-batch? This was a thread in this forum not long ago.
> >> >>
> >> >> HTH
> >> >>
> >> >> Dr Mich Talebzadeh
> >> >>
> >> >> LinkedIn:
> >> >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>
> >> >> http://talebzadehmich.wordpress.com
> >> >>
> >> >> Disclaimer: Use it at your own risk. Any and all responsibility for
> >> >> any loss, damage or destruction of data or any other property which
> >> >> may arise from relying on this email's technical content is
> >> >> explicitly disclaimed. The author will in no case be liable for any
> >> >> monetary damages arising from such loss, damage or destruction.
> >> >>
> >> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
> >> >>> Hi all!
> >> >>>
> >> >>> When reading about Spark Streaming and its execution model, I see
> >> >>> diagrams like this a lot:
> >> >>>
> >> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
> >> >>>
> >> >>> It does a fine job explaining how DStreams consist of micro batches
> >> >>> that are basically RDDs. There are, however, some things I don't
> >> >>> understand:
> >> >>>
> >> >>> - RDDs are distributed by design, but micro batches are conceptually
> >> >>>   small. How/why are these micro batches distributed, such that they
> >> >>>   need to be implemented as RDDs?
> >> >>> - The above image doesn't explain how Spark Streaming parallelizes
> >> >>>   data. According to the image, a stream of events gets broken into
> >> >>>   micro batches over the axis of time (time 0 to 1 is a micro batch,
> >> >>>   time 1 to 2 is a micro batch, etc.). How does parallelism come into
> >> >>>   play here? Is it that even within a "time slot" (e.g. time 0 to 1)
> >> >>>   there can be so many events that multiple micro batches for that
> >> >>>   time slot will be created and distributed across the executors?
> >> >>>
> >> >>> Clarification would be helpful!
> >> >>>
> >> >>> Daan
> >> >>>
> >> >>> --
> >> >>> View this message in context:
> >> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-dividing-DStream-into-mini-batches-tp27699.html
> >> >>> Sent from the Apache Spark User List mailing list archive at
> >> >>> Nabble.com.
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
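To tie Cody's two answers above together, here is a toy model (plain Python, not Spark; the class and method names are invented for illustration) of the contract he describes: the stream produces exactly one RDD per batch interval (cf. compute), the RDD itself decides its partitioning (cf. getPartitions), and executors can then work on those partitions concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyRDD:
    """Stand-in for an RDD: one micro batch, split into partitions."""
    def __init__(self, events, num_partitions):
        self.events = events
        self.num_partitions = num_partitions

    def get_partitions(self):
        # The RDD implementation decides the partitioning.
        return [self.events[i::self.num_partitions]
                for i in range(self.num_partitions)]

class ToyDStream:
    """Stand-in for a DStream: produces exactly one RDD per batch interval."""
    def __init__(self, source_by_time, num_partitions):
        self.source_by_time = source_by_time
        self.num_partitions = num_partitions

    def compute(self, t):
        # The DStream implementation decides how to produce an RDD
        # for a given time.
        return ToyRDD(self.source_by_time[t], self.num_partitions)

# One "time slot" of events -> one RDD with several partitions,
# processed concurrently by simulated executors.
stream = ToyDStream({0: list(range(10))}, num_partitions=3)
rdd = stream.compute(0)
partitions = rdd.get_partitions()

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

print(len(partitions))    # 3 -- partitions of one RDD, not 3 micro batches
print(sum(partial_sums))  # 45 -- same total as processing the batch serially
```

The point of the toy model is the last part of the thread: parallelism within a time slot comes from the partitions of that interval's single RDD, not from creating multiple micro batches per interval.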