I have another (semi-related) question: I see in the documentation that DStream has a reduceByKey transformation. Does this work on _all_ elements in the stream as they come in, or is it a transformation per RDD/micro batch? I assume the latter; otherwise it would be more akin to updateStateByKey, right?
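Concretely, here's the difference I have in mind, as a plain-Python sketch (not Spark code; the batch data and helper names are made up for illustration) of per-batch reduceByKey semantics vs. cross-batch updateStateByKey semantics:

```python
from collections import defaultdict

# Two micro batches of (key, value) pairs -- made-up data.
batches = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 10), ("b", 20)],
]

def reduce_by_key(batch):
    """Aggregate within a single micro batch only."""
    out = defaultdict(int)
    for k, v in batch:
        out[k] += v
    return dict(out)

# reduceByKey-like: each micro batch is reduced independently.
per_batch_results = [reduce_by_key(b) for b in batches]

# updateStateByKey-like: state is carried forward across micro batches.
state = defaultdict(int)
for b in batches:
    for k, v in reduce_by_key(b).items():
        state[k] += v

print(per_batch_results)  # [{'a': 4, 'b': 2}, {'a': 10, 'b': 20}]
print(dict(state))        # {'a': 14, 'b': 22}
```

So my assumption is that reduceByKey gives the first result (one aggregate per batch), while updateStateByKey gives the second (a running aggregate across batches).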
On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger <c...@koeninger.org> wrote:
> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method).
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method).
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example.
>
> On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote:
> > Ah, that makes it much clearer, thanks!
> >
> > It also brings up an additional question: who/what decides on the
> > partitioning? Does Spark Streaming decide to divide a micro batch/RDD
> > into more than one partition based on size? Or is it something that
> > the "source" (SocketStream, KafkaStream, etc.) decides?
> >
> > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >> A micro batch is an RDD.
> >>
> >> An RDD has partitions, so different executors can work on different
> >> partitions concurrently.
> >>
> >> Don't think of that as multiple micro batches within a time slot.
> >> It's one RDD within a time slot, with multiple partitions.
> >>
> >> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie <debie.d...@gmail.com> wrote:
> >> > Thanks, but that thread does not answer my questions, which are
> >> > about the distributed nature of RDDs vs. the small nature of
> >> > "micro batches", and about how Spark Streaming distributes work.
> >> >
> >> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
> >> > <mich.talebza...@gmail.com> wrote:
> >> >> Hi Daan,
> >> >>
> >> >> You may find this link helpful: Re: Is "spark streaming" streaming
> >> >> or mini-batch? This was a thread in this forum not long ago.
> >> >>
> >> >> HTH
> >> >>
> >> >> Dr Mich Talebzadeh
> >> >>
> >> >> LinkedIn:
> >> >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>
> >> >> http://talebzadehmich.wordpress.com
> >> >>
> >> >> Disclaimer: Use it at your own risk. Any and all responsibility for
> >> >> any loss, damage or destruction of data or any other property which
> >> >> may arise from relying on this email's technical content is
> >> >> explicitly disclaimed. The author will in no case be liable for any
> >> >> monetary damages arising from such loss, damage or destruction.
> >> >>
> >> >> On 13 September 2016 at 14:25, DandyDev <debie.d...@gmail.com> wrote:
> >> >>> Hi all!
> >> >>>
> >> >>> When reading about Spark Streaming and its execution model, I see
> >> >>> diagrams like this a lot:
> >> >>>
> >> >>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
> >> >>>
> >> >>> It does a fine job explaining how DStreams consist of micro batches
> >> >>> that are basically RDDs. There are, however, some things I don't
> >> >>> understand:
> >> >>>
> >> >>> - RDDs are distributed by design, but micro batches are conceptually
> >> >>>   small. How/why are these micro batches distributed, such that they
> >> >>>   need to be implemented as RDDs?
> >> >>> - The above image doesn't explain how Spark Streaming parallelizes
> >> >>>   data. According to the image, a stream of events gets broken into
> >> >>>   micro batches over the axis of time (time 0 to 1 is a micro batch,
> >> >>>   time 1 to 2 is a micro batch, etc.). How does parallelism come into
> >> >>>   play here? Is it that even within a "time slot" (e.g. time 0 to 1)
> >> >>>   there can be so many events that multiple micro batches for that
> >> >>>   time slot will be created and distributed across the executors?
> >> >>>
> >> >>> Clarification would be helpful!
> >> >>>
> >> >>> Daan
> >> >>>
> >> >>> --
> >> >>> View this message in context:
> >> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-dividing-DStream-into-mini-batches-tp27699.html
> >> >>> Sent from the Apache Spark User List mailing list archive at
> >> >>> Nabble.com.
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
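To tie Cody's two answers above together, here is a toy model (plain Python, not Spark; the class and method names are invented for illustration) of the contract he describes: the stream produces exactly one RDD per batch interval (cf. compute), the RDD itself decides its partitioning (cf. getPartitions), and executors can then work on those partitions concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyRDD:
    """Stand-in for an RDD: one micro batch, split into partitions."""
    def __init__(self, events, num_partitions):
        self.events = events
        self.num_partitions = num_partitions

    def get_partitions(self):
        # The RDD implementation decides the partitioning.
        return [self.events[i::self.num_partitions]
                for i in range(self.num_partitions)]

class ToyDStream:
    """Stand-in for a DStream: produces exactly one RDD per batch interval."""
    def __init__(self, source_by_time, num_partitions):
        self.source_by_time = source_by_time
        self.num_partitions = num_partitions

    def compute(self, t):
        # The DStream implementation decides how to produce an RDD
        # for a given time.
        return ToyRDD(self.source_by_time[t], self.num_partitions)

# One "time slot" of events -> one RDD with several partitions,
# processed concurrently by simulated executors.
stream = ToyDStream({0: list(range(10))}, num_partitions=3)
rdd = stream.compute(0)
partitions = rdd.get_partitions()

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

print(len(partitions))    # 3 -- partitions of one RDD, not 3 micro batches
print(sum(partial_sums))  # 45 -- same total as processing the batch serially
```

The point of the toy model is the last part of the thread: parallelism within a time slot comes from the partitions of that interval's single RDD, not from creating multiple micro batches per interval.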