Hey,

It seems pretty clear that one of the strengths of Spark is being able to share code between the batch and streaming layers. However, given that Spark Streaming works with a DStream, which is a sequence of RDDs, while a batch job works on a single RDD, there may be some complexity involved.
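To make concrete what I mean by sharing code, here is a minimal sketch (SharedLogic, parse, batchJob and streamingJob are made-up names of mine, not anything from an existing codebase): the same parsing function is reused by a batch job working on an RDD and a streaming job working on a DStream.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object SharedLogic {
  // Plain function with no Spark dependency, callable from both layers.
  def parse(line: String): (String, Int) = {
    val fields = line.split(",")
    (fields(0), fields(1).toInt)
  }

  // Batch layer: operates on a single RDD.
  def batchJob(lines: RDD[String]): RDD[(String, Int)] =
    lines.map(parse)

  // Streaming layer: operates on a DStream, i.e. a sequence of RDDs over time.
  def streamingJob(lines: DStream[String]): DStream[(String, Int)] =
    lines.map(parse)
}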
Of course, since a DStream is just a sequence of RDDs, one can run the same code at the RDD granularity using DStream::foreachRDD. While this should work for the map side, I am not sure how it can work for the reduce phase, given that a group of keys spans multiple RDDs.

One option is to change the dataset object that a job works on: instead of passing an RDD to a class method, one passes a higher-level object (MetaRDD) that wraps either an RDD or a DStream depending on the context. The job then calls its regular maps, reduces and so on, and the MetaRDD wrapper delegates accordingly; a rough sketch of what I have in mind is at the end of this message.

I would just like to know the official best practice from the Spark community though. Thanks,
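For reference, here is roughly what I mean by the MetaRDD wrapper. This is entirely hypothetical on my part, not an existing Spark API: MetaRDD is my own name, BatchRDD and StreamRDD are made-up implementations, and only map is shown (reduceByKey and friends would be delegated the same way, which is exactly where my question about keys spanning multiple RDDs comes in).

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical wrapper: the job code is written against MetaRDD and does not
// know whether it is running on a batch RDD or on a DStream.
sealed trait MetaRDD[T] {
  def map[U: ClassTag](f: T => U): MetaRDD[U]
}

case class BatchRDD[T](rdd: RDD[T]) extends MetaRDD[T] {
  // Batch context: delegate straight to the underlying RDD.
  def map[U: ClassTag](f: T => U): MetaRDD[U] = BatchRDD(rdd.map(f))
}

case class StreamRDD[T](stream: DStream[T]) extends MetaRDD[T] {
  // Streaming context: delegate to the DStream, i.e. to each RDD it contains.
  def map[U: ClassTag](f: T => U): MetaRDD[U] = StreamRDD(stream.map(f))
}

A job would then be written once, e.g. def job(data: MetaRDD[String]) = data.map(SharedLogic.parse), and be handed either wrapper depending on the layer. My worry about the reduce phase still applies, though, since on the streaming side the delegation naturally happens per RDD/batch.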