Re: What's the difference between map and transform in spark streaming?

Tathagata Das Thu, 27 Feb 2014 18:36:07 -0800

To clarify further. A* yourDStream.map(record => yourFunction(record)) * will
do something on every record in every RDDs in the DStream. Which
essentially means every records in the DStream. But *yourDStream.transform(rdd
=> anotherFunction(rdd))* allows you to do arbitrary stuff on every RDD in
the DStream. For example, if you do

*yourDStream.transform(rdd => rdd.map(record => yourFunction(record)) * is
exactly same as the one in the first line. Only a map function.

However, you can also do

*yourDStream.transform(rdd =>
rdd.map(...).reduceByKey(....).filter(...).flatMap(....).sortByKey(...) ) *

which obviously involves multiple stages of shuffles by keys. So
*transform*is far more general operation than *map
*that allows arbitrary computations on each RDD of a DStream. For example,
say you want to sort every batch of data by a key. Currerntly, there is no
DStream.sortByKey() to do that. However, you can easily use transform to do
 *DStream.transform(rdd => rdd.sortByKey()).*

TD

On Thu, Feb 27, 2014 at 1:46 PM, Mayur Rustagi <[email protected]>wrote:

> DStream -> RDD ->partitions ->  rows
>
> map works on each rows
> mapPartitions: works on each partition
> transform: each rdd
>
> Mayur Rustagi
> Ph: +919632149971
> h <https://twitter.com/mayur_rustagi>ttp://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Thu, Feb 27, 2014 at 1:42 PM, Aureliano Buendia 
> <[email protected]>wrote:
>
>> Hi,
>>
>> Is transform supposed to be a generalized form of map? They seem to be
>> doing the same job.
>>
>
>

Re: What's the difference between map and transform in spark streaming?

Reply via email to