To clarify further:

*yourDStream.map(record => yourFunction(record))* applies a function to every record in every RDD of the DStream, which essentially means every record in the DStream. *yourDStream.transform(rdd => anotherFunction(rdd))*, on the other hand, lets you do arbitrary things to every RDD in the DStream.

For example, *yourDStream.transform(rdd => rdd.map(record => yourFunction(record)))* is exactly the same as the map call above: only a map function. However, you can also do *yourDStream.transform(rdd => rdd.map(...).reduceByKey(...).filter(...).flatMap(...).sortByKey(...))*, which involves multiple stages of key-based shuffles.

So *transform* is a far more general operation than *map*: it allows arbitrary computations on each RDD of a DStream. For example, say you want to sort every batch of data by key. Currently, there is no DStream.sortByKey() to do that. However, you can easily use transform to do *DStream.transform(rdd => rdd.sortByKey())*.

TD

On Thu, Feb 27, 2014 at 1:46 PM, Mayur Rustagi <[email protected]> wrote:

> DStream -> RDD -> partitions -> rows
>
> map: works on each row
> mapPartitions: works on each partition
> transform: works on each RDD
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
> On Thu, Feb 27, 2014 at 1:42 PM, Aureliano Buendia <[email protected]> wrote:
>
>> Hi,
>>
>> Is transform supposed to be a generalized form of map? They seem to be
>> doing the same job.
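For reference, here is a minimal sketch of the map and transform calls discussed in the reply at the top of this message, written as plain functions over a DStream. The object name, the String record type, and the body of yourFunction are assumptions made for illustration only. It also assumes Spark 1.3 or later, where the pair-RDD implicits are picked up automatically; on older releases you would likely need import org.apache.spark.SparkContext._ as well.

```scala
import org.apache.spark.streaming.dstream.DStream

object TransformVsMap {
  // Hypothetical per-record function standing in for yourFunction:
  // it turns each record into a (key, 1) pair.
  def yourFunction(record: String): (String, Int) = (record, 1)

  // map: applied to every record of every RDD in the DStream.
  def viaMap(lines: DStream[String]): DStream[(String, Int)] =
    lines.map(record => yourFunction(record))

  // transform with only a map inside: equivalent to viaMap,
  // but expressed as an RDD-to-RDD function applied per batch.
  def viaTransform(lines: DStream[String]): DStream[(String, Int)] =
    lines.transform(rdd => rdd.map(record => yourFunction(record)))

  // transform with several RDD operations chained per batch:
  // count records per key, then sort each batch by key.
  def countsSortedByKey(lines: DStream[String]): DStream[(String, Int)] =
    lines.transform { rdd =>
      rdd.map(record => yourFunction(record))
        .reduceByKey(_ + _)
        .sortByKey()
    }
}
```

The sort in countsSortedByKey applies within each batch's RDD, not across batches, since transform runs the function once per RDD.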
