Hi,

Regards,
Madhukara Phatak
http://datamantra.io/
On Tue, Mar 17, 2015 at 2:31 PM, Tathagata Das <t...@databricks.com> wrote:

> That's not super essential, and hence hasn't been done till now. Even in
> core Spark there are MappedRDD, etc., even though all of them could be
> implemented by MapPartitionedRDD (maybe the name is wrong). So it's nice to
> maintain the consistency: MappedDStream creates MappedRDDs. :)
>
> Though this does not eliminate the possibility that we will do it. Maybe
> in the future, if we find that maintaining these different DStreams is
> becoming a maintenance burden (it isn't yet), we may collapse them to use
> transform. We did so in the Python API for exactly this reason.
>

Ok. When I was going through the source code, it confused me as to what the
right extension points were. So I thought whoever goes through the code may
get into the same situation. But if it's not super essential, then ok.

> If you are interested in contributing to Spark Streaming, I can point you
> to a number of issues where your contributions will be more valuable.
>

Yes please.

>
> TD
>
> On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak <phatak....@gmail.com>
> wrote:
>
>> Hi,
>> Thank you for the response.
>>
>> Can I give a PR to use transform for all the functions like map, flatMap,
>> etc., so they are consistent with the other APIs?
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>> On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> It's mostly for legacy reasons. First we added all the MappedDStream,
>>> etc., and then later we realized we needed to expose something more
>>> generic for arbitrary RDD-to-RDD transformations. It can be easily
>>> replaced. However, there is slight value in having MappedDStream, for
>>> developers to learn about DStreams.
>>>
>>> TD
>>>
>>> On Mon, Mar 16, 2015 at 3:37 AM, madhu phatak <phatak....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> Thanks for the response. I understand that part. But I am asking why
>>>> the internal implementation uses a subclass when it can use an existing
>>>> API. Unless there is a real difference, it feels like a code smell to me.
>>>>
>>>> Regards,
>>>> Madhukara Phatak
>>>> http://datamantra.io/
>>>>
>>>> On Mon, Mar 16, 2015 at 2:14 PM, Shao, Saisai <saisai.s...@intel.com>
>>>> wrote:
>>>>
>>>>> I think these two ways are both OK for writing a streaming job.
>>>>> `transform` is a more general way to transform one DStream into
>>>>> another when there's no related DStream API (but there is a related
>>>>> RDD API). But using map may be more straightforward and easier to
>>>>> understand.
>>>>>
>>>>> Thanks
>>>>> Jerry
>>>>>
>>>>> *From:* madhu phatak [mailto:phatak....@gmail.com]
>>>>> *Sent:* Monday, March 16, 2015 4:32 PM
>>>>> *To:* user@spark.apache.org
>>>>> *Subject:* MappedStream vs Transform API
>>>>>
>>>>> Hi,
>>>>>
>>>>> The current implementation of the map function in Spark Streaming
>>>>> looks as below:
>>>>>
>>>>> def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
>>>>>   new MappedDStream(this, context.sparkContext.clean(mapFunc))
>>>>> }
>>>>>
>>>>> It creates an instance of MappedDStream, which is a subclass of
>>>>> DStream.
>>>>>
>>>>> The same function could also be implemented using the transform API:
>>>>>
>>>>> def map[U: ClassTag](mapFunc: T => U): DStream[U] =
>>>>>   this.transform(rdd => {
>>>>>     rdd.map(mapFunc)
>>>>>   })
>>>>>
>>>>> Both implementations look the same. If they are the same, is there
>>>>> any advantage to having a subclass of DStream? Why can't we just use
>>>>> the transform API?
>>>>>
>>>>> Regards,
>>>>> Madhukara Phatak
>>>>> http://datamantra.io/
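The equivalence the thread debates can be sketched without a Spark cluster. Below is a minimal, self-contained Scala model: a "stream" is just a sequence of micro-batches (plain `Seq[Seq[T]]` standing in for RDDs), with a generic `transformBatches` playing the role of `DStream.transform`. The names `SimpleDStream`, `transformBatches`, `mapViaTransform`, and `mapDirect` are illustrative inventions for this sketch, not part of the real Spark API; `mapDirect` stands in for the dedicated `MappedDStream` subclass.

```scala
// Toy model of the DStream API discussed above: a stream is a sequence of
// micro-batches. None of these names are real Spark classes or methods.
final case class SimpleDStream[T](batches: Seq[Seq[T]]) {

  // The generic extension point: apply an arbitrary batch-to-batch
  // function, analogous to DStream.transform(rdd => ...).
  def transformBatches[U](f: Seq[T] => Seq[U]): SimpleDStream[U] =
    SimpleDStream(batches.map(f))

  // map expressed via the generic transform, as the thread proposes.
  def mapViaTransform[U](mapFunc: T => U): SimpleDStream[U] =
    transformBatches(batch => batch.map(mapFunc))

  // map expressed "directly", standing in for the MappedDStream subclass.
  def mapDirect[U](mapFunc: T => U): SimpleDStream[U] =
    SimpleDStream(batches.map(_.map(mapFunc)))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val stream = SimpleDStream(Seq(Seq(1, 2), Seq(3), Seq.empty[Int]))

    val viaTransform = stream.mapViaTransform(_ * 10)
    val direct       = stream.mapDirect(_ * 10)

    // Both routes produce identical batches, which is TD's point: the
    // subclass is not functionally necessary, only a design convention.
    assert(viaTransform == direct)
    println(viaTransform.batches) // List(List(10, 20), List(30), List())
  }
}
```

This also illustrates Jerry's observation: `transformBatches` accepts any batch-level function (e.g. sorting or filtering a whole batch), while the dedicated `map` path covers only the one element-wise operation it was written for.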