Re: RDD to DStream

2014-11-12 Thread Jianshi Huang

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Yeah, you're absolutely right Saisai. My point is we should allow this kind of logic in RDD, let's say transforming type RDD[(Key, Iterable[T])] to Seq[(Key, RDD[T])]. Make sense? Jianshi
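Jianshi's proposed shape can be sketched on plain Scala collections standing in for RDDs (a runnable example can't depend on a Spark cluster; `splitByKey` is a hypothetical name, not an API from the thread). On a real RDD the per-key filter would cost one full scan of the parent RDD per key, which is the inefficiency discussed later in the thread.

```scala
// Hedged sketch: model RDD[(Key, Iterable[T])] -> Seq[(Key, RDD[T])]
// with plain Seq standing in for RDD. splitByKey is a hypothetical helper.
def splitByKey[K, T](grouped: Seq[(K, Iterable[T])]): Seq[(K, Seq[T])] = {
  // On Spark: grouped.keys.distinct.collect() -- brings keys to the driver
  val keys = grouped.map(_._1).distinct
  keys.map { k =>
    // On Spark: grouped.filter(_._1 == k).flatMap(_._2) -- one full scan per key
    k -> grouped.filter { case (key, _) => key == k }.flatMap(_._2)
  }
}
```

Each output entry collects every value seen under that key, in encounter order.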

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
> ...to execute on the remote side, which obviously does not have a SparkContext; I think Spark cannot support nested RDDs in a closure. Thanks, Jerry

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
...I think Spark cannot support nested RDDs in a closure. Thanks, Jerry
> Ok, back to Scala code, I'm...

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
> but you cannot avoid scanning the whole data. Basically we need to avoid fetching large amounts of data back to the driver. Thanks, Jerry

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
...avoid fetching large amounts of data back to the driver. Thanks, Jerry
> Hi Saisai, I understand it's non-trivial, but...

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang

RE: RDD to DStream

2014-10-26 Thread Shao, Saisai

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
I have a similar requirement. But instead of grouping it by chunkSize, I would have the timestamp be part of the data. So the function I want has the following signature:

// RDD of (timestamp, value)
def rddToDStream[T](data: RDD[(Long, T)], timeWindow: Long)(implicit ssc: StreamingContext): DStream[T]
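The windowing step of such a function can be sketched locally, hedged: on real Spark each window would become one RDD queued into a DStream, but here the windows are plain sequences so the logic runs without a cluster, and `windowByTime` is a hypothetical name.

```scala
// Hedged local model of rddToDStream's bucketing: group (timestamp, value)
// pairs into consecutive windows of `timeWindow` units, starting at the
// earliest timestamp. On Spark, each returned window would back one RDD.
def windowByTime[T](data: Seq[(Long, T)], timeWindow: Long): Seq[Seq[T]] = {
  if (data.isEmpty) Seq.empty
  else {
    val start = data.map(_._1).min
    data
      .groupBy { case (ts, _) => (ts - start) / timeWindow } // window index
      .toSeq
      .sortBy(_._1)                                          // windows in time order
      .map { case (_, batch) => batch.sortBy(_._1).map(_._2) }
  }
}
```

For example, with a 10 ms window, records at t = 0, 5, and 12 land in two windows.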

Re: RDD to DStream

2014-08-06 Thread Tathagata Das
Hey Aniket, Great thoughts! I understand the use case. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan-based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using pure...

Re: RDD to DStream

2014-08-04 Thread Aniket Bhatnagar
The use case for converting an RDD into a DStream is that I want to simulate a stream from already-persisted data for testing analytics. It is trivial to create an RDD from any persisted data, but not so much for a DStream. Therefore, my idea is to create a DStream from an RDD. For example, let's say you are tryi...

Re: RDD to DStream

2014-08-01 Thread Mayur Rustagi
Nice question :) Ideally you should use a queueStream interface to push RDDs into a queue, and then Spark Streaming can handle the rest. Though why are you looking to convert an RDD to a DStream? Another workaround folks use is to source the DStream from folders and move files that they need reprocessed back into...
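Mayur's queue-based suggestion, modeled locally as a hedged sketch: `drainAsBatches` is a hypothetical helper, and plain sequences stand in for RDDs. The real Spark pattern enqueues RDDs and passes the queue to `StreamingContext.queueStream`, which emits one queued RDD per batch interval.

```scala
import scala.collection.mutable

// Hedged local model of the queueStream pattern: split persisted data into
// batches, enqueue them, and drain one batch per streaming "tick".
// On Spark: val q = mutable.Queue(rdd1, rdd2, ...); ssc.queueStream(q)
def drainAsBatches[T](data: Seq[T], batchSize: Int): Seq[Seq[T]] = {
  val queue = mutable.Queue(data.grouped(batchSize).toSeq: _*)
  val ticks = mutable.Buffer.empty[Seq[T]]
  while (queue.nonEmpty) ticks += queue.dequeue() // one micro-batch per tick
  ticks.toList
}
```

By default `queueStream` dequeues one RDD at a time, which is what the tick loop above mimics.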

Re: RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
Hi everyone, I haven't been receiving replies to my queries on the distribution list. Not pissed, but I am actually curious to know whether my messages are actually going through or not. Can someone please confirm that my messages are getting delivered via this distribution list? Thanks, Aniket On 1 Augus...