Re: Spark Streaming RDD transformation

2014-06-26 Thread Sean Owen
If you want to transform an RDD to a Map, I assume you have an RDD of
pairs. The method collectAsMap() creates a Map from the RDD in this
case.
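A minimal sketch of that (the names here are illustrative, and a local SparkContext is assumed just for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch: collect a pair RDD back to the driver as a Map.
val sparkConf = new SparkConf().setAppName("CollectAsMapSketch").setMaster("local[2]")
val sc = new SparkContext(sparkConf)

val counts = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
// collectAsMap() is only defined on RDDs of key/value pairs
val asMap: Map[String, Int] = counts.collectAsMap().toMap

sc.stop()
```

Note that collectAsMap() pulls the whole RDD to the driver, so it only makes sense for small maps.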

Do you mean that you want to update a Map object using data in each
RDD? You would use foreachRDD() in that case. Inside foreachRDD() you
can collect each RDD to the driver and update a global Map object there.
(Note that RDD.foreach runs on the executors, so on a cluster it would
not update a Map held on the driver.)
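A sketch of that pattern, assuming `stream` is a DStream[(String, Int)] defined elsewhere (the names are illustrative):

```scala
import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// The Map lives on the driver, so each batch is collected to the driver
// before updating it; RDD.foreach would run on the executors instead.
val globalMap = mutable.Map[String, Int]().withDefaultValue(0)

def updateGlobalMap(stream: DStream[(String, Int)]): Unit = {
  stream.foreachRDD { rdd =>
    // Fine as long as each batch's contribution is small enough to collect.
    rdd.collect().foreach { case (k, v) => globalMap(k) += v }
  }
}
```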

Not sure if this is what you mean, but SparkContext.parallelize() can
be used to make an RDD from a List or an Array of objects. That's not
really related to streaming or updating a Map, though.
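For instance (a sketch, assuming `sc` is an existing SparkContext):

```scala
// SparkContext.parallelize turns an in-memory collection into an RDD.
val listRdd = sc.parallelize(List(1, 2, 3, 4))

// There is no direct RDD constructor for a Map, but its entries can be
// parallelized as a collection of pairs:
val m = Map("a" -> 1, "b" -> 2)
val pairRdd = sc.parallelize(m.toSeq)
```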

On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay bill.jaypeter...@gmail.com wrote:
 Hi all,

 I am currently working on a project that requires transforming each RDD in
 a DStream to a Map. Basically, when we get a list of data in each batch, we
 would like to update the global map. I would like to return the map as a
 single RDD.

 I am currently trying to use the function transform. The output will be an
 RDD of the updated map after each batch. How can I create an RDD from
 another data structure such as an Int or a Map? Thanks!

 Bill


Re: Spark Streaming RDD transformation

2014-06-26 Thread Bill Jay
Thanks, Sean!

I am currently using foreachRDD to update the global map using data in each
RDD. The reason I want to return the map as an RDD instead of just updating
the map is that RDD provides many handy methods for output. For example, I
want to save the global map into files in HDFS for each batch in the stream.
In this case, do you have any suggestions on how Spark would let me do that
easily? Thanks!
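One possible sketch of that, under assumptions (the stream type, the HDFS path, and all names here are illustrative, not from this thread): update the driver-side map inside foreachRDD, then parallelize a snapshot of it and write it out, using the batch time to keep each output directory unique:

```scala
import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// Illustrative driver-side accumulator of counts across batches.
val globalMap = mutable.Map[String, Int]().withDefaultValue(0)

def saveMapEachBatch(stream: DStream[(String, Int)]): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // Update the driver-side map from this batch.
    rdd.collect().foreach { case (k, v) => globalMap(k) += v }

    // Snapshot the map, turn it back into an RDD, and write it to HDFS.
    rdd.sparkContext
      .parallelize(globalMap.toSeq)
      .map { case (k, v) => s"$k\t$v" }
      .saveAsTextFile(s"hdfs:///user/bill/global-map-${time.milliseconds}")
  }
}
```

This keeps the map on the driver but still gets RDD output methods like saveAsTextFile for each batch.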


On Thu, Jun 26, 2014 at 12:26 PM, Sean Owen so...@cloudera.com wrote:

 If you want to transform an RDD to a Map, I assume you have an RDD of
 pairs. The method collectAsMap() creates a Map from the RDD in this
 case.

 Do you mean that you want to update a Map object using data in each
 RDD? You would use foreachRDD() in that case. Inside foreachRDD() you
 can collect each RDD to the driver and update a global Map object there.
 (Note that RDD.foreach runs on the executors, so on a cluster it would
 not update a Map held on the driver.)

 Not sure if this is what you mean, but SparkContext.parallelize() can
 be used to make an RDD from a List or an Array of objects. That's not
 really related to streaming or updating a Map, though.

 On Thu, Jun 26, 2014 at 1:40 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi all,

  I am currently working on a project that requires transforming each RDD
  in a DStream to a Map. Basically, when we get a list of data in each
  batch, we would like to update the global map. I would like to return
  the map as a single RDD.

  I am currently trying to use the function transform. The output will be
  an RDD of the updated map after each batch. How can I create an RDD from
  another data structure such as an Int or a Map? Thanks!
 
  Bill