Re: Combining RDD's columns

Jeremy Freeman Fri, 18 Apr 2014 22:37:22 -0700

Hi Ian,

If I understand what you're after, you might find "zip" useful. From the docs:

Zips this RDD with another one, returning key-value pairs with the first 
element in each RDD, second element in each RDD, etc. Assumes that the two RDDs 
have the *same number of partitions* and the *same number of elements in each 
partition* (e.g. one was made through a map on the other).

Here's a toy example:

>> val rdd1 = sc.parallelize(Array("name1", "name2", "name3"), 3)
>> val rdd2 = sc.parallelize(Array("sign1", "sign2", "sign3"), 3)
>> rdd1.collect()
Array[String] = Array(name1, name2, name3)
>> rdd2.collect()
Array[String] = Array(sign1, sign2, sign3)
>> rdd1.zip(rdd2).collect()
Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3))

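In your case, you might compute the name and star-sign RDDs from the same raw 
data with one map each. RDDs derived that way keep the same number of 
partitions and the same number of elements per partition, so they satisfy 
zip's requirement.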
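
As a rough sketch of that idea (not tested against your data): the Person 
case class, the sample values, the ZipColumns object and the combine helper 
are all made up for illustration, and I'm assuming a local master just so the 
snippet runs standalone. Each "column" is a cached map over the same raw RDD, 
so zipping any two of them is safe.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipColumns {
  // Hypothetical record type standing in for the raw data.
  case class Person(name: String, sign: String, age: Int)

  // Builds the two "multi-column" RDDs from the question and collects them.
  def combine(sc: SparkContext): (Array[(String, Int)], Array[(String, String)]) = {
    val raw = sc.parallelize(Seq(
      Person("name1", "sign1", 31),
      Person("name2", "sign2", 42),
      Person("name3", "sign3", 53)), 3)

    // Single-column RDDs, each a map over the same parent and cached
    // so the calculated values are not recomputed on every use.
    val names = raw.map(_.name).cache()
    val signs = raw.map(_.sign).cache()
    val ages  = raw.map(_.age).cache()

    // map preserves the partitioning of `raw`, so zip's assumptions hold.
    val rddA = names.zip(ages)   // {Names, Age}
    val rddB = names.zip(signs)  // {Names, Star Sign}

    (rddA.collect(), rddB.collect())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-columns").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (a, b) = combine(sc)
      a.foreach(println)  // (name1,31), (name2,42), (name3,53)
      b.foreach(println)  // (name1,sign1), (name2,sign2), (name3,sign3)
    } finally {
      sc.stop()
    }
  }
}
```

In the shell you would skip the object and the SparkContext setup and just 
use the existing sc. Note that zip only pairs two RDDs at a time, so a third 
column means zipping again and ending up with nested tuples.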

-- Jeremy

---------------------
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab

On Apr 19, 2014, at 12:59 AM, Ian Ferreira <ianferre...@hotmail.com> wrote:

> 
> This may seem contrived, but suppose I wanted to create a collection of 
> "single column" RDDs that contain calculated values, and I want to cache 
> these to avoid recalculating them.
> 
> i.e.
> 
> rdd1 = {Names}
> rdd2 = {Star Sign}
> rdd3 = {Age}
> 
> Then I want to create a new virtual RDD that is a collection of these RDDs, 
> giving a "multi-column" RDD
> 
> rddA = {Names, Age}
> rddB = {Names, Star Sign}
> 
> I saw that rdd.union() merges rows, but anything that can combine columns?
> 
> Cheers
> - Ian
