Comparing Subsets of an RDD

2016-01-04 Thread Daniel Imberman
Hi,

I’m looking for a way to compare subsets of an RDD intelligently.

 Let's say I had an RDD with key/value pairs of type (Int -> T). I eventually
need to say "compare all values of key 1 with all values of key 2, and
compare the values of key 3 to the values of key 5 and key 7". How would I go
about doing this efficiently?
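
For concreteness, here is a small, purely illustrative example of the data
shape and comparison plan I have in mind (the keys, values, and the choice of
T = String are made up; sc is the SparkContext):

import org.apache.spark.rdd.RDD

// Hypothetical input RDD of (Int, T) pairs, with T = String here
val r: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "a"), (1, "b"),
  (2, "c"),
  (3, "d"),
  (5, "e"), (7, "f")
))

// The comparisons I want to run, expressed as pairs of keys:
// compare key 1 vs key 2, key 3 vs key 5, key 3 vs key 7
val keyPairs: List[(Int, Int)] = List((1, 2), (3, 5), (3, 7))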

The way I’m currently thinking of doing it is by creating a List of
filtered RDDs and then using RDD.cartesian()


def filterSubset[T](b: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
  r.filter { case (key, _) => key == b }

val keyPairs: List[(Int, Int)] = ???   // all key pairs to compare (e.g. as above)

val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}

// then, on each pair RDD:
// rddPairs.map{ whatever I want to compare… }



I would then iterate over the list and run a map over each of the RDDs of
pairs to gather the relational data that I need.
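
A rough sketch of that last step, assuming a placeholder compare function
(its name and signature are made up for illustration):

// Hypothetical comparison between one value from the first key and one from the second
def compare[T](left: T, right: T): Double = ???

// rddPairs is a List of RDD[((Int, T), (Int, T))], one per key pair
val results = rddPairs.map { pairRdd =>
  pairRdd.map { case ((_, leftVal), (_, rightVal)) => compare(leftVal, rightVal) }
}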



What I can't tell about this idea is whether it would be extremely
inefficient to set up possibly hundreds of map jobs and then iterate
through them. In that case, would Spark's lazy evaluation optimize the
data shuffling between all of the maps? If not, can someone please
recommend a more efficient way to approach this issue?


Thank you for your help and apologies if this email sends more than once
(I'm having some issues with the mailing list)

