I was wondering what the best practices are for cogrouping data in
DataFrames.  While joins are obviously a powerful tool, there are still
cases where cogroup (which is only supported on pair RDDs) is the
better choice.  Consider two case classes with the following data:

case class TypeA(id: Int, typeAStr: String)
case class TypeB(id: Int, typeBId: Int, typeBStr: String)

val rddA = ... // contains TypeA(1, "A1")
val rddB = ... // contains TypeB(1, 1, "B1"), TypeB(1, 2, "B2")
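
For concreteness, here's a stand-in setup sketch; in reality the data
comes from elsewhere, so the parallelize calls below are just
placeholders:

val sc: SparkContext = ... // an existing SparkContext
val rddA = sc.parallelize(Seq(TypeA(1, "A1")))
val rddB = sc.parallelize(Seq(TypeB(1, 1, "B1"), TypeB(1, 2, "B2")))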

I basically need all objects (regardless of their class) with the same
"id" grouped together, and cogroup yields exactly what I want:

val cogroupAB = rddA.map(rec => (rec.id, rec)).cogroup(rddB.map(rec => (rec.id, rec)))
// printing this out for key = 1 yields:
// (1,(CompactBuffer(TypeA(1,A1)),CompactBuffer(TypeB(1,1,B1), TypeB(1,2,B2))))
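
Downstream, one can then process each id's (at most one) TypeA together
with all of its TypeB records, with none of the A fields duplicated.
For example (collecting to the driver just for the sake of printing):

cogroupAB.collect().foreach { case (id, (as, bs)) =>
  as.foreach(a => println(s"$id: $a -> ${bs.mkString(", ")}"))
}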


However, suppose instead I have two DataFrames, dfA and dfB, with the
same schema and data as rddA and rddB, respectively; we're using
DataFrames because the data comes from an external source with its own
schema.  In this case there appears to be no cogroup for DataFrames,
only joins:
val joinResult = dfA.join(dfB, dfA("id") === dfB("id"))
// printing this out where id = 1 yields:
// [1,A1,1,1,B1]
// [1,A1,1,2,B2]

While this gives me the data I want, it will create a lot of redundant
data if dfA has many more fields and there are many dfB rows sharing an
id with any given dfA row.  I've seen a Stack Overflow post about the
lack of cogroup for DataFrames (
http://stackoverflow.com/questions/31806473/spark-dataframe-best-way-to-cogroup-dataframes
) but the only solution posted consists of creating RDDs from the
DataFrames, cogrouping them, and then converting the result back to a
DataFrame, which (as the author readily admits) is inefficient.
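
Roughly, that round trip looks like the following sketch (assuming id
is the first column; getting back to a DataFrame would additionally
need sqlContext.createDataFrame with a hand-built schema):

val keyedA = dfA.rdd.map(row => (row.getInt(0), row))
val keyedB = dfB.rdd.map(row => (row.getInt(0), row))
val cogroupedRows = keyedA.cogroup(keyedB) // RDD[(Int, (Iterable[Row], Iterable[Row]))]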

Thus, in the case of cogrouping DataFrames, would I be best off to:
1. run the join, then manually build a DataFrame that removes the
redundant data (one possible variant is sketched just after this list)
2. run a solution similar to the one in the Stack Overflow post
3. use some other method (maybe this is an idea for an API enhancement?)
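
For #1, one variant I've sketched pre-aggregates dfB per id so that the
join never materializes the redundant copies of dfA's columns in the
first place.  This assumes a Spark version where collect_list over a
struct is supported (the column names are the ones from the example
above):

import org.apache.spark.sql.functions.{collect_list, struct}

// collapse dfB to one row per id, with all B records in an array column
val dfBGrouped = dfB
  .groupBy("id")
  .agg(collect_list(struct("typeBId", "typeBStr")).as("bRecords"))

// each dfA row now appears exactly once, next to its array of B records
val cogroupLike = dfA.join(dfBGrouped, dfA("id") === dfBGrouped("id"), "left_outer")
// for id = 1 this yields one row along the lines of:
// [1,A1,1,WrappedArray([1,B1],[2,B2])]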

I'd prefer not to convert the DataFrames to RDDs permanently, as I'd
like to keep all the nice properties of DataFrames.  If #3 is an option
and requires an enhancement, I'd be interested in discussing it further
and possibly contributing.

thanks, Matt   


