I was wondering what the best practices are for cogrouping data in DataFrames. While joins are obviously a powerful tool, there still seem to be cases where cogroup (which is only supported by PairRDDs) is the better choice. Consider two case classes with the following data:
    case class TypeA(id: Int, typeAStr: String)
    case class TypeB(id: Int, typeBId: Int, typeBStr: String)

    val rddA = ... // contains TypeA(1, "A1")
    val rddB = ... // contains TypeB(1, 1, "B1"), TypeB(1, 2, "B2")

I basically need all objects (regardless of their class) with the same "id" grouped together. Cogroup yields exactly what I want:

    val cogroupAB = rddA.map(rec => (rec.id, rec))
      .cogroup(rddB.map(rec => (rec.id, rec)))
    // printing this out for key = 1 yields:
    // (1,(CompactBuffer(TypeA(1,A1)),CompactBuffer(TypeB(1,1,B1), TypeB(1,2,B2))))

However, suppose instead I have two DataFrames, dfA and dfB, with the same schemas and data as rddA and rddB, respectively, but we're using DataFrames because the data comes from an external source with its own schema. In this case, there appears to be no cogroup between DataFrames, only joins:

    val joinResult = dfA.join(dfB, dfA("id") === dfB("id"))
    // printing this out where id = 1 yields:
    // [1,A1,1,1,B1]
    // [1,A1,1,2,B2]

While this gives me the data that I want, it will create a lot of redundant data if dfA has many more fields and there are many dfB rows with the same id as any given dfA row.

I've seen a Stack Overflow post about the lack of cogroup for DataFrames ( http://stackoverflow.com/questions/31806473/spark-dataframe-best-way-to-cogroup-dataframes ), but the only solution posted consists of creating RDDs from the DataFrames, cogrouping them, then converting them back to DataFrames, which (as the author readily admits) is inefficient.

Thus, in the case of cogrouping DataFrames, would I be best off to:

1. run the join, then manually build a DataFrame that removes the redundant data
2. use a solution similar to the one in the Stack Overflow post
3. use some other method (maybe this is an idea for an API enhancement?)

I'd prefer not to convert the DataFrames to RDDs permanently, as I'd like to keep all the nice properties of DataFrames.
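For what it's worth, here is one way option #1 could look in practice. This is only a sketch, assuming Spark's built-in collect_list and struct functions are available (Spark 1.6 with a HiveContext, or 2.0+), and it uses the hypothetical dfA/dfB from above: pre-aggregating dfB per id before the join means the wide dfA columns are never duplicated, which approximates cogroup's output shape.

```scala
import org.apache.spark.sql.functions.{collect_list, struct}

// Pack each dfB row into a struct, then collect all matching rows per id
// into a single array column, so the subsequent join emits exactly one
// output row per dfA row instead of one per (dfA row, dfB row) pair.
val dfBGrouped = dfB
  .groupBy("id")
  .agg(collect_list(struct("typeBId", "typeBStr")).as("typeBRecs"))

// left_outer keeps dfA rows that have no dfB matches (typeBRecs is null).
val cogroupLike = dfA.join(dfBGrouped, Seq("id"), "left_outer")
// one output row per dfA row, with all matching dfB rows in the typeBRecs array
```

This stays entirely inside the DataFrame API, so Catalyst can still optimize the plan, at the cost of a shuffle for the groupBy in addition to the join.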
If #3 is an option and requires an enhancement, I would be interested in discussing further and possibly contributing.

thanks,
Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org