If you create a PairRDD from the DataFrame, using
dataFrame.toJavaRDD().mapToPair(), then you can call
partitionBy(someCustomPartitioner), which will partition the RDD by the key
(of the pair).
Operations on it afterwards (like joining with another RDD) will take
this partitioning into account.
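Conceptually, partitionBy with a hash partitioner just routes every (key, value) pair to a partition by hashing the key, so all records sharing a key end up in the same partition. A plain-Python sketch of that idea (the function name here is made up for illustration; it is not Spark API):

```python
def hash_partition(pairs, num_partitions):
    # Mimics a HashPartitioner: partition index = hash(key) % num_partitions.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
parts = hash_partition(pairs, 2)
# All pairs with key 1 land in the same partition, which is what lets
# a later join on the key avoid re-shuffling that side.
```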
I'm not sure that
Hi Hemant,
Thank you for your reply.
I think the source of my dataframe is not partitioned on the key; it's an Avro
file where 'id' is a field. But I don't know how to read a file and, at
the same time, configure the partition key. I couldn't find anything on
SQLContext.read.load where you can set
I have two dataframes like this
student_rdf = (studentid, name, ...)
student_result_rdf = (studentid, gpa, ...)
We need to join these two dataframes. We are currently doing it like this:
student_rdf.join(student_result_rdf, student_result_rdf["studentid"]
== student_rdf["studentid"])
So it is simple.
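What that equality join computes can be sketched in plain Python as a hash join: build a lookup on one side, then probe it with the other. The sample rows below are hypothetical, just mirroring the two dataframes' shapes:

```python
# Hypothetical rows mirroring the two dataframes above.
students = [(1, "alice"), (2, "bob")]          # (studentid, name)
results = [(1, 3.9), (2, 3.4), (3, 2.8)]       # (studentid, gpa)

# Build side: index students by studentid.
by_id = {}
for sid, name in students:
    by_id.setdefault(sid, []).append(name)

# Probe side: an inner join on studentid, as the dataframe join computes.
joined = [(sid, name, gpa)
          for sid, gpa in results
          for name in by_id.get(sid, [])]
# Rows with no match on the other side (studentid 3) are dropped.
```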
Is the source of your dataframe partitioned on key? As per your mail, it
looks like it is not. If that is the case, for partitioning the data, you
will have to shuffle the data anyway.
Another part of your question is - how to co-group data from two dataframes
based on a key? I think for RDDs