Ah, unfortunately that is not possible today, as Catalyst has a logical notion of partitioning that is different from the one exposed by the RDD API.
A researcher at Databricks is considering allowing this kind of optimization for in-memory cached relations, though. Here is a WIP patch:
https://github.com/yhuai/spark/commit/aa4448ac801f833f7c8217fdce30cba9c803a877

On Fri, May 8, 2015 at 4:06 PM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

> Just trying to make sure that something I know in advance (the joins
> will always have an equality test on one specific field) is used to
> optimize the partitioning so the joins only use local data.
>
> Thanks for the info.
>
> Ron
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, May 08, 2015 3:15 PM
> *To:* Daniel, Ronald (ELS-SDG)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Hash Partitioning and Dataframes
>
> What are you trying to accomplish? Internally Spark SQL will add Exchange
> operators to make sure that data is partitioned correctly for joins and
> aggregations. If you are going to do other RDD operations on the result of
> dataframe operations and you need to manually control the partitioning,
> call df.rdd and partition as you normally would.
>
> On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
>
>> Hi,
>>
>> How can I ensure that a batch of DataFrames I make are all partitioned
>> based on the value of one column common to them all?
>> For RDDs I would partitionBy a HashPartitioner, but I don't see that in
>> the DataFrame API.
>> If I partition the RDDs that way, then do a toDF(), will the partitioning
>> be preserved?
>>
>> Thanks,
>> Ron
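For readers following the thread: the property Ron is after is what Spark's `HashPartitioner` provides on plain RDDs. The sketch below is not Spark code at all, just a minimal plain-Python simulation (the `hash_partition` helper and the sample data are invented for illustration) of the underlying idea: every key goes to partition `hash(key) % numPartitions`, so two datasets partitioned the same way place equal keys at the same partition index, which is what lets an equality join run on local data only.

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) pair to partition hash(key) % num_partitions,
    mimicking the semantics of Spark's HashPartitioner.

    Note: Python's str hash is salted per process, so the exact partition
    indices vary between runs; the co-partitioning property below holds
    within any single run, which is all the example relies on."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        p = hash(key) % num_partitions
        partitions[p].append((key, value))
    return partitions

if __name__ == "__main__":
    left = [("a", 1), ("b", 2), ("c", 3)]
    right = [("a", 10), ("b", 20), ("c", 30)]

    left_parts = hash_partition(left, 4)
    right_parts = hash_partition(right, 4)

    # Because both sides used the same partitioner and partition count,
    # equal keys land at the same partition index in both datasets, so an
    # equi-join can proceed partition-by-partition with no shuffle.
    for lp, rp in zip(left_parts, right_parts):
        local_join = [(k, lv, rv) for k, lv in lp for k2, rv in rp if k == k2]
        if local_join:
            print(local_join)
```

This is exactly the guarantee that, per Michael's reply, does not carry over through `toDF()`: Spark SQL tracks partitioning in Catalyst's own terms and will insert its own Exchange (shuffle) operators for joins, regardless of how the source RDD was partitioned.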