Ah, unfortunately that is not possible today, as Catalyst has a logical
notion of partitioning that is different from the one exposed by the RDD API.

A researcher at Databricks is considering allowing this kind of
optimization for in-memory cached relations, though.  Here is a WIP patch:
https://github.com/yhuai/spark/commit/aa4448ac801f833f7c8217fdce30cba9c803a877

On Fri, May 8, 2015 at 4:06 PM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

>  I'm just trying to make sure that something I know in advance (the joins
> will always have an equality test on one specific field) is used to
> optimize the partitioning, so that the joins only use local data.
>
>
>
> Thanks for the info.
>
>
>
> Ron
>
>
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, May 08, 2015 3:15 PM
> *To:* Daniel, Ronald (ELS-SDG)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Hash Partitioning and Dataframes
>
>
>
> What are you trying to accomplish?  Internally Spark SQL will add Exchange
> operators to make sure that data is partitioned correctly for joins and
> aggregations.  If you are going to do other RDD operations on the result of
> DataFrame operations and you need to manually control the partitioning,
> call df.rdd and partition as you normally would.
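>
> A minimal sketch of that workaround, assuming a DataFrame `df` with a
> join key column "id" (both names are illustrative, not from this thread):
>
>   import org.apache.spark.HashPartitioner
>
>   // df.rdd yields an RDD[Row]; key it by the join column and
>   // partition explicitly, as with any pair RDD.
>   val byKey = df.rdd
>     .map(row => (row.getAs[String]("id"), row))
>     .partitionBy(new HashPartitioner(200))  // partition count is arbitrary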
>
>
>
> On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) <
> r.dan...@elsevier.com> wrote:
>
> Hi,
>
> How can I ensure that a batch of DataFrames I create are all partitioned
> on the value of one column common to them all?
> For RDDs I would call partitionBy with a HashPartitioner, but I don't see
> an equivalent in the DataFrame API.
> If I partition the RDDs that way and then call toDF(), will the
> partitioning be preserved?
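>
> For concreteness, the RDD pattern referred to here might look like this
> (assuming the spark-shell's sqlContext; the Doc case class and column
> names are made up for illustration):
>
>   import org.apache.spark.HashPartitioner
>   import sqlContext.implicits._
>
>   case class Doc(id: String, text: String)
>
>   // Key by the join column, hash-partition, then drop the key again.
>   val partitioned = docs                      // docs: RDD[Doc]
>     .map(d => (d.id, d))
>     .partitionBy(new HashPartitioner(200))
>     .values
>
>   // Per the reply above, Catalyst does not pick up the RDD's
>   // partitioner when the DataFrame is created.
>   val df = partitioned.toDF()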
>
> Thanks,
> Ron
>
>
>
