Please keep in mind that I'm fairly new to Spark.
I have some Spark code where I load two text files as Datasets and, after
some map and filter operations to bring the columns into a specific shape,
I join the Datasets.
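
For concreteness, here is roughly what that looks like (a minimal sketch;
the file paths, delimiter, and column names are placeholders, not my
actual code):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("join-example").getOrCreate()
  import spark.implicits._

  // Load and shape both sides; the parsing here is a stand-in for the
  // real map/filter logic.
  val left = spark.read.textFile("left.txt")
    .map(_.split(","))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))
    .toDF("key", "leftVal")

  val right = spark.read.textFile("right.txt")
    .map(_.split(","))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))
    .toDF("key", "rightVal")

  // Joining on the common string column puts an Exchange (shuffle) on
  // both sides of the physical plan.
  val joined = left.join(right, "key")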

The join takes place on a common column (of type string).
Is there any way to avoid the exchange/shuffle before the join?

As I understand it, the idea is that if I initially hash-partition both
datasets on the join column, matching keys are guaranteed to land in
corresponding partitions, so the join only has to look within those
partitions, avoiding a shuffle.

In the RDD API, you can create a HashPartitioner and use partitionBy when
creating the RDDs (though I'm not sure this is a guaranteed way to avoid
the shuffle on the join); a sketch of what I mean follows below. Is there
a similar method in the DataFrame/Dataset API?

I would also like to avoid repartition, repartitionByRange, and bucketing,
since I only intend to do one join and those techniques also require a
shuffle (or a bucketed write) beforehand. For reference, the kind of thing
I'm trying to avoid is sketched below.
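
(Again just a sketch, reusing the hypothetical left/right Datasets from
above, to show the techniques I'm ruling out:)

  import org.apache.spark.sql.functions.col

  // repartition: removes the Exchange from the join itself, but the
  // repartition is itself a full shuffle of both sides.
  val leftPart  = left.repartition(col("key"))
  val rightPart = right.repartition(col("key"))
  val joined2   = leftPart.join(rightPart, "key")

  // Bucketing: a one-off bucketed write that later joins can reuse,
  // which only pays off when the same tables are joined repeatedly.
  left.write.bucketBy(8, "key").saveAsTable("left_bucketed")
  right.write.bucketBy(8, "key").saveAsTable("right_bucketed")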
