Please keep in mind I'm fairly new to Spark. I have some Spark code where I load two text files as Datasets and, after some map and filter operations to bring the columns into a specific shape, I join the Datasets.
The join takes place on a common column (of type string). Is there any way to avoid the exchange/shuffle before the join? As I understand it, the idea is that if I initially hash-partition the Datasets on the join column, then the join only has to look within matching partitions to complete, thus avoiding a shuffle. In the RDD API, you can create a HashPartitioner and use partitionBy when creating the RDDs (though I'm not sure whether this is a guaranteed way to avoid the shuffle on the join). Is there any similar method for the DataFrame/Dataset API? I would also like to avoid repartition, repartitionByRange, and bucketing techniques, since I only intend to do one join and these also require shuffling beforehand.
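For reference, here is a minimal sketch of the RDD-API approach I mean (the file names, delimiter, and partition count of 8 are just placeholders for illustration):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Key each RDD by the join column (here assumed to be the first CSV field).
val left  = sc.textFile("left.txt").map(line => (line.split(",")(0), line))
val right = sc.textFile("right.txt").map(line => (line.split(",")(0), line))

// Partition both RDDs with the SAME partitioner and persist them, so both
// sides end up co-partitioned by the join key.
val partitioner = new HashPartitioner(8)
val leftPart  = left.partitionBy(partitioner).persist()  // shuffles here once
val rightPart = right.partitionBy(partitioner).persist() // and here once

// Since both sides share the partitioner, the join itself is a narrow
// dependency and does not shuffle again.
val joined = leftPart.join(rightPart)
```

Note that the partitionBy calls themselves shuffle, which is exactly what I'm hoping to avoid for a single one-off join; this is why I'm asking whether the DataFrame/Dataset API offers anything different.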