Lekshmi Nair created SPARK-27321:
------------------------------------

             Summary: SortMerge join on partition keys causes shuffle
                 Key: SPARK-27321
                 URL: https://issues.apache.org/jira/browse/SPARK-27321
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: Lekshmi Nair

We have two datasets that have been persisted as follows:

Dataset A:

    datasetA.repartition(5, datasetA.col("region"))
        .write().mode(saveMode)
        .format("parquet")
        .partitionBy("region")
        .bucketBy(5, "studentId")
        .sortBy("studentId")
        .option("path", parquetFilesDirectory)
        .saveAsTable(database.tableA);

Dataset B:

    datasetB.repartition(5, datasetB.col("region"))
        .write().mode(saveMode)
        .format("parquet")
        .partitionBy("region")
        .bucketBy(5, "studentId")
        .sortBy("studentId")
        .option("path", parquetFilesDirectory)
        .saveAsTable(database.tableB);

If we join on only the bucketed column "studentId", there is NO shuffle, as expected. When we join on both "region" and "studentId", we see a data shuffle. Below is the join query:

    spark.sql("Select * from database.tableA")
        .join(spark.sql("Select * from database.tableB"), Seq("studentId", "region"))
        .show(10)

Note: We cannot use the partition key as a bucket column.