Lekshmi Nair created SPARK-27321:
------------------------------------

             Summary: SortMerge join on partition keys causes shuffle
                 Key: SPARK-27321
                 URL: https://issues.apache.org/jira/browse/SPARK-27321
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: Lekshmi Nair


We got two datasets thats been persisted as follows: 

Dataset A: 
datasetA.repartition(5,datasetA.col("region")) 
                .write().mode(saveMode) 
                .format("parquet") 
                .partitionBy("region") 
                .bucketBy(5,"studentId") 
                .sortBy("studentId") 
                .option("path", parquetFilesDirectory) 
                .saveAsTable( database.tableA)); 

Dataset B: 
datasetB.repartition(5,datasetB.col("region")) 
                .write().mode(saveMode) 
                .format("parquet") 
                .partitionBy("region") 
                .bucketBy(5,"studentId") 
                .sortBy("studentId") 
                .option("path", parquetFilesDirectory) 
                .saveAsTable( database.tableB)); 

 If we do join just with the bucketed column "studentId", there is NO shuffle 
as expected. 

When we join with region and studentId ,we see data shuffle.Below is the join 
query. 

spark.sql("Select *  from  database.tableA").join(spark.sql("Select *  from   
database.tableB "), Seq("studentId","region")).show(10) 

Note: We cannot use the partition key as a bucket column



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to