Franck Tago created SPARK-17859: ----------------------------------- Summary: persist should not impede with spark's ability to perform a broadcast join. Key: SPARK-17859 URL: https://issues.apache.org/jira/browse/SPARK-17859 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.0.0 Environment: spark 2.0.0 , Linux RedHat Reporter: Franck Tago
I am using Spark 2.0.0 My investigation leads me to conclude that calling persist could prevent broadcast join from happening . Example Case1: No persist call var df1 =spark.range(1000000).select($"id".as("id1")) df1: org.apache.spark.sql.DataFrame = [id1: bigint] var df2 =spark.range(1000).select($"id".as("id2")) df2: org.apache.spark.sql.DataFrame = [id2: bigint] df1.join(df2 , $"id1" === $"id2" ).explain == Physical Plan == *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight :- *Project [id#114L AS id1#117L] : +- *Range (0, 1000000, splits=2) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *Project [id#120L AS id2#123L] +- *Range (0, 1000, splits=2) Case 2: persist call df1.persist.join(df2 , $"id1" === $"id2" ).explain 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data. == Physical Plan == *SortMergeJoin [id1#3L], [id2#9L], Inner :- *Sort [id1#3L ASC], false, 0 : +- Exchange hashpartitioning(id1#3L, 10) : +- InMemoryTableScan [id1#3L] : : +- InMemoryRelation [id1#3L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : : : +- *Project [id#0L AS id1#3L] : : : +- *Range (0, 1000000, splits=2) +- *Sort [id2#9L ASC], false, 0 +- Exchange hashpartitioning(id2#9L, 10) +- InMemoryTableScan [id2#9L] : +- InMemoryRelation [id2#9L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : : +- *Project [id#6L AS id2#9L] : : +- *Range (0, 1000, splits=2) Why does the persist call prevent the broadcast join . My opinion is that it should not . I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not . Losing broadcast joins is really costly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org