Anton Uvarov created SPARK-45198:
------------------------------------

             Summary: problem using broadcast join with parquet/iceberg tables
                 Key: SPARK-45198
                 URL: https://issues.apache.org/jira/browse/SPARK-45198
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 3.4.1
            Reporter: Anton Uvarov


We have 2 Parquet tables: load_test_full_warehouse.gen_document_type and 
load_test_full_warehouse.generation_document_part.
Trying to make a left join of these two tables onto each other gives a strange 
result. In the case where on the left side of the join we use a large table 
load_test_full_warehouse.generation_document_part, the optimizer uses a 
broadcast join.
However, in the case when on the left in the join we use a small reference 
table, the optimizer chooses to execute the query using the merge sort. 
Although it would seem that the small table on the left in a left join should 
initiate a broadcast join.
  An attempt to use hints and collect statistics did not yield results. The 
following queries were used:

spark.sql(f"""create table iceberg_warehouse.t1 using iceberg 
              as SELECT /*+ BROADCAST(doc_tp) */
                doc.DOCUMENT_DATE
                , doc_tp.NAME as DOCUMENT_TYPE
                , COUNT(*) as DOC_QTY
              FROM load_test_full_warehouse.generation_document_part doc
              LEFT JOIN load_test_full_warehouse.gen_document_type doc_tp ON 
doc.DOCUMENT_TYPE_ID_INT = doc_tp.DOCUMENT_TYPE_ID_INT
              GROUP BY doc.DOCUMENT_DATE, doc_tp.NAME""")

 

spark.sql(f"""create table iceberg_warehouse.t2 using iceberg 
              as SELECT /*+ BROADCAST(doc_tp) */
                doc.DOCUMENT_DATE
                , doc_tp.NAME as DOCUMENT_TYPE
                , COUNT(*) as DOC_QTY
              FROM load_test_full_warehouse.gen_document_type doc_tp
              LEFT JOIN load_test_full_warehouse.generation_document_part doc 
ON doc.DOCUMENT_TYPE_ID_INT = doc_tp.DOCUMENT_TYPE_ID_INT
              GROUP BY doc.DOCUMENT_DATE, doc_tp.NAME""")

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to