Takeshi Yamamuro created SPARK-32564:
----------------------------------------

             Summary:  Inject data statistics to simulate plan generation on 
actual TPCDS data
                 Key: SPARK-32564
                 URL: https://issues.apache.org/jira/browse/SPARK-32564
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Tests
    Affects Versions: 3.1.0
            Reporter: Takeshi Yamamuro


`TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then checks 
if plans can be generated correctly. But, the generated plans can be different 
from actual ones because the input tables are empty (e.g., the plans always use 
broadcast-hash joins, but actual ones use sort-merge joins for larger tables). 
To mitigate the issue, this ticket targets at defining data statistics 
constants extracted from generated TPCDS data in `TPCDSTableStats`, then 
injects the statistics via `spark.sessionState.catalog.alterTableStats` when 
defining TPCDS tables in `TPCDSQuerySuite`.

Please see a link below about how to extract the table statistics:
 - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3

For example, the generated plans of TPCDS `q2` are different with/without this 
fix:
{code:java}
==== w/ this fix: q2 ====
== Physical Plan ==
* Sort (43)
+- Exchange (42)
 +- * Project (41)
 +- * SortMergeJoin Inner (40)
 :- * Sort (28)
 : +- Exchange (27)
 : +- * Project (26)
 : +- * BroadcastHashJoin Inner BuildRight (25)
 : :- * HashAggregate (19)
 : : +- Exchange (18)
 : : +- * HashAggregate (17)
 : : +- * Project (16)
 : : +- * BroadcastHashJoin Inner BuildRight (15)
 : : :- Union (9)
 : : : :- * Project (4)
 : : : : +- * Filter (3)
 : : : : +- * ColumnarToRow (2)
 : : : : +- Scan parquet default.web_sales (1)
 : : : +- * Project (8)
 : : : +- * Filter (7)
 : : : +- * ColumnarToRow (6)
 : : : +- Scan parquet default.catalog_sales (5)
 : : +- BroadcastExchange (14)
 : : +- * Project (13)
 : : +- * Filter (12)
 : : +- * ColumnarToRow (11)
 : : +- Scan parquet default.date_dim (10)
 : +- BroadcastExchange (24)
 : +- * Project (23)
 : +- * Filter (22)
 : +- * ColumnarToRow (21)
 : +- Scan parquet default.date_dim (20)
 +- * Sort (39)
 +- Exchange (38)
 +- * Project (37)
 +- * BroadcastHashJoin Inner BuildRight (36)
 :- * HashAggregate (30)
 : +- ReusedExchange (29)
 +- BroadcastExchange (35)
 +- * Project (34)
 +- * Filter (33)
 +- * ColumnarToRow (32)
 +- Scan parquet default.date_dim (31)
==== w/o this fix: q2 ====
== Physical Plan ==
* Sort (40)
+- Exchange (39)
 +- * Project (38)
 +- * BroadcastHashJoin Inner BuildRight (37)
 :- * Project (26)
 : +- * BroadcastHashJoin Inner BuildRight (25)
 : :- * HashAggregate (19)
 : : +- Exchange (18)
 : : +- * HashAggregate (17)
 : : +- * Project (16)
 : : +- * BroadcastHashJoin Inner BuildRight (15)
 : : :- Union (9)
 : : : :- * Project (4)
 : : : : +- * Filter (3)
 : : : : +- * ColumnarToRow (2)
 : : : : +- Scan parquet default.web_sales (1)
 : : : +- * Project (8)
 : : : +- * Filter (7)
 : : : +- * ColumnarToRow (6)
 : : : +- Scan parquet default.catalog_sales (5)
 : : +- BroadcastExchange (14)
 : : +- * Project (13)
 : : +- * Filter (12)
 : : +- * ColumnarToRow (11)
 : : +- Scan parquet default.date_dim (10)
 : +- BroadcastExchange (24)
 : +- * Project (23)
 : +- * Filter (22)
 : +- * ColumnarToRow (21)
 : +- Scan parquet default.date_dim (20)
 +- BroadcastExchange (36)
 +- * Project (35)
 +- * BroadcastHashJoin Inner BuildRight (34)
 :- * HashAggregate (28)
 : +- ReusedExchange (27)
 +- BroadcastExchange (33)
 +- * Project (32)
 +- * Filter (31)
 +- * ColumnarToRow (30)
 +- Scan parquet default.date_dim (29)
 {code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to