[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17626: - Target Version/s: 2.3.0 (was: 2.2.0) > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time1,467 > Min time3 > Mean time 145 > Geomean44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. Selectivity hint is optional and it was not used in the above TPC-DS > tests. > \\ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17626: Target Version/s: 2.2.0 > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time1,467 > Min time3 > Mean time 145 > Geomean44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. Selectivity hint is optional and it was not used in the above TPC-DS > tests. > \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-17626: -- Description: *TPC-DS performance improvements using star-schema heuristics* \\ \\ TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business. \\ \\ As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. \\ \\ *Summary of the results:* {code} Passed 99 Failed 0 Total q time (s) 14,962 Max time1,467 Min time3 Mean time 145 Geomean44 {code} *Compared to baseline* (Negative = improvement; Positive = Degradation): {code} End to end improved (%) -19% Mean time improved (%) -19% Geomean improved (%) -24% End to end improved (seconds) -3,603 Number of queries improved (>10%) 45 Number of queries degraded (>10%) 6 Number of queries unchanged48 Top 10 queries improved (%) -20% {code} Cluster: 20-node cluster with each node having: * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. * Total memory for the cluster: 2.5TB * Total storage: 400TB * Total CPU cores: 480 Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA Database info: * Schema: TPCDS * Scale factor: 1TB total space * Storage format: Parquet with Snappy compression Our investigation and results are included in the attached document. There are two parts to this improvement: # Join reordering using star schema detection # New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests. \\ was: *TPC-DS performance improvements using star-schema heuristics* \\ \\ TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business. \\ \\ As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. \\ \\ *Summary of the results:* {code} Passed 99 Failed 0 Total q time (s) 14,962 Max time1,467 Min time3 Mean time 145 Geomean44 {code} *Compared to baseline* (Negative = improvement; Positive = Degradation): {code} End to end improved (%) -19% Mean time improved (%) -19% Geomean improved (%) -24% End to end improved (seconds) -3,603 Number of queries improved (>10%) 45 Number of queries degraded (>10%) 6 Number of queries unchanged48 Top 10 queries improved (%) -20% {code} Cluster: 20-node cluster with each node having: * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. * Total memory for the cluster: 2.5TB * Total storage: 400TB * Total CPU cores: 480 Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA Database info: * Schema: TPCDS * Scale factor: 1TB total space * Storage format: Parquet with Snappy compression Our investigation and results are included in the attached document. There are two parts to this improvement: # Join reordering using star schema detection # New selectivity hint to specify the selectivity of the predicates over base tables. \\ \\ > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https
[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-17626: -- Attachment: StarSchemaJoinReordering.pptx > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time1,467 > Min time3 > Mean time 145 > Geomean44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. > \\ > \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org