[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17626:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17626:

Target Version/s: 2.2.0

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables. Selectivity hint is optional and it was not used in the above TPC-DS 
> tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-09-21 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-17626:
--
Description: 
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed 99
Failed  0
Total q time (s)   14,962
Max time1,467
Min time3
Mean time 145
Geomean44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)  -19%   
Mean time improved (%)   -19%
Geomean improved (%) -24%
End to end improved (seconds)  -3,603
Number of queries improved (>10%)  45
Number of queries degraded (>10%)   6
Number of queries unchanged48
Top 10 queries improved (%)  -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables. Selectivity hint is optional and it was not used in the above TPC-DS 
tests. 
\\



  was:
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed 99
Failed  0
Total q time (s)   14,962
Max time1,467
Min time3
Mean time 145
Geomean44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)  -19%   
Mean time improved (%)   -19%
Geomean improved (%) -24%
End to end improved (seconds)  -3,603
Number of queries improved (>10%)  45
Number of queries degraded (>10%)   6
Number of queries unchanged48
Top 10 queries improved (%)  -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables.
\\
\\




> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https

[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-09-21 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-17626:
--
Attachment: StarSchemaJoinReordering.pptx

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables.
> \\
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org