[jira] [Commented] (SPARK-48483) Allow UnsafeExternalSorter to spill when other consumer requests memory

2024-06-06 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853004#comment-17853004
 ] 

Ke Jia commented on SPARK-48483:


[~cloud_fan] [~ulyssesyou] Do you have any suggestions for this proposal? Thanks
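
For context, here is a minimal Scala sketch of where the proposed change would hook into Spark's task-memory model. It is illustrative only: the class name SortLikeConsumer, the flag allowSpillForOtherConsumers, and the helper spillInMemoryData() are hypothetical, and the "current behaviour" branch paraphrases how an UnsafeExternalSorter-style consumer reacts to spill requests from other consumers rather than quoting its actual source.

{code:java}
import java.io.IOException
import org.apache.spark.memory.{MemoryConsumer, MemoryMode, TaskMemoryManager}

// Hypothetical consumer used only to illustrate the proposal. When another
// consumer cannot acquire enough execution memory, TaskMemoryManager asks the
// other consumers of the same task to spill via spill(size, trigger).
class SortLikeConsumer(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), MemoryMode.OFF_HEAP) {

  // Hypothetical switch corresponding to the behaviour this ticket proposes.
  private val allowSpillForOtherConsumers = true

  @throws[IOException]
  override def spill(size: Long, trigger: MemoryConsumer): Long = {
    if ((trigger ne this) && !allowSpillForOtherConsumers) {
      // Roughly the current behaviour: a request coming from another consumer
      // (e.g. Gluten's native memory target) is ignored, so memory such as the
      // 12.4 GiB held by the sorter in the quoted log below cannot be reclaimed.
      return 0L
    }
    // Proposed behaviour: also release the in-memory buffer when some other
    // consumer of the same task needs memory.
    spillInMemoryData()
  }

  // Placeholder: write the buffered records to disk and free their pages.
  private def spillInMemoryData(): Long = 0L
}
{code}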

> Allow UnsafeExternalSorter to spill when other consumer requests memory
> ---
>
> Key: SPARK-48483
> URL: https://issues.apache.org/jira/browse/SPARK-48483
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
> Environment: Ubuntu
>Reporter: Jin Chengcheng
>Priority: Major
> Fix For: 4.0.0
>
>
> The downstream Gluten (native Spark engine) hits an OOM exception.
>  
> {code:java}
> 24/04/27 11:42:59 ERROR [Executor task launch worker for task 403.0 in stage 4.0 (TID 91404)] nmm.ManagedReservationListener: Error reserving memory from target
> org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 40.0 MiB, granted: 8.0 MiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application.
> Current config settings: 
>   spark.gluten.memory.offHeap.size.in.bytes=50.0 GiB
>   spark.gluten.memory.task.offHeap.size.in.bytes=12.5 GiB
>   spark.gluten.memory.conservative.task.offHeap.size.in.bytes=6.3 GiB
> Memory consumer stats: 
>   Task.91404:  Current used bytes: 12.5 GiB, peak bytes: N/A
>   +- org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@a7836d4:  Current used bytes: 12.4 GiB, peak bytes: N/A
>   \- Gluten.Tree.194:  Current used bytes: 56.0 MiB, peak bytes: 11.7 GiB
>      \- root.194:  Current used bytes: 56.0 MiB, peak bytes: 11.7 GiB
>         +- WholeStageIterator.194:  Current used bytes: 32.0 MiB, peak bytes: 9.0 GiB
>         |  \- single:  Current used bytes: 23.0 MiB, peak bytes: 9.0 GiB
>         |     +- task.Gluten_Stage_4_TID_91404:  Current used bytes: 23.0 MiB, peak bytes: 9.0 GiB
>         |     |  +- node.3:  Current used bytes: 21.0 MiB, peak bytes: 9.0 GiB
>         |     |  |  +- op.3.1.0.HashBuild:  Current used bytes: 10.8 MiB, peak bytes: 8.5 GiB
>         |     |  |  \- op.3.0.0.HashProbe:  Current used bytes: 9.2 MiB, peak bytes: 21.6 MiB
>         |     |  +- node.5:  Current used bytes: 1024.0 KiB, peak bytes: 2.0 MiB
>         |     |  |  \- op.5.0.0.FilterProject:  Current used bytes: 129.4 KiB, peak bytes: 1232.0 KiB
>         |     |  +- node.2:  Current used bytes: 1024.0 KiB, peak bytes: 1024.0 KiB
>         |     |  |  \- op.2.1.0.FilterProject:  Current used bytes: 128.4 KiB, peak bytes: 192.4 KiB
>         |     |  +- node.1:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     |  |  \- op.1.1.0.ValueStream:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     |  +- node.0:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     |  |  \- op.0.0.0.ValueStream:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     |  \- node.4:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     |     \- op.4.0.0.FilterProject:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         |     \- WholeStageIterator_default_leaf:  Current used bytes: 0.0 B, peak bytes: 0.0 B
>         +- ArrowContextInstance.0:  Current used bytes: 8.0 MiB, peak bytes: 8.0 MiB
>         +- ColumnarToRow.2:  Current used bytes: 8.0 MiB, peak bytes: 16.0 MiB
>         |  \- single: 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-22 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: 
SPIP doc: 
[https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]

 

Refined SPIP doc:

[https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit#heading=h.ud7930xhlsm6]

 

This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.
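
Illustrative only: the following Scala sketch shows roughly what the interface family described above could look like. The trait names come from this SPIP, but the method names (doValidate, doTransform) and the SubstraitContext/TransformContext placeholders are assumptions made for the sake of the example, not Gluten's actual API.

{code:java}
import org.apache.spark.sql.execution.SparkPlan

// Placeholder types standing in for the Substrait plan fragments that each
// operator contributes during conversion.
case class SubstraitContext()
case class TransformContext(substraitRel: AnyRef)

// The common conversion/validation contract described in the SPIP.
trait TransformSupport extends SparkPlan {
  // Check this operator against the native backend; if validation fails, the
  // vanilla Spark operator is kept (fallback) and data formats are adapted.
  def doValidate(): Boolean

  // Convert this operator, together with its already-converted children, into
  // the Substrait-based common format.
  def doTransform(context: SubstraitContext): TransformContext
}

// Specialized variants by operator arity, mirroring Spark's
// LeafExecNode / UnaryExecNode / BinaryExecNode pattern.
trait LeafTransformSupport extends TransformSupport {
  final override def children: Seq[SparkPlan] = Nil
}

trait UnaryTransformSupport extends TransformSupport {
  def child: SparkPlan
  final override def children: Seq[SparkPlan] = child :: Nil
}

trait BinaryTransformSupport extends TransformSupport {
  def left: SparkPlan
  def right: SparkPlan
  final override def children: Seq[SparkPlan] = left :: right :: Nil
}
{code}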

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

  was:
SPIP doc: 
https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing

This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 


> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> [https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  
> Refined SPIP doc:
> [https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit#heading=h.ud7930xhlsm6]
>  
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical 

[jira] [Comment Edited] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-22 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839497#comment-17839497
 ] 

Ke Jia edited comment on SPARK-47773 at 4/22/24 6:22 AM:
-

We have refined the above SPIP in accordance with the specifications from the 
Spark community. The latest version of the SPIP is now available 
[here|https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit?usp=sharing].
We welcome and value your suggestions and comments.


was (Author: jk_self):
We have refined the above [SPIP  
|[https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]]in
 accordance with the specifications from the Spark community. The latest 
version of the SPIP is now available 
[here|https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit?usp=sharing].
  Welcome and value your suggestions and comments.

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different operator types into a 
> Substrait-based common format. The validation phase entails a thorough 
> assessment of the Substrait plan against native backends to ensure 
> compatibility. In instances where validation does not succeed, Spark's native 
> operators will be deployed, with requisite transformations to adapt data 
> formats accordingly. The proposal emphasizes the centrality of the plan 
> transformation phase, positing it as the foundational step. The subsequent 
> validation and fallback procedures are slated for consideration upon the 
> successful establishment of the initial phase.
> The integration of Gluten into Spark has already shown significant 
> performance improvements with ClickHouse and Velox backends and has been 
> successfully deployed in production by several customers. 






[jira] [Comment Edited] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-22 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839497#comment-17839497
 ] 

Ke Jia edited comment on SPARK-47773 at 4/22/24 6:22 AM:
-

We have refined the above 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing] 
in accordance with the specifications from the Spark community. The latest 
version of the SPIP is now available 
[here|https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit?usp=sharing].
We welcome and value your suggestions and comments.


was (Author: jk_self):
We have refined the above 
[SPIP|[https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]]
  in accordance with the specifications from the Spark community. The latest 
version of the SPIP is now available 
[here|https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit?usp=sharing].
  Welcome and value your suggestions and comments.

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different operator types into a 
> Substrait-based common format. The validation phase entails a thorough 
> assessment of the Substrait plan against native backends to ensure 
> compatibility. In instances where validation does not succeed, Spark's native 
> operators will be deployed, with requisite transformations to adapt data 
> formats accordingly. The proposal emphasizes the centrality of the plan 
> transformation phase, positing it as the foundational step. The subsequent 
> validation and fallback procedures are slated for consideration upon the 
> successful establishment of the initial phase.
> The integration of Gluten into Spark has already shown significant 
> performance improvements with ClickHouse and Velox backends and has been 
> successfully deployed in production by several customers. 






[jira] [Commented] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-22 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839497#comment-17839497
 ] 

Ke Jia commented on SPARK-47773:


We have refined the above 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing] 
in accordance with the specifications from the Spark community. The latest 
version of the SPIP is now available 
[here|https://docs.google.com/document/d/1oY26KtqXoJJNHbAhtmVgaXSVt6NlO6t1iWYEuGvCc1s/edit?usp=sharing].
We welcome and value your suggestions and comments.

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different operator types into a 
> Substrait-based common format. The validation phase entails a thorough 
> assessment of the Substrait plan against native backends to ensure 
> compatibility. In instances where validation does not succeed, Spark's native 
> operators will be deployed, with requisite transformations to adapt data 
> formats accordingly. The proposal emphasizes the centrality of the plan 
> transformation phase, positing it as the foundational step. The subsequent 
> validation and fallback procedures are slated for consideration upon the 
> successful establishment of the initial phase.
> The integration of Gluten into Spark has already shown significant 
> performance improvements with ClickHouse and Velox backends and has been 
> successfully deployed in production by several customers. 






[jira] [Commented] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835171#comment-17835171
 ] 

Ke Jia commented on SPARK-47773:


[~viirya] Great, I have added comment permissions in the SPIP Doc. I'm very 
much looking forward to your feedback. Thanks.

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different operator types into a 
> Substrait-based common format. The validation phase entails a thorough 
> assessment of the Substrait plan against native backends to ensure 
> compatibility. In instances where validation does not succeed, Spark's native 
> operators will be deployed, with requisite transformations to adapt data 
> formats accordingly. The proposal emphasizes the centrality of the plan 
> transformation phase, positing it as the foundational step. The subsequent 
> validation and fallback procedures are slated for consideration upon the 
> successful establishment of the initial phase.
> The integration of Gluten into Spark has already shown significant 
> performance improvements with ClickHouse and Velox backends and has been 
> successfully deployed in production by several customers. 






[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: 
SPIP doc: 
https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing

This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

  was:
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 


> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> SPIP doc: 
> https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: 
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

  was:
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 


> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: 
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

  was:
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase. 

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 


> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: 
This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency.

The design proposal advocates for the incorporation of the TransformSupport 
interface and its specialized variants—LeafTransformSupport, 
UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
streamlining the conversion of different operator types into a Substrait-based 
common format. The validation phase entails a thorough assessment of the 
Substrait plan against native backends to ensure compatibility. In instances 
where validation does not succeed, Spark's native operators will be deployed, 
with requisite transformations to adapt data formats accordingly. The proposal 
emphasizes the centrality of the plan transformation phase, positing it as the 
foundational step. The subsequent validation and fallback procedures are slated 
for consideration upon the successful establishment of the initial phase. 

The integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

  was:This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency. The design 
proposal advocates for the incorporation of the TransformSupport interface and 
its specialized variants—LeafTransformSupport, UnaryTransformSupport, and 
BinaryTransformSupport. These are instrumental in streamlining the conversion 
of different operator types into a Substrait-based common format. The 
validation phase entails a thorough assessment of the Substrait plan against 
native backends to ensure compatibility. In instances where validation does not 
succeed, Spark's native operators will be deployed, with requisite 
transformations to adapt data formats accordingly. The proposal emphasizes the 
centrality of the plan transformation phase, positing it as the foundational 
step. The subsequent validation and fallback procedures are slated for 
consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 


> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency.
> The design proposal advocates for the incorporation of the TransformSupport 
> interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: This 
[SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
 outlines the integration of Gluten's physical plan conversion, validation, and 
fallback framework into Apache Spark. The goal is to enhance Spark's 
flexibility and robustness in executing physical plans and to leverage Gluten's 
performance optimizations. Currently, Spark lacks an official cross-platform 
execution support for physical plans. Gluten's mechanism, which employs the 
Substrait standard, can convert and optimize Spark's physical plans, thus 
improving portability, interoperability, and execution efficiency. The design 
proposal advocates for the incorporation of the TransformSupport interface and 
its specialized variants—LeafTransformSupport, UnaryTransformSupport, and 
BinaryTransformSupport. These are instrumental in streamlining the conversion 
of different operator types into a Substrait-based common format. The 
validation phase entails a thorough assessment of the Substrait plan against 
native backends to ensure compatibility. In instances where validation does not 
succeed, Spark's native operators will be deployed, with requisite 
transformations to adapt data formats accordingly. The proposal emphasizes the 
centrality of the plan transformation phase, positing it as the foundational 
step. The subsequent validation and fallback procedures are slated for 
consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers.   (was: This SPIP outlines the 
integration of Gluten's physical plan conversion, validation, and fallback 
framework into Apache Spark. The goal is to enhance Spark's flexibility and 
robustness in executing physical plans and to leverage Gluten's performance 
optimizations. Currently, Spark lacks an official cross-platform execution 
support for physical plans. Gluten's mechanism, which employs the Substrait 
standard, can convert and optimize Spark's physical plans, thus improving 
portability, interoperability, and execution efficiency. The design proposal 
advocates for the incorporation of the TransformSupport interface and its 
specialized variants—LeafTransformSupport, UnaryTransformSupport, and 
BinaryTransformSupport. These are instrumental in streamlining the conversion 
of different operator types into a Substrait-based common format. The 
validation phase entails a thorough assessment of the Substrait plan against 
native backends to ensure compatibility. In instances where validation does not 
succeed, Spark's native operators will be deployed, with requisite 
transformations to adapt data formats accordingly. The proposal emphasizes the 
centrality of the plan transformation phase, positing it as the foundational 
step. The subsequent validation and fallback procedures are slated for 
consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. )

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> This 
> [SPIP|https://docs.google.com/document/d/1v7sndtIHIBdzc4YvLPI8InXxhI7SnnAQ5HvmM2DGjVE/edit?usp=sharing]
>  outlines the integration of Gluten's physical plan conversion, validation, 
> and fallback framework into Apache Spark. The goal is to enhance Spark's 
> flexibility and robustness in executing physical plans and to leverage 
> Gluten's performance optimizations. Currently, Spark lacks an official 
> cross-platform execution support for physical plans. Gluten's mechanism, 
> which employs the Substrait standard, can convert and optimize Spark's 
> physical plans, thus improving portability, interoperability, and execution 
> efficiency. The design proposal advocates for the incorporation of the 
> TransformSupport interface and its specialized variants—LeafTransformSupport, 
> UnaryTransformSupport, and BinaryTransformSupport. These are instrumental in 
> streamlining the conversion of different operator types into a 
> Substrait-based common format. The validation phase entails a thorough 
> 

[jira] [Updated] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-47773:
---
Description: This SPIP outlines the integration of Gluten's physical plan 
conversion, validation, and fallback framework into Apache Spark. The goal is 
to enhance Spark's flexibility and robustness in executing physical plans and 
to leverage Gluten's performance optimizations. Currently, Spark lacks an 
official cross-platform execution support for physical plans. Gluten's 
mechanism, which employs the Substrait standard, can convert and optimize 
Spark's physical plans, thus improving portability, interoperability, and 
execution efficiency. The design proposal advocates for the incorporation of 
the TransformSupport interface and its specialized 
variants—LeafTransformSupport, UnaryTransformSupport, and 
BinaryTransformSupport. These are instrumental in streamlining the conversion 
of different operator types into a Substrait-based common format. The 
validation phase entails a thorough assessment of the Substrait plan against 
native backends to ensure compatibility. In instances where validation does not 
succeed, Spark's native operators will be deployed, with requisite 
transformations to adapt data formats accordingly. The proposal emphasizes the 
centrality of the plan transformation phase, positing it as the foundational 
step. The subsequent validation and fallback procedures are slated for 
consideration upon the successful establishment of the initial phase.  The 
integration of Gluten into Spark has already shown significant performance 
improvements with ClickHouse and Velox backends and has been successfully 
deployed in production by several customers. 

> Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on 
> Various Native Engines
> 
>
> Key: SPARK-47773
> URL: https://issues.apache.org/jira/browse/SPARK-47773
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Ke Jia
>Priority: Major
>
> This SPIP outlines the integration of Gluten's physical plan conversion, 
> validation, and fallback framework into Apache Spark. The goal is to enhance 
> Spark's flexibility and robustness in executing physical plans and to 
> leverage Gluten's performance optimizations. Currently, Spark lacks an 
> official cross-platform execution support for physical plans. Gluten's 
> mechanism, which employs the Substrait standard, can convert and optimize 
> Spark's physical plans, thus improving portability, interoperability, and 
> execution efficiency. The design proposal advocates for the incorporation of 
> the TransformSupport interface and its specialized 
> variants—LeafTransformSupport, UnaryTransformSupport, and 
> BinaryTransformSupport. These are instrumental in streamlining the conversion 
> of different operator types into a Substrait-based common format. The 
> validation phase entails a thorough assessment of the Substrait plan against 
> native backends to ensure compatibility. In instances where validation does 
> not succeed, Spark's native operators will be deployed, with requisite 
> transformations to adapt data formats accordingly. The proposal emphasizes 
> the centrality of the plan transformation phase, positing it as the 
> foundational step. The subsequent validation and fallback procedures are 
> slated for consideration upon the successful establishment of the initial 
> phase.  The integration of Gluten into Spark has already shown significant 
> performance improvements with ClickHouse and Velox backends and has been 
> successfully deployed in production by several customers. 






[jira] [Created] (SPARK-47773) Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-08 Thread Ke Jia (Jira)
Ke Jia created SPARK-47773:
--

 Summary: Enhancing the Flexibility of Spark's Physical Plan to 
Enable Execution on Various Native Engines
 Key: SPARK-47773
 URL: https://issues.apache.org/jira/browse/SPARK-47773
 Project: Spark
  Issue Type: Epic
  Components: SQL
Affects Versions: 3.5.1
Reporter: Ke Jia









[jira] [Updated] (SPARK-43814) Spark cannot construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-31 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Description: 
 

When constructing the DecimalType in 
CatalystTypeConverters.convertToCatalyst(), Spark throws the following exception:

 
{code:java}
Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
        at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
        at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
        at 
org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
 

 

This issue can be reproduced by the following case:

 
{code:java}
  val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
  val schema = StructType(
    StructField("a", IntegerType, nullable = true) :: Nil)
  val empData = Seq(Row(1))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
schema)
  val resultDF = df.select(Column(expression))
  val result = resultDF.collect().head.get(0)
  CatalystTypeConverters.convertToCatalyst(result)

{code}
 

It seems that the reason for the failure is that the value of precision is not 
set when the Decimal.toJavaBigDecimal() method is called. However, Java 
BigDecimal only provides an interface for modifying the scale and does not 
provide one for modifying the precision.
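
A small Scala sketch of what the reproduction appears to boil down to (an illustration based on the error message and stack trace, not a verified walk through the converter's source): Literal.default(DecimalType.SYSTEM_DEFAULT) evaluates to the value 0 with scale 18, collect() returns it as a java.math.BigDecimal, and java.math.BigDecimal reports precision 1 for any zero value, so rebuilding a DecimalType from that value's own (precision, scale) pair produces the invalid combination (1, 18).

{code:java}
import java.math.{BigDecimal => JBigDecimal}
import org.apache.spark.sql.types.DecimalType

// Roughly what comes back from collect() for the default decimal literal:
// zero with the SYSTEM_DEFAULT scale of 18.
val collected = new JBigDecimal("0E-18")

println(collected.precision)  // 1  -- BigDecimal counts the value 0 as one digit
println(collected.scale)      // 18

// Rebuilding the type from the value's own precision/scale is what fails:
// "Decimal scale (18) cannot be greater than precision (1)."
val rebuilt = DecimalType(collected.precision, collected.scale)
{code}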

 

  was:
 

When using the df.collect() result to construct the DecimalType in 
CatalystTypeConverters.convertToCatalyst()

 
{code:java}
Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
        at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
        at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
        at 
org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
 

 

This issue can be reproduced by the following case:

 
{code:java}
  val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
  val schema = StructType(
    StructField("a", IntegerType, nullable = true) :: Nil)
  val empData = Seq(Row(1))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
schema)
  val resultDF = df.select(Column(expression))
  val result = resultDF.collect().head.get(0)
  CatalystTypeConverters.convertToCatalyst(result)

{code}
 

It seems that the reason for the failure is that the value of precision is not 
set when the Decimal.toJavaBigDecimal() method is called. However, Java 
BigDecimal only provides an interface for modifying scale and does not provide 
an interface for modifying Precision.

 


> Spark cannot construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> 
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
>  
> When constructing the DecimalType in 
> CatalystTypeConverters.convertToCatalyst(), Spark throws the following exception:
>  
> {code:java}
> Decimal scale (18) cannot be greater than precision (1).
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).
>         at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
>         at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
>         at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
>         at 
> org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
>  
>  
> This issue can be reproduced by the following case:
>  
> {code:java}
>   val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
>   val schema = StructType(
>     StructField("a", IntegerType, nullable = true) :: Nil)
>   val empData = Seq(Row(1))
>   val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
> schema)
>   val resultDF = df.select(Column(expression))
>   val result = resultDF.collect().head.get(0)
>   CatalystTypeConverters.convertToCatalyst(result)
> {code}
>  
> It seems that the reason for the failure is 

[jira] [Updated] (SPARK-43814) Spark cannot construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-31 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Summary: Spark cannot construct the DecimalType in 
CatalystTypeConverters.convertToCatalyst() API  (was: Spark cannot use the 
df.collect() result to construct the DecimalType in 
CatalystTypeConverters.convertToCatalyst() API)

> Spark cannot construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> 
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
>  
> When using the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst()
>  
> {code:java}
> Decimal scale (18) cannot be greater than precision (1).
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).
>         at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
>         at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
>         at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
>         at 
> org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
>  
>  
> This issue can be reproduced by the following case:
>  
> {code:java}
>   val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
>   val schema = StructType(
>     StructField("a", IntegerType, nullable = true) :: Nil)
>   val empData = Seq(Row(1))
>   val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
> schema)
>   val resultDF = df.select(Column(expression))
>   val result = resultDF.collect().head.get(0)
>   CatalystTypeConverters.convertToCatalyst(result)
> {code}
>  
> It seems that the reason for the failure is that the value of precision is 
> not set when the Decimal.toJavaBigDecimal() method is called. However, Java 
> BigDecimal only provides an interface for modifying scale and does not 
> provide an interface for modifying Precision.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Description: 
 

When using the df.collect() result to construct the DecimalType in 
CatalystTypeConverters.convertToCatalyst()

 
{code:java}
Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
        at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
        at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
        at 
org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
 

 

This issue can be reproduced by the following case:

 
{code:java}
  val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
  val schema = StructType(
    StructField("a", IntegerType, nullable = true) :: Nil)
  val empData = Seq(Row(1))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
schema)
  val resultDF = df.select(Column(expression))
  val result = resultDF.collect().head.get(0)
  CatalystTypeConverters.convertToCatalyst(result)

{code}
 

It seems that the reason for the failure is that the value of precision is not 
set when the Decimal.toJavaBigDecimal() method is called. However, Java 
BigDecimal only provides an interface for modifying scale and does not provide 
an interface for modifying Precision.

 

  was:
When using the df.collect() result to construct the DecimalType in 

Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).


> Spark cannot use the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> ---
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
>  
> When using the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst()
>  
> {code:java}
> Decimal scale (18) cannot be greater than precision (1).
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).
>         at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.decimalCannotGreaterThanPrecisionError(QueryCompilationErrors.scala:1671)
>         at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:48)
>         at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertToCatalyst(CatalystTypeConverters.scala:518)
>         at 
> org.apache.spark.sql.DataFrameFunctionsSuite.$anonfun$new$712(DataFrameFunctionsSuite.scala:3714){code}
>  
>  
> This issue can be reproduced by the following case:
>  
> {code:java}
>   val expression = Literal.default(DecimalType.SYSTEM_DEFAULT)
>   val schema = StructType(
>     StructField("a", IntegerType, nullable = true) :: Nil)
>   val empData = Seq(Row(1))
>   val df = spark.createDataFrame(spark.sparkContext.parallelize(empData), 
> schema)
>   val resultDF = df.select(Column(expression))
>   val result = resultDF.collect().head.get(0)
>   CatalystTypeConverters.convertToCatalyst(result)
> {code}
>  
> It seems that the reason for the failure is that the value of precision is 
> not set when the Decimal.toJavaBigDecimal() method is called. However, Java 
> BigDecimal only provides an interface for modifying scale and does not 
> provide an interface for modifying Precision.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43814:
---
Description: 
When using the df.collect() result to construct the DecimalType in 

Decimal scale (18) cannot be greater than precision (1).
org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
than precision (1).

> Spark cannot use the df.collect() result to construct the DecimalType in 
> CatalystTypeConverters.convertToCatalyst() API
> ---
>
> Key: SPARK-43814
> URL: https://issues.apache.org/jira/browse/SPARK-43814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
> When using the df.collect() result to construct the DecimalType in 
> Decimal scale (18) cannot be greater than precision (1).
> org.apache.spark.sql.AnalysisException: Decimal scale (18) cannot be greater 
> than precision (1).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43814) Spark cannot use the df.collect() result to construct the DecimalType in CatalystTypeConverters.convertToCatalyst() API

2023-05-26 Thread Ke Jia (Jira)
Ke Jia created SPARK-43814:
--

 Summary: Spark cannot use the df.collect() result to construct the 
DecimalType in CatalystTypeConverters.convertToCatalyst() API
 Key: SPARK-43814
 URL: https://issues.apache.org/jira/browse/SPARK-43814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.2.2
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43240) df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]

2023-04-23 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43240:
---
Affects Version/s: 3.3.2
   (was: 3.2.2)

> df.describe() method may return wrong result if the last RDD is 
> RDD[UnsafeRow]
> ---
>
> Key: SPARK-43240
> URL: https://issues.apache.org/jira/browse/SPARK-43240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.2
>Reporter: Ke Jia
>Priority: Major
>
> When calling the df.describe() method, the result may be wrong when the last 
> RDD is RDD[UnsafeRow]. This is because the UnsafeRow will be released after the 
> row is used. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43240) df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]

2023-04-22 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-43240:
---
Description: When calling the df.describe() method, the result may be wrong 
when the last RDD is RDD[UnsafeRow]. This is because the UnsafeRow will be 
released after the row is used. 
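
A brief editorial sketch of the hazard described above, assuming the rows are consumed 
from an iterator of InternalRow; the helper name is illustrative and not taken from the 
actual fix:

{code:java}
// UnsafeRow instances can share and overwrite the same backing buffer, so rows
// that are retained across iterations need an explicit copy.
import org.apache.spark.sql.catalyst.InternalRow

def retainAll(rows: Iterator[InternalRow]): Array[InternalRow] =
  rows.map(_.copy()).toArray  // without copy(), later rows may clobber earlier ones
{code}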

> df.describe() method may return wrong result if the last RDD is 
> RDD[UnsafeRow]
> ---
>
> Key: SPARK-43240
> URL: https://issues.apache.org/jira/browse/SPARK-43240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2
>Reporter: Ke Jia
>Priority: Major
>
> When calling the df.describe() method, the result may be wrong when the last 
> RDD is RDD[UnsafeRow]. This is because the UnsafeRow will be released after the 
> row is used. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43240) df.describe() method may return wrong result if the last RDD is RDD[UnsafeRow]

2023-04-22 Thread Ke Jia (Jira)
Ke Jia created SPARK-43240:
--

 Summary: df.describe() method may return wrong result if the last 
RDD is RDD[UnsafeRow]
 Key: SPARK-43240
 URL: https://issues.apache.org/jira/browse/SPARK-43240
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.2
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36898) Make the shuffle hash join factor configurable

2021-09-29 Thread Ke Jia (Jira)
Ke Jia created SPARK-36898:
--

 Summary: Make the shuffle hash join factor configurable
 Key: SPARK-36898
 URL: https://issues.apache.org/jira/browse/SPARK-36898
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Ke Jia


Make the shuffle hash join factor configurable.
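
A hedged sketch of what "factor" refers to here; the helper below and the config key in 
the comment follow the proposed change but are assumptions, not the merged implementation:

{code:java}
// One plan side is treated as "much smaller" than the other when its size,
// multiplied by the factor, still fits within the other side's size.
def muchSmaller(aBytes: BigInt, bBytes: BigInt, factor: Int): Boolean =
  aBytes * factor <= bBytes

// e.g. spark.conf.set("spark.sql.shuffleHashJoinFactor", "3")  // assumed key
{code}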



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange

2021-06-10 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-35710:
---
Description: Support DPP + AQE when no reused broadcast exchange.

> Support DPP + AQE when no reused broadcast exchange
> ---
>
> Key: SPARK-35710
> URL: https://issues.apache.org/jira/browse/SPARK-35710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ke Jia
>Priority: Major
>
> Support DPP + AQE when no reused broadcast exchange.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35710) Support DPP + AQE when no reused broadcast exchange

2021-06-10 Thread Ke Jia (Jira)
Ke Jia created SPARK-35710:
--

 Summary: Support DPP + AQE when no reused broadcast exchange
 Key: SPARK-35710
 URL: https://issues.apache.org/jira/browse/SPARK-35710
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34637) Support DPP in AQE when the broadcast exchange can reused

2021-03-04 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-34637:
---
Description: We have supported DPP in AQE when the join is a broadcast hash 
join before applying the AQE rules in SPARK-34168, which has some limitations: it 
only applies DPP when the small table side is executed first, so that the big 
table side can reuse the broadcast exchange from the small table side. This Jira 
addresses the above limitations and applies DPP whenever the broadcast exchange 
can be reused.

> Support DPP in AQE when the broadcast exchange can reused
> -
>
> Key: SPARK-34637
> URL: https://issues.apache.org/jira/browse/SPARK-34637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ke Jia
>Priority: Major
>
> We have supported DPP in AQE when the join is a broadcast hash join before 
> applying the AQE rules in SPARK-34168, which has some limitations: it only 
> applies DPP when the small table side is executed first, so that the big table 
> side can reuse the broadcast exchange from the small table side. This Jira 
> addresses the above limitations and applies DPP whenever the broadcast exchange 
> can be reused.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34637) Support DPP in AQE when the broadcast exchange can be reused

2021-03-04 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-34637:
---
Summary: Support DPP in AQE when the broadcast exchange can be reused  
(was: Support DPP in AQE when the boradcast exchange can reused)

> Support DPP in AQE when the broadcast exchange can be reused
> 
>
> Key: SPARK-34637
> URL: https://issues.apache.org/jira/browse/SPARK-34637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ke Jia
>Priority: Major
>
> We have supported DPP in AQE when the join is a broadcast hash join before 
> applying the AQE rules in SPARK-34168, which has some limitations: it only 
> applies DPP when the small table side is executed first, so that the big table 
> side can reuse the broadcast exchange from the small table side. This Jira 
> addresses the above limitations and applies DPP whenever the broadcast exchange 
> can be reused.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34637) Support DPP in AQE when the broadcast exchange can reused

2021-03-04 Thread Ke Jia (Jira)
Ke Jia created SPARK-34637:
--

 Summary: Support DPP in AQE when the broadcast exchange can reused
 Key: SPARK-34637
 URL: https://issues.apache.org/jira/browse/SPARK-34637
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-01-19 Thread Ke Jia (Jira)
Ke Jia created SPARK-34168:
--

 Summary: Support DPP in AQE When the join is Broadcast hash join 
before applying the AQE rules
 Key: SPARK-34168
 URL: https://issues.apache.org/jira/browse/SPARK-34168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1, 3.0.0
Reporter: Ke Jia


Currently AQE and DPP cannot be applied at the same time. This PR will enable both 
AQE and DPP when the join is a broadcast hash join planned before applying the AQE rules.
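
For reference, a minimal illustration of the two features involved, enabled via their 
standard configs on a SparkSession named spark (both keys exist in Spark 3.x; whether 
the two take effect together is exactly what this issue addresses):

{code:java}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}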



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31524) Add metric to the split number for skew partition when enable AQE

2020-04-22 Thread Ke Jia (Jira)
Ke Jia created SPARK-31524:
--

 Summary: Add metric to the split  number for skew partition when 
enable AQE
 Key: SPARK-31524
 URL: https://issues.apache.org/jira/browse/SPARK-31524
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Add detailed metrics for the split number in skewed partitions when AQE and skew 
join optimization are enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions

2020-02-21 Thread Ke Jia (Jira)
Ke Jia created SPARK-30922:
--

 Summary: Remove the max split config after changing the multi sub 
joins to multi sub partitions
 Key: SPARK-30922
 URL: https://issues.apache.org/jira/browse/SPARK-30922
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


After merging PR#27493, we no longer need the 
"spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config to 
resolve the UI issue when splitting into more sub joins. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30864) Add the user guide for Adaptive Query Execution

2020-02-17 Thread Ke Jia (Jira)
Ke Jia created SPARK-30864:
--

 Summary: Add the user guide for Adaptive Query Execution
 Key: SPARK-30864
 URL: https://issues.apache.org/jira/browse/SPARK-30864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


This Jira will add a detailed user guide describing how to enable AQE and its 
three main features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30188) Fix tests when enable Adaptive Query Execution

2020-02-04 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30188:
---
Description: Fix the failed unit tests when Adaptive Query Execution is 
enabled.  (was: Enable Adaptive Query Execution default)

> Fix tests when enable Adaptive Query Execution
> --
>
> Key: SPARK-30188
> URL: https://issues.apache.org/jira/browse/SPARK-30188
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Fix the failed unit tests when Adaptive Query Execution is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE

2020-01-17 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30549:
---
Summary: Fix the subquery metrics showing issue in UI When enable AQE  
(was: Fix the subquery metrics showing issue in UI)

> Fix the subquery metrics showing issue in UI When enable AQE
> 
>
> Key: SPARK-30549
> URL: https://issues.apache.org/jira/browse/SPARK-30549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After merged [PR#25316|[https://github.com/apache/spark/pull/25316]], the 
> subquery metrics can not be shown in UI. This PR will fix the subquery shown 
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE

2020-01-17 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30549:
---
Description: After merged [https://github.com/apache/spark/pull/25316], the 
subquery metrics can not be shown in UI when enable AQE. This PR will fix the 
subquery shown issue.  (was: After merged 
[PR#25316|[https://github.com/apache/spark/pull/25316]], the subquery metrics 
can not be shown in UI. This PR will fix the subquery shown issue.)

> Fix the subquery metrics showing issue in UI When enable AQE
> 
>
> Key: SPARK-30549
> URL: https://issues.apache.org/jira/browse/SPARK-30549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After merged [https://github.com/apache/spark/pull/25316], the subquery 
> metrics can not be shown in UI when enable AQE. This PR will fix the subquery 
> shown issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30549) Fix the subquery metrics showing issue in UI

2020-01-17 Thread Ke Jia (Jira)
Ke Jia created SPARK-30549:
--

 Summary: Fix the subquery metrics showing issue in UI
 Key: SPARK-30549
 URL: https://issues.apache.org/jira/browse/SPARK-30549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


After merged [PR#25316|[https://github.com/apache/spark/pull/25316]], the 
subquery metrics can not be shown in UI. This PR will fix the subquery shown 
issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-15 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30524:
---
Description: The OptimizeSkewedJoin rule will break the outputPartitioning of the 
original SMJ, and it may introduce an additional shuffle after applying 
OptimizeSkewedJoin. This PR will disable the "OptimizeSkewedJoin" rule if it 
introduces an additional shuffle.

> Disable OptimizeSkewJoin rule if introducing additional shuffle.
> 
>
> Key: SPARK-30524
> URL: https://issues.apache.org/jira/browse/SPARK-30524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> The OptimizeSkewedJoin rule will break the outputPartitioning of the original 
> SMJ, and it may introduce an additional shuffle after applying OptimizeSkewedJoin. 
> This PR will disable the "OptimizeSkewedJoin" rule if it introduces an additional shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-15 Thread Ke Jia (Jira)
Ke Jia created SPARK-30524:
--

 Summary: Disable OptimizeSkewJoin rule if introducing additional 
shuffle.
 Key: SPARK-30524
 URL: https://issues.apache.org/jira/browse/SPARK-30524
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30407:
---
Description: 
Working on [https://github.com/apache/spark/pull/26813]. With AQE on, the 
metric info of AdaptiveSparkPlanExec does not reset when running the test 
DataFrameCallbackSuite#get numRows metrics by callback.

 

  was:
Working on [PR#26813|[https://github.com/apache/spark/pull/26813]]. With on 
AQE, the metric info of AdaptiveSparkPlanExec does not reset when running the 
test DataFrameCallbackSuite#get numRows metrics by callback.

 


> reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe
> 
>
> Key: SPARK-30407
> URL: https://issues.apache.org/jira/browse/SPARK-30407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> Working on [https://github.com/apache/spark/pull/26813]. With AQE on, the 
> metric info of AdaptiveSparkPlanExec does not reset when running the test 
> DataFrameCallbackSuite#get numRows metrics by callback.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Ke Jia (Jira)
Ke Jia created SPARK-30407:
--

 Summary: reset the metrics info of AdaptiveSparkPlanExec plan when 
enable aqe
 Key: SPARK-30407
 URL: https://issues.apache.org/jira/browse/SPARK-30407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Working on [PR#26813|[https://github.com/apache/spark/pull/26813]]. With on 
AQE, the metric info of AdaptiveSparkPlanExec does not reset when running the 
test DataFrameCallbackSuite#get numRows metrics by callback.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-01 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30403:
---
Description: After merging [https://github.com/apache/spark/pull/25854], we 
also need to handle the InSubquery case when building the SubqueryMap in the 
InsertAdaptiveSparkPlan rule. Otherwise we will get a 
NoSuchElementException.  (was: After merged [link 
title|[https://github.com/apache/spark/pull/25854]], we also need to handle the 
Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan rule. 
Otherwise we will  get the NoSuchElementException exception.)

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After merging [https://github.com/apache/spark/pull/25854], we also need to 
> handle the InSubquery case when building the SubqueryMap in the 
> InsertAdaptiveSparkPlan rule. Otherwise we will get a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-01 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30403:
---
Description: After merged [link 
title|[https://github.com/apache/spark/pull/25854]], we also need to handle the 
Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan rule. 
Otherwise we will  get the NoSuchElementException exception.  (was: After 
merged [PR25854|[https://github.com/apache/spark/pull/25854]], we also need to 
handle the Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan 
rule. Otherwise we will  get the NoSuchElementException exception.)

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After merged [link title|[https://github.com/apache/spark/pull/25854]], we 
> also need to handle the Insubquery case when build SubqueryMap in 
> InsertAdaptiveSparkPlan rule. Otherwise we will  get the 
> NoSuchElementException exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-01 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30403:
---
Description: After merged 
[PR25854|[https://github.com/apache/spark/pull/25854]], we also need to handle 
the Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan rule. 
Otherwise we will  get the NoSuchElementException exception.  (was: After 
merged [PR#|[https://github.com/apache/spark/pull/25854]], we also need to 
handle the Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan 
rule. Otherwise we will  get the NoSuchElementException exception.)

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After merged [PR25854|[https://github.com/apache/spark/pull/25854]], we also 
> need to handle the Insubquery case when build SubqueryMap in 
> InsertAdaptiveSparkPlan rule. Otherwise we will  get the 
> NoSuchElementException exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-01 Thread Ke Jia (Jira)
Ke Jia created SPARK-30403:
--

 Summary: Fix the NoSuchElementException exception when enable AQE 
with InSubquery use case
 Key: SPARK-30403
 URL: https://issues.apache.org/jira/browse/SPARK-30403
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


After merged [PR#|[https://github.com/apache/spark/pull/25854]], we also need 
to handle the Insubquery case when build SubqueryMap in InsertAdaptiveSparkPlan 
rule. Otherwise we will  get the NoSuchElementException exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30291) Catch the exception when do materialize in AQE

2019-12-17 Thread Ke Jia (Jira)
Ke Jia created SPARK-30291:
--

 Summary: Catch the exception when do materialize in AQE
 Key: SPARK-30291
 URL: https://issues.apache.org/jira/browse/SPARK-30291
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


We need to catch the exception when doing materialization in the QueryStage of AQE, 
so that the user can get more information about the exception.
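
A rough editorial sketch of the idea, with illustrative names that are not taken from 
the actual patch: wrap stage materialization so that a failure surfaces with the failed 
stage attached for context.

{code:java}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.control.NonFatal

def materializeWithContext(stageId: Int)(doMaterialize: => Future[Any]): Future[Any] =
  (try doMaterialize catch { case NonFatal(e) => Future.failed(e) })
    .recoverWith { case NonFatal(e) =>
      Future.failed(new RuntimeException(s"Query stage $stageId failed to materialize", e))
    }
{code}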



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30232) Fix the ArithmeticException by zero when enable AQE

2019-12-12 Thread Ke Jia (Jira)
Ke Jia created SPARK-30232:
--

 Summary: Fix the ArithmeticException by zero when enable AQE
 Key: SPARK-30232
 URL: https://issues.apache.org/jira/browse/SPARK-30232
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Add a check for the divisor to avoid the ArithmeticException caused by division by zero.
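
A minimal sketch of the kind of guard described; the function and parameter names are 
illustrative, not taken from the actual patch:

{code:java}
// Guard the division so an empty input yields 0 instead of an ArithmeticException.
def safeAverage(totalBytes: Long, numPartitions: Int): Long =
  if (numPartitions == 0) 0L else totalBytes / numPartitions
{code}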



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30213) Remove the mutable status in QueryStage when enable AQE

2019-12-10 Thread Ke Jia (Jira)
Ke Jia created SPARK-30213:
--

 Summary: Remove the mutable status in QueryStage when enable AQE
 Key: SPARK-30213
 URL: https://issues.apache.org/jira/browse/SPARK-30213
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Currently ShuffleQueryStageExec contains mutable state, e.g. the 
mapOutputStatisticsFuture variable, so it is not easy to carry it over when we copy 
ShuffleQueryStageExec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30188) Enable Adaptive Query Execution default

2019-12-09 Thread Ke Jia (Jira)
Ke Jia created SPARK-30188:
--

 Summary: Enable Adaptive Query Execution default
 Key: SPARK-30188
 URL: https://issues.apache.org/jira/browse/SPARK-30188
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Enable Adaptive Query Execution default



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29954) collect the runtime statistics of row count in map stage

2019-12-01 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985853#comment-16985853
 ] 

Ke Jia commented on SPARK-29954:


[~hyukjin.kwon] Added the Jira description. Thanks.

> collect the runtime statistics of row count in map stage
> 
>
> Key: SPARK-29954
> URL: https://issues.apache.org/jira/browse/SPARK-29954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> We need the row count info to more accurately estimate the data skew 
> situation when there is too much duplicated data. This PR will collect the row 
> count info in the map stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29954) collect the runtime statistics of row count in map stage

2019-12-01 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-29954:
---
Description: We need the row count info to more accurately estimate the 
data skew situation when there is too much duplicated data. This PR will collect 
the row count info in the map stage.

> collect the runtime statistics of row count in map stage
> 
>
> Key: SPARK-29954
> URL: https://issues.apache.org/jira/browse/SPARK-29954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> We need the row count info to more accurately estimate the data skew 
> situation when there is too much duplicated data. This PR will collect the row 
> count info in the map stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29954) collect the runtime statistics of row count in map stage

2019-11-18 Thread Ke Jia (Jira)
Ke Jia created SPARK-29954:
--

 Summary: collect the runtime statistics of row count in map stage
 Key: SPARK-29954
 URL: https://issues.apache.org/jira/browse/SPARK-29954
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi

2019-11-14 Thread Ke Jia (Jira)
Ke Jia created SPARK-29893:
--

 Summary: Improve the local reader performance by changing the task 
number from 1 to multi
 Key: SPARK-29893
 URL: https://issues.apache.org/jira/browse/SPARK-29893
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Currently the local reader reads all the partitions of the map stage using only 1 
task, which may cause performance degradation. This PR will improve the 
performance by using multiple tasks instead of one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-11 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-29792:
---
Description: After merged 
[SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], the subqueries 
info can not be updated in AQE. And this Jira will fix it.

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>
> After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], 
> the subqueries info can not be updated in AQE. And this Jira will fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins

2019-10-22 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-29552:
---
Description: AQE will optimize the logical plan once a query stage has finished. 
For an inner join, both join sides may be small enough to be the build side. The 
planner that converts the logical plan to a physical plan will select the build 
side as BuildRight if the right side finishes first, or BuildLeft if the left 
side finishes first. In some cases, BuildRight or BuildLeft may introduce an 
additional exchange at the parent node. The revert approach in the 
OptimizeLocalShuffleReader rule may be too conservative: it reverts all local 
shuffle readers when an additional exchange is introduced, instead of reverting 
only the local shuffle readers that introduced the shuffle. It may also be 
expensive to revert only those readers. The workaround is to apply the 
OptimizeLocalShuffleReader rule again when creating a new query stage, to 
further optimize the sub-tree shuffle readers into local shuffle readers.  (was: AQE will 
optimize the logical plan once there is query stage finished. So for inner 
join, when two join side is all small to be the build side. The planner of 
converting logical plan to physical plan will select the build side as 
BuildRight if right side finished firstly or BuildLeft if left side finished 
firstly. In some case, when BuildRight or BuildLeft may introduce additioanl 
exchange to the parent node. The revert approach in OptimizeLocalShuffleReader 
rule may be too conservative, which revert all the local shuffle reader when 
introduce additional exchange not  revert the local shuffle reader introduced 
shuffle.  It may be expense to only revert the local shuffle reader introduced 
shuffle. The workaround is to apply the OptimizeLocalShuffleReader rule again 
when creating new query stage to further optimize the sub tree shuffle reader 
to local shuffle reader.)

> Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins
> 
>
> Key: SPARK-29552
> URL: https://issues.apache.org/jira/browse/SPARK-29552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> AQE will optimize the logical plan once a query stage has finished. For an inner 
> join, both join sides may be small enough to be the build side. The planner that 
> converts the logical plan to a physical plan will select the build side as 
> BuildRight if the right side finishes first, or BuildLeft if the left side 
> finishes first. In some cases, BuildRight or BuildLeft may introduce an 
> additional exchange at the parent node. The revert approach in the 
> OptimizeLocalShuffleReader rule may be too conservative: it reverts all local 
> shuffle readers when an additional exchange is introduced, instead of reverting 
> only the local shuffle readers that introduced the shuffle. It may also be 
> expensive to revert only those readers. The workaround is to apply the 
> OptimizeLocalShuffleReader rule again when creating a new query stage, to 
> further optimize the sub-tree shuffle readers into local shuffle readers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29552) Fix the flaky test failed in AdaptiveQueryExecSuite # multiple joins

2019-10-22 Thread Ke Jia (Jira)
Ke Jia created SPARK-29552:
--

 Summary: Fix the flaky test failed in AdaptiveQueryExecSuite # 
multiple joins
 Key: SPARK-29552
 URL: https://issues.apache.org/jira/browse/SPARK-29552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


AQE will optimize the logical plan once a query stage has finished. For an inner 
join, both join sides may be small enough to be the build side. The planner that 
converts the logical plan to a physical plan will select the build side as 
BuildRight if the right side finishes first, or BuildLeft if the left side 
finishes first. In some cases, BuildRight or BuildLeft may introduce an 
additional exchange at the parent node. The revert approach in the 
OptimizeLocalShuffleReader rule may be too conservative: it reverts all local 
shuffle readers when an additional exchange is introduced, instead of reverting 
only the local shuffle readers that introduced the shuffle. It may also be 
expensive to revert only those readers. The workaround is to apply the 
OptimizeLocalShuffleReader rule again when creating a new query stage, to 
further optimize the sub-tree shuffle readers into local shuffle readers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2019-10-22 Thread Ke Jia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956744#comment-16956744
 ] 

Ke Jia commented on SPARK-29544:


[~wenchen] The attachment  is the design doc of skew join optimization. We can 
have a discussion based on this doc. Thanks.

> Optimize skewed join at runtime with new Adaptive Execution
> ---
>
> Key: SPARK-29544
> URL: https://issues.apache.org/jira/browse/SPARK-29544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Skewed Join Optimization Design Doc.docx
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
> used to handle the skew join optimization based on the runtime statistics 
> (data size and row count).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2019-10-22 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-29544:
---
Attachment: Skewed Join Optimization Design Doc.docx

> Optimize skewed join at runtime with new Adaptive Execution
> ---
>
> Key: SPARK-29544
> URL: https://issues.apache.org/jira/browse/SPARK-29544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Skewed Join Optimization Design Doc.docx
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
> used to handle the skew join optimization based on the runtime statistics 
> (data size and row count).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2019-10-22 Thread Ke Jia (Jira)
Ke Jia created SPARK-29544:
--

 Summary: Optimize skewed join at runtime with new Adaptive 
Execution
 Key: SPARK-29544
 URL: https://issues.apache.org/jira/browse/SPARK-29544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Implement a rule in the new adaptive execution framework introduced in 
[SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
used to handle the skew join optimization based on the runtime statistics (data 
size and row count).
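
A hedged illustration of the kind of runtime check such a rule performs; the factor and 
threshold mirror the eventual AQE skew-join settings, but the exact names and defaults 
here are assumptions:

{code:java}
// A partition is treated as skewed when it is both much larger than the median
// partition and larger than an absolute threshold.
def isSkewedPartition(
    sizeInBytes: Long,
    medianSize: Long,
    factor: Int = 5,
    thresholdInBytes: Long = 256L * 1024 * 1024): Boolean =
  sizeInBytes > factor * medianSize && sizeInBytes > thresholdInBytes
{code}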



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28576) Fix the dead lock issue when enable new adaptive execution

2019-07-30 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28576:
---
Description: After enabling AE (SPARK-23128), we found a deadlock issue in 
Q6 of the 1TB TPC-DS benchmark. The root cause is that the subquery thread and 
the main thread are waiting for the same object.  (was: After enable 
AE([lSPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]), we found 
the dead lock issue in Q6 1TB TPC-DS. The root cause is that the subquery 
thread and the main thread is waiting for the same object.)

> Fix the dead lock issue  when enable new adaptive execution
> ---
>
> Key: SPARK-28576
> URL: https://issues.apache.org/jira/browse/SPARK-28576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ke Jia
>Priority: Major
> Attachments: jstack.log, physical plan.txt, q6.sql
>
>
> After enabling AE (SPARK-23128), we found a deadlock issue in Q6 of the 1TB 
> TPC-DS benchmark. The root cause is that the subquery thread and the main thread 
> are waiting for the same object.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28576) Fix the dead lock issue when enable new adaptive execution

2019-07-30 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28576:
---
Description: After enable 
AE([lSPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]), we found 
the dead lock issue in Q6 1TB TPC-DS. The root cause is that the subquery 
thread and the main thread is waiting for the same object.  (was: After enable 
AE, we found the dead lock issue in Q6 1TB TPC-DS. The root cause is that the 
subquery thread and the main thread is waiting for the same object. )

> Fix the dead lock issue  when enable new adaptive execution
> ---
>
> Key: SPARK-28576
> URL: https://issues.apache.org/jira/browse/SPARK-28576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ke Jia
>Priority: Major
> Attachments: jstack.log, physical plan.txt, q6.sql
>
>
> After enable 
> AE([lSPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]), we 
> found the dead lock issue in Q6 1TB TPC-DS. The root cause is that the 
> subquery thread and the main thread is waiting for the same object.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28576) Fix the dead lock issue when enable new adaptive execution

2019-07-30 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28576:
---
Attachment: q6.sql
physical plan.txt
jstack.log

> Fix the dead lock issue  when enable new adaptive execution
> ---
>
> Key: SPARK-28576
> URL: https://issues.apache.org/jira/browse/SPARK-28576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ke Jia
>Priority: Major
> Attachments: jstack.log, physical plan.txt, q6.sql
>
>
> After enable AE, we found the dead lock issue in Q6 1TB TPC-DS. The root 
> cause is that the subquery thread and the main thread is waiting for the same 
> object. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28576) Fix the dead lock issue when enable new adaptive execution

2019-07-30 Thread Ke Jia (JIRA)
Ke Jia created SPARK-28576:
--

 Summary: Fix the dead lock issue  when enable new adaptive 
execution
 Key: SPARK-28576
 URL: https://issues.apache.org/jira/browse/SPARK-28576
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ke Jia


After enable AE, we found the dead lock issue in Q6 1TB TPC-DS. The root cause 
is that the subquery thread and the main thread is waiting for the same object. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2019-07-29 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28560:
---
Description: Implement a rule in the new adaptive execution framework 
introduced in SPARK-23128. This rule is used to optimize the shuffle reader to 
local shuffle reader when smj is converted to bhj in adaptive execution.  (was: 
Implement a rule in the new adaptive execution framework introduced in 
[SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
aim to optimize the shuffle reader to local shuffle reader when smj is 
converted to bhj in adaptive execution.)

> Optimize shuffle reader to local shuffle reader when smj converted to bhj in 
> adaptive execution
> ---
>
> Key: SPARK-28560
> URL: https://issues.apache.org/jira/browse/SPARK-28560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ke Jia
>Priority: Major
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule is used to optimize the shuffle reader to local 
> shuffle reader when smj is converted to bhj in adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution

2019-07-29 Thread Ke Jia (JIRA)
Ke Jia created SPARK-28560:
--

 Summary: Optimize shuffle reader to local shuffle reader when smj 
converted to bhj in adaptive execution
 Key: SPARK-28560
 URL: https://issues.apache.org/jira/browse/SPARK-28560
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ke Jia


Implement a rule in the new adaptive execution framework introduced in 
[SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
aim to optimize the shuffle reader to local shuffle reader when smj is 
converted to bhj in adaptive execution.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28046:
---
Attachment: image-2019-06-14-10-34-53-379.png

> OOM caused by building hash table when the compressed ratio of small table is 
> normal
> 
>
> Key: SPARK-28046
> URL: https://issues.apache.org/jira/browse/SPARK-28046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ke Jia
>Priority: Major
> Attachments: image-2019-06-14-10-34-53-379.png
>
>
> Currently, Spark will convert a sort merge join to a broadcast hash join when 
> the small table's compressed size is <= the broadcast threshold. Like Spark, AE 
> also converts the SMJ to BHJ based on the compressed size at runtime. In our 
> test, when enabling AE with a 32MB broadcast threshold, one SMJ with a 16MB 
> compressed size was converted to BHJ. However, when building the hash table, 
> the 16MB small table decompressed to 2GB and had 134485048 rows, containing a 
> large amount of continuous and repeated values. Therefore, the following OOM 
> exception occurs when building the hash table:
> !image-2019-06-14-10-29-00-499.png!
> Based on this finding, it may not be reasonable to decide whether an SMJ is 
> converted to BHJ only by the compressed size. In AE, we add a condition using 
> the estimated decompressed size based on the row count. In Spark, we may also 
> need to consider the decompressed size or the row count, not only the 
> compressed size, when converting the SMJ to BHJ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-28046:
---
Description: 
Currently, Spark will convert a sort merge join to a broadcast hash join when 
the small table's compressed size is <= the broadcast threshold. Like Spark, AE 
also converts the SMJ to BHJ based on the compressed size at runtime. In our 
test, when enabling AE with a 32MB broadcast threshold, one SMJ with a 16MB 
compressed size was converted to BHJ. However, when building the hash table, the 
16MB small table decompressed to 2GB and had 134485048 rows, containing a large 
amount of continuous and repeated values. Therefore, the following OOM exception 
occurs when building the hash table:

!image-2019-06-14-10-34-53-379.png!

Based on this finding, it may not be reasonable to decide whether an SMJ is 
converted to BHJ only by the compressed size. In AE, we add a condition using the 
estimated decompressed size based on the row count. In Spark, we may also need to 
consider the decompressed size or the row count, not only the compressed size, 
when converting the SMJ to BHJ.

  was:
Currently, spark will convert the sort merge join to broadcast hash join when 
the small table compressed  size <= the broadcast threshold.  Same with Spark, 
AE also convert the smj to bhj based on the compressed size in runtime.  In our 
test, when enable ae with 32M broadcast threshold, one smj with 16M compressed 
size is converted to bhj. However, when building the hash table, the 16M small 
table is decompressed with 2GB size and has 134485048 row count, which has a 
mount of continuous and repeated values. Therefore, the following OOM exception 
occurs when building hash table:

!image-2019-06-14-10-29-00-499.png!

And based on this founding , it may be not reasonable to decide whether smj be 
converted to bhj only by the compressed size. In AE, we add the condition with 
the estimation  decompressed size based on the row counts. And in spark, we may 
also need the decompressed size or row counts condition judgment not only the 
compressed size when converting the smj to bhj.


> OOM caused by building hash table when the compressed ratio of small table is 
> normal
> 
>
> Key: SPARK-28046
> URL: https://issues.apache.org/jira/browse/SPARK-28046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ke Jia
>Priority: Major
> Attachments: image-2019-06-14-10-34-53-379.png
>
>
> Currently, spark will convert the sort merge join to broadcast hash join when 
> the small table compressed  size <= the broadcast threshold.  Same with 
> Spark, AE also convert the smj to bhj based on the compressed size in 
> runtime.  In our test, when enable ae with 32M broadcast threshold, one smj 
> with 16M compressed size is converted to bhj. However, when building the hash 
> table, the 16M small table is decompressed with 2GB size and has 134485048 
> row count, which has a mount of continuous and repeated values. Therefore, 
> the following OOM exception occurs when building hash table:
> !image-2019-06-14-10-34-53-379.png!
> And based on this founding , it may be not reasonable to decide whether smj 
> be converted to bhj only by the compressed size. In AE, we add the condition 
> with the estimation  decompressed size based on the row counts. And in spark, 
> we may also need the decompressed size or row counts condition judgment not 
> only the compressed size when converting the smj to bhj.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28046) OOM caused by building hash table when the compressed ratio of small table is normal

2019-06-13 Thread Ke Jia (JIRA)
Ke Jia created SPARK-28046:
--

 Summary: OOM caused by building hash table when the compressed 
ratio of small table is normal
 Key: SPARK-28046
 URL: https://issues.apache.org/jira/browse/SPARK-28046
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Ke Jia


Currently, Spark converts a sort merge join to a broadcast hash join when the 
small table's compressed size <= the broadcast threshold. Like Spark, AE also 
converts the smj to a bhj based on the compressed size at runtime. In our test, 
with AE enabled and a 32M broadcast threshold, one smj with a 16M compressed 
size was converted to a bhj. However, when building the hash table, the 16M 
small table decompressed to 2GB and had 134485048 rows, containing a large 
number of contiguous and repeated values. As a result, the following OOM 
exception occurred when building the hash table:

!image-2019-06-14-10-29-00-499.png!

Based on this finding, it may not be reasonable to decide whether an smj is 
converted to a bhj by the compressed size alone. In AE, we added a condition on 
the estimated decompressed size derived from the row count. Spark may also need 
a check on the decompressed size or row count, not only the compressed size, 
when converting an smj to a bhj.
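
As a rough illustration of the proposed extra condition, here is a hedged sketch; 
the method name, the per-row size estimate, and the in-memory cap are all invented 
for this note and are not Spark's actual rule.

{code:scala}
// Hypothetical sketch of the extra guard proposed above: besides the compressed
// size check, also bound an estimated in-memory (decompressed) size computed
// from the row count, so a 16M compressed table that expands to ~2GB of rows
// is not broadcast.
object BroadcastDecisionSketch {
  def canConvertToBroadcastHashJoin(
      compressedSizeInBytes: Long,
      rowCount: Long,
      estimatedRowSizeInBytes: Long,
      broadcastThreshold: Long,
      maxInMemoryBytes: Long): Boolean = {
    val estimatedDecompressedSize = rowCount * estimatedRowSizeInBytes
    compressedSizeInBytes <= broadcastThreshold &&
      estimatedDecompressedSize <= maxInMemoryBytes
  }

  def main(args: Array[String]): Unit = {
    val threshold = 32L * 1024 * 1024   // the 32M broadcast threshold from this test
    val maxInMem  = 8L * threshold      // invented cap on the built hash table size
    // The case reported here: 16M compressed, 134485048 rows.
    println(canConvertToBroadcastHashJoin(
      compressedSizeInBytes   = 16L * 1024 * 1024,
      rowCount                = 134485048L,
      estimatedRowSizeInBytes = 16L,
      broadcastThreshold      = threshold,
      maxInMemoryBytes        = maxInMem)) // false: the row count makes it too large
  }
}
{code}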



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-17 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744791#comment-16744791
 ] 

Ke Jia commented on SPARK-26639:


[~mgaido] Yes, the current master also shows the behavior described above.

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744620#comment-16744620
 ] 

Ke Jia commented on SPARK-26639:


[~hyukjin.kwon] Thanks for your interest. 

As discussed in [https://github.com/apache/spark/pull/14548], when I run Q23b 
of TPC-DS, the visualized plan does show the subquery executed only once, as 
shown below.

!https://user-images.githubusercontent.com/11972570/51232955-813af880-19a3-11e9-9d1c-96bb9de0c130.png!

However, the stages show that the same subquery may be executed more than once:

!https://user-images.githubusercontent.com/11972570/51233118-fb6b7d00-19a3-11e9-9b48-9cebfb74ebd1.png!

So I suspect subquery reuse does not actually take effect here. Maybe I am 
missing something. Thanks.
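
To make the scenario easier to reproduce, here is a hedged, self-contained sketch 
(the table and data are made up; Q23b itself is not reproduced) of a query in which 
the same scalar subquery appears twice, so with subquery reuse it should only be 
evaluated once:

{code:scala}
// Sketch: the same scalar subquery is referenced twice. If subquery reuse works,
// the physical plan should contain a single subquery that is reused, and the
// corresponding stages should run only once.
import org.apache.spark.sql.SparkSession

object SubqueryReuseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("subquery-reuse-sketch")
      .master("local[*]")
      .getOrCreate()

    spark.range(0L, 1000000L)
      .selectExpr("id", "id % 7 as amount")
      .createOrReplaceTempView("sales")

    val q = spark.sql(
      """SELECT id, amount
        |FROM sales
        |WHERE amount > (SELECT avg(amount) FROM sales)
        |   OR id     > (SELECT avg(amount) FROM sales)
        |""".stripMargin)

    q.explain()  // check whether the repeated subquery shows up as reused in the plan
    q.count()    // then compare against the stages actually run in the web UI
    spark.stop()
  }
}
{code}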

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744606#comment-16744606
 ] 

Ke Jia commented on SPARK-26639:


[@davies|https://github.com/davies] [@hvanhovell|https://github.com/hvanhovell] 
[@gatorsmile|https://github.com/gatorsmile] can you help verify this issue? 
Thanks for your help!

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26639:
---
Description: 
The subquery reuse feature was implemented in 
[https://github.com/apache/spark/pull/14548]

In my test, I found the visualized plan does show the subquery executed once, 
but the stages show the same subquery may be executed more than once.

 

  was:
The subquery reuse feature has done in 
[PR#14548|[https://github.com/apache/spark/pull/14548]]

In my test, I found the visualized plan do show the subquery is executed once. 
But the stage of same subquery execute maybe not once.

 


> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26639:
---
Description: 
The subquery reuse feature has done in 
[PR#14548|[https://github.com/apache/spark/pull/14548]]

In my test, I found the visualized plan do show the subquery is executed once. 
But the stage of same subquery execute maybe not once.

 

  was:
The subquery reuse feature has done in 
[14548|[https://github.com/apache/spark/pull/14548]|https://github.com/apache/spark/pull/14548].]

In my test, I found the visualized plan do show the subquery is executed once. 
But the stage of same subquery execute maybe not once.

 


> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [PR#14548|[https://github.com/apache/spark/pull/14548]]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26639:
---
Description: 
The subquery reuse feature has done in 
[14548|[https://github.com/apache/spark/pull/14548]|https://github.com/apache/spark/pull/14548].]

In my test, I found the visualized plan do show the subquery is executed once. 
But the stage of same subquery execute maybe not once.

 

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature has done in 
> [14548|[https://github.com/apache/spark/pull/14548]|https://github.com/apache/spark/pull/14548].]
> In my test, I found the visualized plan do show the subquery is executed 
> once. But the stage of same subquery execute maybe not once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-16 Thread Ke Jia (JIRA)
Ke Jia created SPARK-26639:
--

 Summary: The reuse subquery function maybe does not work in SPARK 
SQL
 Key: SPARK-26639
 URL: https://issues.apache.org/jira/browse/SPARK-26639
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 2.4.0, 2.3.2
Reporter: Ke Jia






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16958) Reuse subqueries within single query

2019-01-15 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743720#comment-16743720
 ] 

Ke Jia edited comment on SPARK-16958 at 1/16/19 7:39 AM:
-

[~davies] hi, I left some comments in 
[https://github.com/apache/spark/pull/14548]. Can you help to verify whether 
the subquery is reused or not? Thanks for your help!


was (Author: jk_self):
[~davies] hi, I left some comments in the 
PR[l14548|[https://github.com/apache/spark/pull/14548]]. Can you help to verify 
whether the subquery is reused or not? Thanks for your help!

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
> Fix For: 2.1.0
>
>
> There could be same subquery within a single query, we could reuse the result 
> without running it multiple times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16958) Reuse subqueries within single query

2019-01-15 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743720#comment-16743720
 ] 

Ke Jia commented on SPARK-16958:


[~davies] Hi, I left some comments in 
[PR #14548|https://github.com/apache/spark/pull/14548]. Can you help verify 
whether the subquery is reused or not? Thanks for your help!

> Reuse subqueries within single query
> 
>
> Key: SPARK-16958
> URL: https://issues.apache.org/jira/browse/SPARK-16958
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
> Fix For: 2.1.0
>
>
> There could be same subquery within a single query, we could reuse the result 
> without running it multiple times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-09 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26316:
---
Description: The code at 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
 in SPARK-21052 causes performance degradation in Spark 2.3. The results of all 
TPC-DS queries at 1TB scale are in [TPC-DS 
result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]
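
For intuition only, here is a hedged, self-contained sketch of why per-lookup 
bookkeeping in a probe path can be costly. This does not reproduce the actual code 
at L486/L487; it only shows counter updates of that general shape sitting on a hot 
loop, with all names invented for illustration.

{code:scala}
// Hypothetical illustration: the same probe loop with and without per-lookup
// counter updates. When the loop body is otherwise cheap, even two extra
// increments per probed row sit squarely on the hot path.
object ProbeMetricSketch {
  final class FakeHashMap(slots: Int) {
    private val values = new Array[Long](slots)
    var numKeyLookups = 0L
    var numProbes = 0L
    def getWithMetrics(key: Long): Long = {
      numKeyLookups += 1
      numProbes += 1
      values((key & (slots - 1)).toInt)
    }
    def get(key: Long): Long = values((key & (slots - 1)).toInt)
  }

  private def time(label: String)(body: => Long): Unit = {
    val start  = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms (result=$result)")
  }

  def main(args: Array[String]): Unit = {
    val map    = new FakeHashMap(1 << 20)   // power-of-two slot count for masking
    val probes = 100000000L
    time("probe without metrics") { var i = 0L; var s = 0L; while (i < probes) { s += map.get(i); i += 1 }; s }
    time("probe with metrics")    { var i = 0L; var s = 0L; while (i < probes) { s += map.getWithMetrics(i); i += 1 }; s }
  }
}
{code}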

> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code of  
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in  SPARK-21052 cause performance degradation in spark2.3. The result of  
> all queries in TPC-DS with 1TB is in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-09 Thread Ke Jia (JIRA)
Ke Jia created SPARK-26316:
--

 Summary: Because of the perf degradation in TPC-DS, we currently 
partial revert SPARK-21052:Add hash map metrics to join,
 Key: SPARK-26316
 URL: https://issues.apache.org/jira/browse/SPARK-26316
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Ke Jia






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-12-08 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713847#comment-16713847
 ] 

Ke Jia commented on SPARK-26155:


Uploaded the results of all TPC-DS queries at 1TB data scale.

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql, tpcds.result.xlsx
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-12-08 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: tpcds.result.xlsx

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql, tpcds.result.xlsx
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-12-03 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708223#comment-16708223
 ] 

Ke Jia commented on SPARK-26155:


[~cloud_fan] [~viirya] Spark 2.3 with the optimized patch achieves roughly the same 
performance as Spark 2.1. 
||spark2.1||spark2.3 with patch||
|49s|47s|

 

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-12-03 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706837#comment-16706837
 ] 

Ke Jia commented on SPARK-26155:


[~cloud_fan] Sorry for the delay. The revert PR is 
[23204|https://github.com/apache/spark/pull/23204].

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-28 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701533#comment-16701533
 ] 

Ke Jia commented on SPARK-26155:


[~viirya] [~adrian-wang] 

Here are the results for Spark 2.1, Spark 2.3, and Spark 2.4 on the same cluster 
as described above, except the number of worker nodes was reduced from 7 to 4.
||spark2.1||spark2.3 without L486&487||spark2.3 with L486&487||spark2.4||
|49s|47s|307s|270s|

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701317#comment-16701317
 ] 

Ke Jia commented on SPARK-26155:


[~viirya] Thanks for your reply. 

> "Q19 analysis in Spark2.3 without L486 & 487.pdf" has Stage time and DAG in 
>Spark 2.1, but the document title is Spark 2.3. Which version Spark is used 
>for it?

My Spark version is 2.3. The "Stage time and DAG in Spark2.1" label was my 
mistake, and I have re-uploaded the corrected document.

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: (was: Q19 analysis in Spark2.3 without L486 & 487.pdf)

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-27 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: Q19 analysis in Spark2.3 without L486&487.pdf

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486&487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-23 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Attachment: q19.sql
Q19 analysis in Spark2.3 without L486 & 487.pdf
Q19 analysis in Spark2.3 with L486&487.pdf

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
> Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis 
> in Spark2.3 without L486 & 487.pdf, q19.sql
>
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-23 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696476#comment-16696476
 ] 

Ke Jia edited comment on SPARK-26155 at 11/23/18 7:58 AM:
--

*Cluster info:*
| |*Master Node*|*Worker Nodes*|
|*Node*|1x |7x|
|*Processor*|Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz|Intel(R) Xeon(R) 
Platinum 8180 CPU @ 2.50GHz|
|*Memory*|192 GB|384 GB|
|*Storage Main*|8 x 960G SSD|8 x 960G SSD|
|*Network*|10Gbe|
|*Role*|CM Management 
 NameNode
 Secondary NameNode
 Resource Manager
 Hive Metastore Server|DataNode
 NodeManager|
|*OS Version*|CentOS 7.2| CentOS 7.2|
|*Hadoop*|Apache Hadoop 2.7.5| Apache Hadoop 2.7.5|
|*Hive*|Apache Hive 2.2.0| |
|*Spark*|Apache Spark 2.1.0  & Apache Spark2.3.0| |
|*JDK  version*|1.8.0_112| 1.8.0_112|

*Related parameters setting:*
|*Component*|*Parameter*|*Value*|
|*Yarn Resource Manager*|yarn.scheduler.maximum-allocation-mb|40GB|
| |yarn.scheduler.minimum-allocation-mb|1GB|
| |yarn.scheduler.maximum-allocation-vcores|121|
| |Yarn.resourcemanager.scheduler.class|Fair Scheduler|
|*Yarn Node Manager*|yarn.nodemanager.resource.memory-mb|40GB|
| |yarn.nodemanager.resource.cpu-vcores|121|
|*Spark*|spark.executor.memory|34GB|
| |spark.executor.cores|40|
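
For reference, a minimal sketch of how the Spark rows of the table above could be 
applied when the session is built inside a Spark application; the object and app 
names are invented, and the YARN rows are cluster-side settings not shown here.

{code:scala}
// Minimal sketch: the spark.executor.* rows of the table above applied when the
// session is first built (executor settings must be in place before the
// SparkContext starts; the YARN rows are cluster-side configuration).
import org.apache.spark.sql.SparkSession

object Tpcds3TbSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tpcds-3tb")
      .config("spark.executor.memory", "34g")
      .config("spark.executor.cores", "40")
      .getOrCreate()
    // ... run the TPC-DS queries with this session ...
    spark.stop()
  }
}
{code}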


was (Author: jk_self):
*Cluster info:*
| |*Master Node*|*Worker Nodes* |
|*Node*|1x |7x|
|*Processor*|Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz|Intel(R) Xeon(R) 
Platinum 8180 CPU @ 2.50GHz|
|*Memory*|192 GB|384 GB|
|*Storage Main*|8 x 960G SSD|8 x 960G SSD|
|*Network*|10Gbe|
|*Role*|CM Management 
 NameNode
 Secondary NameNode
 Resource Manager
 Hive Metastore Server|DataNode
 NodeManager|
|*OS Version*|CentOS 7.2|
|*Hadoop*|Apache Hadoop 2.7.5|
|*Hive*|Apache Hive 2.2.0|
|*Spark*|Apache Spark 2.1.0  VS Apache Spark2.3.0|
|*JDK  version*|1.8.0_112|

*Related parameters setting:*
|*Component*|*Parameter*|*Value*|
|*Yarn Resource Manager*|yarn.scheduler.maximum-allocation-mb|40GB|
|yarn.scheduler.minimum-allocation-mb|1GB|
|yarn.scheduler.maximum-allocation-vcores|121|
|Yarn.resourcemanager.scheduler.class|Fair Scheduler|
|*Yarn Node Manager*|yarn.nodemanager.resource.memory-mb|40GB|
|yarn.nodemanager.resource.cpu-vcores|121|
|*Spark*|spark.executor.memory|34GB|
|spark.executor.cores|40|

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-22 Thread Ke Jia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696476#comment-16696476
 ] 

Ke Jia commented on SPARK-26155:


*Cluster info:*
| |*Master Node*|*Worker Nodes* |
|*Node*|1x |7x|
|*Processor*|Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz|Intel(R) Xeon(R) 
Platinum 8180 CPU @ 2.50GHz|
|*Memory*|192 GB|384 GB|
|*Storage Main*|8 x 960G SSD|8 x 960G SSD|
|*Network*|10Gbe|
|*Role*|CM Management 
 NameNode
 Secondary NameNode
 Resource Manager
 Hive Metastore Server|DataNode
 NodeManager|
|*OS Version*|CentOS 7.2|
|*Hadoop*|Apache Hadoop 2.7.5|
|*Hive*|Apache Hive 2.2.0|
|*Spark*|Apache Spark 2.1.0  VS Apache Spark2.3.0|
|*JDK  version*|1.8.0_112|

*Related parameters setting:*
|*Component*|*Parameter*|*Value*|
|*Yarn Resource Manager*|yarn.scheduler.maximum-allocation-mb|40GB|
|yarn.scheduler.minimum-allocation-mb|1GB|
|yarn.scheduler.maximum-allocation-vcores|121|
|Yarn.resourcemanager.scheduler.class|Fair Scheduler|
|*Yarn Node Manager*|yarn.nodemanager.resource.memory-mb|40GB|
|yarn.nodemanager.resource.cpu-vcores|121|
|*Spark*|spark.executor.memory|34GB|
|spark.executor.cores|40|

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale

2018-11-22 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Summary: Spark SQL  performance degradation after apply SPARK-21052 with 
Q19 of TPC-DS in 3TB scale  (was: Spark SQL  performance degradation after 
apply SPARK-21052 with Q19 of TPC-DS in 2.6TB scale)

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 3TB scale
> --
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 2.6TB scale

2018-11-22 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Description: In our test environment, we found a serious performance 
degradation issue in Spark2.3 when running TPC-DS on SKX 8180. Several queries 
have serious performance degradation. For example, TPC-DS Q19 needs 126 seconds 
with Spark 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We 
investigated this problem and figured out the root cause is in community patch 
SPARK-21052 which add metrics to hash join process. And the impact code is 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
  . Q19 costs about 30 seconds without these two lines code and 126 seconds 
with these code.  (was: In our test environment, we found a serious performance 
degradation issue in Spark2.3 when running TPC-DS on SKX 8180. Several queries 
have serious performance degradation. For example, TPC-DS Q19 needs 126 seconds 
with Spark 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We 
investigated this problem and figured out the root cause is in community patch 
SPARK-21052 which add metrics to hash join process. And the impact code is 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
  . Q19 cost  about 30 seconds without these two lines code and 126 seconds 
with these code.)

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 2.6TB scale
> 
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 costs about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 2.6TB scale

2018-11-22 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Description: In our test environment, we found a serious performance 
degradation issue in Spark2.3 when running TPC-DS on SKX 8180. Several queries 
have serious performance degradation. For example, TPC-DS Q19 needs 126 seconds 
with Spark 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We 
investigated this problem and figured out the root cause is in community patch 
SPARK-21052 which add metrics to hash join process. And the impact code is 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
  . Q19 cost  about 30 seconds without these two lines code and 126 seconds 
with these code.  (was: In our test environment, we found a serious performance 
degradation issue in Spark2.3 when running TPC-DS on SKX 8180. Several queries 
have serious performance degradation. For example, TPC-DS Q19 needs 126 seconds 
with Spark 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We 
investigated this problem and figured out the root cause is in community patch 
SPARK-21052 which add metrics to hash join process. And the impact code is 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
 . Q19 cost  about 30 seconds without these two lines code and 126 seconds with 
these code.)

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 2.6TB scale
> 
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>   . Q19 cost  about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 2.6TB scale

2018-11-22 Thread Ke Jia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-26155:
---
Description: In our test environment, we found a serious performance 
degradation issue in Spark2.3 when running TPC-DS on SKX 8180. Several queries 
have serious performance degradation. For example, TPC-DS Q19 needs 126 seconds 
with Spark 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We 
investigated this problem and figured out the root cause is in community patch 
SPARK-21052 which add metrics to hash join process. And the impact code is 
[L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
 and 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
 
[L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
 . Q19 cost  about 30 seconds without these two lines code and 126 seconds with 
these code.

> Spark SQL  performance degradation after apply SPARK-21052 with Q19 of TPC-DS 
> in 2.6TB scale
> 
>
> Key: SPARK-26155
> URL: https://issues.apache.org/jira/browse/SPARK-26155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> In our test environment, we found a serious performance degradation issue in 
> Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious 
> performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark 
> 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated 
> this problem and figured out the root cause is in community patch SPARK-21052 
> which add metrics to hash join process. And the impact code is 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  . Q19 cost  about 30 seconds without these two lines code and 126 seconds 
> with these code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 2.6TB scale

2018-11-22 Thread Ke Jia (JIRA)
Ke Jia created SPARK-26155:
--

 Summary: Spark SQL  performance degradation after apply 
SPARK-21052 with Q19 of TPC-DS in 2.6TB scale
 Key: SPARK-26155
 URL: https://issues.apache.org/jira/browse/SPARK-26155
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Ke Jia






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2016-03-29 Thread Ke Jia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-8321:
--
Attachment: SparkSQLauthorizationDesignDocument.pdf

> Authorization Support(on all operations not only DDL) in Spark Sql
> --
>
> Key: SPARK-8321
> URL: https://issues.apache.org/jira/browse/SPARK-8321
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Sunil
> Attachments: SparkSQLauthorizationDesignDocument.pdf, 
> SparkSQLauthorizationDesignDocument.pdf
>
>
> Currently If you run Spark SQL with thrift server it only support 
> Authentication and limited authorization support(DDL). Want to extend it to 
> provide full authorization or provide a plug able authorization like Apache 
> sentry so that user with proper roles can access data.
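
As a purely illustrative sketch of what a pluggable check could look like: the 
trait, method name, and rules below are invented for this note and are not taken 
from the attached design document or any existing integration.

{code:scala}
// Hypothetical plug-in point: an authorizer consulted before a statement runs.
// A Sentry- or Ranger-style integration would supply its own implementation;
// the trait, method, and rules here are invented purely for illustration.
trait SqlAuthorizer {
  def checkPrivileges(user: String, operation: String, objects: Seq[String]): Unit
}

object AllowListAuthorizer extends SqlAuthorizer {
  private val allowed: Map[String, Set[String]] =
    Map("alice" -> Set("SELECT"), "admin" -> Set("SELECT", "INSERT", "DROP"))

  override def checkPrivileges(user: String, operation: String, objects: Seq[String]): Unit = {
    if (!allowed.getOrElse(user, Set.empty[String]).contains(operation)) {
      throw new SecurityException(s"$user may not $operation on ${objects.mkString(", ")}")
    }
  }
}
{code}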



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8321) Authorization Support(on all operations not only DDL) in Spark Sql

2015-12-17 Thread Ke Jia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-8321:
--
Attachment: SparkSQLauthorizationDesignDocument.pdf

> Authorization Support(on all operations not only DDL) in Spark Sql
> --
>
> Key: SPARK-8321
> URL: https://issues.apache.org/jira/browse/SPARK-8321
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.3.0
>Reporter: Sunil
> Attachments: SparkSQLauthorizationDesignDocument.pdf
>
>
> Currently If you run Spark SQL with thrift server it only support 
> Authentication and limited authorization support(DDL). Want to extend it to 
> provide full authorization or provide a plug able authorization like Apache 
> sentry so that user with proper roles can access data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org