[jira] [Updated] (SPARK-40067) Add table name to Spark plan node in SparkUI

2022-08-12 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-40067:
---
Description: 
[SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced the 
`Scan#name()` API to expose the name of the TableScan in the `BatchScan` node 
in SparkUI.
However, a better suggestion was to use `Table#name()` instead. Furthermore, we 
can also extract other useful information from `Table`; thus, revert 
[SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use `Table` 
to fetch the relevant information.
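A minimal sketch of the proposed direction, using trimmed-down stand-ins for the DSv2 connector interfaces (the `Table` shape, `BatchScanNode`, and the `db.events` identifier below are illustrative, not Spark's actual classes): deriving the UI header from the `Table` gives the node a stable identifier, and the same handle could later expose more table metadata.

```java
// Trimmed-down stand-in for Spark's DSv2 Table interface
// (org.apache.spark.sql.connector.catalog.Table carries far more methods).
interface Table {
    String name(); // table identifier, e.g. "db.events"
}

// A BatchScanExec-like plan node that derives its UI header from the Table,
// so the SparkUI graph can render "BatchScan db.events" instead of a bare
// "BatchScan". Holding the Table (not just a name string) leaves room to
// surface other table metadata later, as the description suggests.
final class BatchScanNode {
    private final Table table;

    BatchScanNode(Table table) {
        this.table = table;
    }

    String nodeName() {
        return "BatchScan " + table.name();
    }
}

public class PlanNodeNameSketch {
    public static void main(String[] args) {
        Table t = () -> "db.events";
        BatchScanNode node = new BatchScanNode(t);
        System.out.println(node.nodeName()); // prints "BatchScan db.events"
    }
}
```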

> Add table name to Spark plan node in SparkUI
> 
>
> Key: SPARK-40067
> URL: https://issues.apache.org/jira/browse/SPARK-40067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Sumeet
>Priority: Major
>
> [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] introduced the 
> `Scan#name()` API to expose the name of the TableScan in the `BatchScan` node 
> in SparkUI.
> However, a better suggestion was to use `Table#name()` instead. Furthermore, we 
> can also extract other useful information from `Table`; thus, revert 
> [SPARK-39902|https://issues.apache.org/jira/browse/SPARK-39902] and use 
> `Table` to fetch the relevant information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40067) Add table name to Spark plan node in SparkUI

2022-08-12 Thread Sumeet (Jira)
Sumeet created SPARK-40067:
--

 Summary: Add table name to Spark plan node in SparkUI
 Key: SPARK-40067
 URL: https://issues.apache.org/jira/browse/SPARK-40067
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.4.0
Reporter: Sumeet









[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-08-01 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574029#comment-17574029
 ] 

Sumeet commented on SPARK-39902:


Hi [~dongjoon] - Noted.

Thank you for updating the ticket and merging the changes.

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.4.0
>Reporter: Sumeet
>Assignee: Sumeet
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!






[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572193#comment-17572193
 ] 

Sumeet commented on SPARK-39902:


An example of this change can be seen while viewing the Iceberg scans on 
SparkUI.
h2. Before this change:

!Screen Shot 2022-07-27 at 6.39.48 PM.png|width=415,height=211!

 
h2. After this change:

!Screen Shot 2022-07-27 at 6.38.56 PM.png|width=430,height=216!

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.39.48 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.38.56 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png, Screen Shot 2022-07-27 at 6.38.56 PM.png, 
> Screen Shot 2022-07-27 at 6.39.48 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Description: 
Hi,

For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface that "BatchScanExec" can 
invoke to set the node name in the plan. This nodeName will eventually be used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

!Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!

 

DSv2

!Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!

  was:
Hi,

For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface that "BatchScanExec" can 
invoke to set the node name in the plan. This nodeName will eventually be used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

 


> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
> !Screen Shot 2022-07-27 at 6.00.27 PM.png|width=356,height=212!
>  
> DSv2
> !Screen Shot 2022-07-27 at 6.00.50 PM.png|width=293,height=277!






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.50 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png, Screen Shot 
> 2022-07-27 at 6.00.50 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
> Attachments: Screen Shot 2022-07-27 at 6.00.27 PM.png
>
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: (was: Screen Shot 2022-07-27 at 6.00.27 PM.png)

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Attachment: Screen Shot 2022-07-27 at 6.00.27 PM.png

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  






[jira] [Updated] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-39902:
---
Description: 
Hi,

For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface that "BatchScanExec" can 
invoke to set the node name in the plan. This nodeName will eventually be used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 

DSv1

 

  was:
Hi,

For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface that "BatchScanExec" can 
invoke to set the node name in the plan. This nodeName will eventually be used by 
"SparkPlanGraphNode" to display it in the header of the UI node.

 


> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface, that "BatchScanExec" can 
> invoke to set the node name the plan. This nodeName will be eventually used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  
> DSv1
>  






[jira] [Commented] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572186#comment-17572186
 ] 

Sumeet commented on SPARK-39902:


I'm working on it and will publish a patch soon.

> Add Scan details to spark plan scan node in SparkUI
> ---
>
> Key: SPARK-39902
> URL: https://issues.apache.org/jira/browse/SPARK-39902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.1
>Reporter: Sumeet
>Priority: Major
>
> Hi,
> For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
> as opposed to "Scan ".
> Add a method "String name()" to the Scan interface that "BatchScanExec" can 
> invoke to set the node name in the plan. This nodeName will eventually be used 
> by "SparkPlanGraphNode" to display it in the header of the UI node.
>  






[jira] [Created] (SPARK-39902) Add Scan details to spark plan scan node in SparkUI

2022-07-27 Thread Sumeet (Jira)
Sumeet created SPARK-39902:
--

 Summary: Add Scan details to spark plan scan node in SparkUI
 Key: SPARK-39902
 URL: https://issues.apache.org/jira/browse/SPARK-39902
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.3.1
Reporter: Sumeet


Hi,

For DSv2, the scan node in the Spark plan on SparkUI simply shows "BatchScan" 
as opposed to "Scan ".

Add a method "String name()" to the Scan interface that "BatchScanExec" can 
invoke to set the node name in the plan. This nodeName will eventually be used by 
"SparkPlanGraphNode" to display it in the header of the UI node.
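The proposed API shape can be sketched with simplified stand-ins (not the real org.apache.spark.sql.connector.read.Scan, which also carries readSchema() and other methods; `IcebergScanSketch` and the node-name strings are hypothetical): a default name() keeps existing connectors source-compatible, a connector overrides it, and the exec node forwards it for the graph-node header.

```java
// Simplified sketch of the proposed Scan#name() API.
interface Scan {
    // Default keeps existing DSv2 connectors source-compatible; they still
    // render as a plain "BatchScan" node unless they opt in.
    default String name() {
        return "BatchScan";
    }
}

// A connector's scan can override name() to surface what it actually reads
// (hypothetical example; not Iceberg's real class).
final class IcebergScanSketch implements Scan {
    @Override
    public String name() {
        return "IcebergScan db.events";
    }
}

// The exec node forwards name() so the UI graph node can show it as header.
final class BatchScanExecSketch {
    private final Scan scan;

    BatchScanExecSketch(Scan scan) {
        this.scan = scan;
    }

    String nodeName() {
        return scan.name();
    }
}

public class ScanNameSketch {
    public static void main(String[] args) {
        System.out.println(new BatchScanExecSketch(new IcebergScanSketch()).nodeName());
        System.out.println(new BatchScanExecSketch(new Scan() {}).nodeName());
    }
}
```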

 






[jira] [Commented] (SPARK-37175) Performance improvement to hash joins with many duplicate keys

2021-10-31 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436588#comment-17436588
 ] 

Sumeet commented on SPARK-37175:


I am working on this.

> Performance improvement to hash joins with many duplicate keys
> --
>
> Key: SPARK-37175
> URL: https://issues.apache.org/jira/browse/SPARK-37175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
> Attachments: hash_rel_examples.txt
>
>
> I noticed that HashedRelations with many duplicate keys perform significantly 
> slower than HashedRelations with similar number of entries but few or no 
> duplicate keys.
> A hypothesis:
>  * Because of the order in which rows are appended to the map, rows for a 
> given key are typically non-adjacent in memory, resulting in poor locality.
>  * The map would perform better if all rows for a given key are next to each 
> other in memory.
> To test this hypothesis, I made a [somewhat brute force change to 
> HashedRelation|https://github.com/apache/spark/compare/master...bersprockets:hash_rel_play]
>  to reorganize the map such that all rows for a given key are adjacent in 
> memory. This yielded some performance improvements, at least in my contrived 
> examples:
> (Run on an Intel-based MacBook Pro with 4 cores/8 hyperthreads):
> Example 1:
>  Shuffled Hash Join, LongHashedRelation:
>  Stream side: 300M rows
>  Build side: 90M rows, but only 200K unique keys
>  136G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled hash join (No reorganization)|1092| |
> |Shuffled hash join (with reorganization)|234|4.6 times faster than regular 
> SHJ|
> |Sort merge join|164|This beats the SHJ when there are lots of duplicate 
> keys, I presume because of better cache locality on both sides of the join|
> Example 2:
>  Broadcast Hash Join, LongHashedRelation:
>  Stream side: 350M rows
>  Build side 9M rows, but only 18K unique keys
>  175G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|872| |
> |Broadcast hash join (with reorganization)|263|3 times faster than regular 
> BHJ|
> |Sort merge join|174|This beats the BHJ when there are lots of duplicate 
> keys, I presume because of better cache locality on both sides of the join|
> Example 3:
>  Shuffled Hash Join, UnsafeHashedRelation
>  Stream side: 300M rows
>  Build side 90M rows, but only 200K unique keys
>  135G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Shuffled Hash Join (No reorganization)|3154| |
> |Shuffled Hash Join (with reorganization)|533|5.9 times faster|
> |Sort merge join|190|This beats the SHJ when there are lots of duplicate 
> keys, I presume because of better cache locality on both sides of the join|
> Example 4:
>  Broadcast Hash Join, UnsafeHashedRelation:
>  Stream side: 70M rows
>  Build side 9M rows, but only 18K unique keys
>  35G output rows
> |Join strategy|Time (in seconds)|Notes|
> |Broadcast hash join (No reorganization)|849| |
> |Broadcast hash join (with reorganization)|130|6.5 times faster|
> |Sort merge join|46|This beats the BHJ when there are lots of duplicate keys, 
> I presume because of better cache locality on both sides of the join|
> The code for these examples is attached here [^hash_rel_examples.txt]
> Even the brute force approach could be useful in production if
>  * Toggled by a feature flag
>  * Reorganizes only if the ratio of keys to rows drops below some threshold
>  * Falls back to using the original map if building the new map results in a 
> memory-related SparkException.
> Another incidental lesson is that sort merge join seems to outperform 
> broadcast hash join when the build side has lots of duplicate keys. So maybe 
> a long term improvement would be to avoid hash joins (broadcast or shuffle) 
> if there are many duplicate keys.
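The reorganization hypothesis above can be illustrated in miniature with on-heap collections standing in for Spark's off-heap HashedRelation (the names and row types here are illustrative, not the patch's actual code): the rewrite stores every row for a given key back to back, which is the locality the hypothesis credits for the speedup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReorganizeSketch {
    // Build-side rows arrive in arbitrary order, so rows sharing a join key
    // end up scattered in memory; probing a hot key then chases non-adjacent
    // rows. Reorganizing groups each key's rows contiguously before probing.
    static Map<Long, List<String>> reorganize(List<Map.Entry<Long, String>> rows) {
        // LinkedHashMap preserves first-seen key order; each value list holds
        // that key's rows back to back (the "adjacent in memory" property).
        Map<Long, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Long, String> r : rows) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Duplicate keys interleaved, as they would be in append order.
        List<Map.Entry<Long, String>> build = List.of(
            Map.entry(1L, "a"), Map.entry(2L, "b"),
            Map.entry(1L, "c"), Map.entry(2L, "d"), Map.entry(1L, "e"));
        // After reorganizing, all rows for key 1 are adjacent: [a, c, e].
        System.out.println(reorganize(build));
    }
}
```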



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.

2021-08-20 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-35011:
---
Fix Version/s: 3.0.4

> Avoid Block Manager registrations when StopExecutor msg is in-flight.
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Sumeet
>Assignee: Sumeet
>Priority: Major
>  Labels: BlockManager, core
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> *Note:* This is a follow-up on SPARK-34949; even after the heartbeat fix, the 
> driver reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
> executorEndpoint
>  * "CoarseGrainedSchedulerBackend" removes that executor from Driver's 
> internal data structures and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus".
>  * Executor has still not processed "StopExecutor" from the Driver
>  * Driver receives heartbeat from the Executor, since it cannot find the 
> "executorId" in its data structures, it responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" 
> and "SparkListenerBlockManagerAdded" is published on the "listenerBus"
>  * Executor starts processing the "StopExecutor" and exits
>  * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
> updates "AppStatusStore"
>  * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list 
> of executors which returns the dead executor as alive.
>  
> *Proposed Solution:*
> Maintain a Cache of recently removed executors on Driver. During the 
> registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
> recently removed executor, return None indicating the registration is ignored 
> since the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
> executor, return true indicating the driver knows about it, thereby 
> preventing re-registration.
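The proposed handshake can be modeled as a toy in-memory sketch (plain Java standing in for BlockManagerMasterEndpoint's actual bookkeeping; the class and method names are illustrative): registration from a recently removed executor is ignored, and its heartbeat is acknowledged so it never tries to re-register.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

public class RemovedExecutorCacheSketch {
    // Cache of executor IDs the driver has recently told to stop (the real
    // proposal would bound this by size/time; a plain Set keeps the sketch small).
    private final Set<String> recentlyRemoved = new HashSet<>();

    void onStopExecutor(String execId) {
        recentlyRemoved.add(execId);
    }

    // Registration is ignored (empty result) for an executor that is already
    // shutting down, so no stale BlockManager reappears in AppStatusStore.
    Optional<String> registerBlockManager(String execId) {
        if (recentlyRemoved.contains(execId)) {
            return Optional.empty();
        }
        return Optional.of("bm-" + execId);
    }

    // Heartbeat from a recently removed executor returns true ("driver knows
    // about you"), preventing the reregisterBlockManager=true response that
    // caused the race.
    boolean blockManagerHeartbeat(String execId) {
        return recentlyRemoved.contains(execId);
    }

    public static void main(String[] args) {
        RemovedExecutorCacheSketch driver = new RemovedExecutorCacheSketch();
        driver.onStopExecutor("exec-1");
        System.out.println(driver.registerBlockManager("exec-1").isPresent()); // false
        System.out.println(driver.blockManagerHeartbeat("exec-1"));            // true
    }
}
```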






[jira] [Commented] (SPARK-36065) date_trunc returns incorrect output

2021-07-13 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380050#comment-17380050
 ] 

Sumeet commented on SPARK-36065:


cc [~maxgekk]

> date_trunc returns incorrect output
> ---
>
> Key: SPARK-36065
> URL: https://issues.apache.org/jira/browse/SPARK-36065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sumeet
>Priority: Major
>  Labels: date_trunc, sql, timestamp
>
> Hi,
> Running date_trunc on any hour of "1891-10-01" returns incorrect output for 
> the "Europe/Bratislava" timezone.
> Use the following steps to reproduce the issue:
>  * Run spark-shell using:
> {code:java}
> TZ="Europe/Bratislava" ./bin/spark-shell --conf 
> spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.sql.session.timeZone="Europe/Bratislava"{code}
>  * Generate test data:
> {code:java}
> ((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until 
> 24).map(hour => s"1891-10-01 
> 00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts")
> {code}
>  * Run query:
> {code:java}
> sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', 
> ts_string) from temp_ts").show(false)
> {code}
>  * Output:
> {code:java}
> +-------------------+-------------------+--------------------------+
> |ts_string          |ts                 |date_trunc(day, ts_string)|
> +-------------------+-------------------+--------------------------+
> |1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16       |
> |1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16       |
> |1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16       |
> |1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16       |
> |1891-10-01 00:04:00|1891-10-01 00:04:00|1891-10-01 00:02:16       |
> |1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16       |
> |1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16       |
> |1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16       |
> |1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16       |
> |1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16       |
> |1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16       |
> |1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16       |
> |1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16       |
> |1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16       |
> |1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16       |
> |1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16       |
> |1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16       |
> |1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16       |
> |1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16       |
> |1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16       |
> +-------------------+-------------------+--------------------------+
> only showing top 20 rows
> {code}
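The 2m16s shift in the output above is consistent with a tz-database local-mean-time transition: Europe/Bratislava follows Prague's rules, which (in the tz database) move from a +00:57:44 offset to CET (+01:00) around 1891-10-01, and 1:00:00 minus 0:57:44 is exactly 2:16. This can be checked with java.time (a diagnostic sketch; the exact offsets printed depend on the JDK's bundled tzdata version):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class BratislavaOffsetCheck {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Bratislava");
        // Offsets on either side of the suspect date, 1891-10-01.
        ZoneOffset before = zone.getRules().getOffset(Instant.parse("1891-09-30T00:00:00Z"));
        ZoneOffset after  = zone.getRules().getOffset(Instant.parse("1891-10-02T00:00:00Z"));
        // The sub-minute offset before the transition is what trips up
        // timestamp arithmetic (casts and date_trunc) around that day.
        System.out.println(before + " -> " + after);
    }
}
```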






[jira] [Updated] (SPARK-36065) date_trunc returns incorrect output

2021-07-13 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-36065:
---
Affects Version/s: 3.2.0

> date_trunc returns incorrect output
> ---
>
> Key: SPARK-36065
> URL: https://issues.apache.org/jira/browse/SPARK-36065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Sumeet
>Priority: Major
>  Labels: date_trunc, sql, timestamp
>
> Hi,
> Running date_trunc on any hour of "1891-10-01" returns incorrect output for 
> the "Europe/Bratislava" timezone.
> Use the following steps to reproduce the issue:
>  * Run spark-shell using:
> {code:java}
> TZ="Europe/Bratislava" ./bin/spark-shell --conf 
> spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
> spark.sql.session.timeZone="Europe/Bratislava"{code}
>  * Generate test data:
> {code:java}
> ((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until 
> 24).map(hour => s"1891-10-01 
> 00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts")
> {code}
>  * Run query:
> {code:java}
> sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', 
> ts_string) from temp_ts").show(false)
> {code}
>  * Output:
> {code:java}
> +-------------------+-------------------+--------------------------+
> |ts_string          |ts                 |date_trunc(day, ts_string)|
> +-------------------+-------------------+--------------------------+
> |1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16       |
> |1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16       |
> |1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16       |
> |1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16       |
> |1891-10-01 00:04:00|1891-10-01 00:04:00|1891-10-01 00:02:16       |
> |1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16       |
> |1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16       |
> |1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16       |
> |1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16       |
> |1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16       |
> |1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16       |
> |1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16       |
> |1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16       |
> |1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16       |
> |1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16       |
> |1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16       |
> |1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16       |
> |1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16       |
> |1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16       |
> |1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16       |
> +-------------------+-------------------+--------------------------+
> only showing top 20 rows
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36065) date_trunc returns incorrect output

2021-07-08 Thread Sumeet (Jira)
Sumeet created SPARK-36065:
--

 Summary: date_trunc returns incorrect output
 Key: SPARK-36065
 URL: https://issues.apache.org/jira/browse/SPARK-36065
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Sumeet


Hi,

Running date_trunc on any hour of "1891-10-01" returns incorrect output for 
the "Europe/Bratislava" timezone.
Use the following steps to reproduce the issue:
 * Run spark-shell using:

{code:java}
TZ="Europe/Bratislava" ./bin/spark-shell --conf 
spark.driver.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
spark.executor.extraJavaOptions='-Duser.timezone=Europe/Bratislava' --conf 
spark.sql.session.timeZone="Europe/Bratislava"{code}

 * Generate test data:

{code:java}
((0 until 9).map(hour => s"1891-10-01 00:0$hour:00") ++ (10 until 24).map(hour 
=> s"1891-10-01 
00:$hour:00")).toDF("ts_string").createOrReplaceTempView("temp_ts")
{code}

 * Run query:

{code:java}
sql("select ts_string, cast(ts_string as TIMESTAMP) as ts, date_trunc('day', 
ts_string) from temp_ts").show(false)
{code}

 * Output:

{code:java}
+-------------------+-------------------+--------------------------+
|ts_string          |ts                 |date_trunc(day, ts_string)|
+-------------------+-------------------+--------------------------+
|1891-10-01 00:00:00|1891-10-01 00:02:16|1891-10-01 00:02:16       |
|1891-10-01 00:01:00|1891-10-01 00:03:16|1891-10-01 00:02:16       |
|1891-10-01 00:02:00|1891-10-01 00:04:16|1891-10-01 00:02:16       |
|1891-10-01 00:03:00|1891-10-01 00:03:00|1891-10-01 00:02:16       |
|1891-10-01 00:04:00|1891-10-01 00:04:00|1891-10-01 00:02:16       |
|1891-10-01 00:05:00|1891-10-01 00:05:00|1891-10-01 00:02:16       |
|1891-10-01 00:06:00|1891-10-01 00:06:00|1891-10-01 00:02:16       |
|1891-10-01 00:07:00|1891-10-01 00:07:00|1891-10-01 00:02:16       |
|1891-10-01 00:08:00|1891-10-01 00:08:00|1891-10-01 00:02:16       |
|1891-10-01 00:10:00|1891-10-01 00:10:00|1891-10-01 00:02:16       |
|1891-10-01 00:11:00|1891-10-01 00:11:00|1891-10-01 00:02:16       |
|1891-10-01 00:12:00|1891-10-01 00:12:00|1891-10-01 00:02:16       |
|1891-10-01 00:13:00|1891-10-01 00:13:00|1891-10-01 00:02:16       |
|1891-10-01 00:14:00|1891-10-01 00:14:00|1891-10-01 00:02:16       |
|1891-10-01 00:15:00|1891-10-01 00:15:00|1891-10-01 00:02:16       |
|1891-10-01 00:16:00|1891-10-01 00:16:00|1891-10-01 00:02:16       |
|1891-10-01 00:17:00|1891-10-01 00:17:00|1891-10-01 00:02:16       |
|1891-10-01 00:18:00|1891-10-01 00:18:00|1891-10-01 00:02:16       |
|1891-10-01 00:19:00|1891-10-01 00:19:00|1891-10-01 00:02:16       |
|1891-10-01 00:20:00|1891-10-01 00:20:00|1891-10-01 00:02:16       |
+-------------------+-------------------+--------------------------+
only showing top 20 rows
{code}
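For context, the 2-minute-16-second shift in the output lines up with the tz database's history for Europe/Bratislava: on 1891-10-01 the zone moves from local mean time (UTC+0:57:44) to CET (UTC+1:00), a difference of exactly 0:02:16. This is my reading of the tz data, not something stated in the ticket; a quick check with Python's standard zoneinfo module:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("Europe/Bratislava")

# Instants on either side of the 1891-10-01 transition from
# local mean time to CET in the tz database.
before = datetime(1891, 9, 30, 12, 0, tzinfo=tz)
after = datetime(1891, 10, 2, 12, 0, tzinfo=tz)

print(before.utcoffset())                      # local mean time offset
print(after.utcoffset())                       # CET, 1:00:00
print(after.utcoffset() - before.utcoffset())  # 0:02:16, matching the shift above
```

Pre-transition wall-clock timestamps get rebased by that 0:02:16 somewhere in the conversion path, which appears consistent with the first three rows of the output.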






[jira] [Commented] (SPARK-35429) Remove commons-httpclient due to EOL and CVEs

2021-06-14 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363131#comment-17363131
 ] 

Sumeet commented on SPARK-35429:


Re-opening this Jira since Spark upgraded to Hive 2.3.9, which no longer 
depends on commons-httpclient.

[https://github.com/apache/spark/commit/463daabd5afd9abfb8027ebcb2e608f169ad1e40]

> Remove commons-httpclient due to EOL and CVEs
> -
>
> Key: SPARK-35429
> URL: https://issues.apache.org/jira/browse/SPARK-35429
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Sumeet
>Priority: Major
>
> Spark is pulling in commons-httpclient as a dependency directly. See 
> dependency:tree:
> {code:java}
>  ./build/mvn dependency:tree | grep -i "commons-httpclient"   
> 
> Using `mvn` from path: 
> /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn
> [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:provided
> {code}
> commons-httpclient went EOL years ago, and vulnerabilities against it are 
> most likely no longer reported as CVEs, so we should remove it.






[jira] [Updated] (SPARK-35429) Remove commons-httpclient due to EOL and CVEs

2021-05-17 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-35429:
---
Affects Version/s: 3.2.0

> Remove commons-httpclient due to EOL and CVEs
> -
>
> Key: SPARK-35429
> URL: https://issues.apache.org/jira/browse/SPARK-35429
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Sumeet
>Priority: Major
>
> Spark is pulling in commons-httpclient as a dependency directly. See 
> dependency:tree:
> {code:java}
>  ./build/mvn dependency:tree | grep -i "commons-httpclient"   
> 
> Using `mvn` from path: 
> /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn
> [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:provided
> {code}
> commons-httpclient went EOL years ago, and vulnerabilities against it are 
> most likely no longer reported as CVEs, so we should remove it.






[jira] [Commented] (SPARK-35429) Remove commons-httpclient due to EOL and CVEs

2021-05-17 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346464#comment-17346464
 ] 

Sumeet commented on SPARK-35429:


I am working on this.

> Remove commons-httpclient due to EOL and CVEs
> -
>
> Key: SPARK-35429
> URL: https://issues.apache.org/jira/browse/SPARK-35429
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Sumeet
>Priority: Major
>
> Spark is pulling in commons-httpclient as a dependency directly. See 
> dependency:tree:
> {code:java}
>  ./build/mvn dependency:tree | grep -i "commons-httpclient"   
> 
> Using `mvn` from path: 
> /Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn
> [INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile
> [INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:provided
> {code}
> commons-httpclient went EOL years ago, and vulnerabilities against it are 
> most likely no longer reported as CVEs, so we should remove it.






[jira] [Created] (SPARK-35429) Remove commons-httpclient due to EOL and CVEs

2021-05-17 Thread Sumeet (Jira)
Sumeet created SPARK-35429:
--

 Summary: Remove commons-httpclient due to EOL and CVEs
 Key: SPARK-35429
 URL: https://issues.apache.org/jira/browse/SPARK-35429
 Project: Spark
  Issue Type: Task
  Components: Spark Core, SQL
Affects Versions: 3.1.1, 3.0.0
Reporter: Sumeet


Spark is pulling in commons-httpclient as a dependency directly. See 
dependency:tree:
{code:java}
 ./build/mvn dependency:tree | grep -i "commons-httpclient" 
  
Using `mvn` from path: 
/Users/sumeet.gajjar/cloudera/upstream-spark/build/apache-maven-3.6.3/bin/mvn
[INFO] +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:provided
{code}
commons-httpclient went EOL years ago, and vulnerabilities against it are most 
likely no longer reported as CVEs, so we should remove it.






[jira] [Commented] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.

2021-04-11 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318679#comment-17318679
 ] 

Sumeet commented on SPARK-35011:


[~holden], [~dongjoon], [~attilapiros] could you please take a look at this?

> Avoid Block Manager registrations when StopExecutor msg is in-flight.
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Sumeet
>Priority: Major
>  Labels: BlockManager, core
>
> *Note:* This is a follow-up on SPARK-34949, even after the heartbeat fix, 
> driver reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
> executorEndpoint
>  * "CoarseGrainedSchedulerBackend" removes that executor from Driver's 
> internal data structures and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus".
>  * Executor has still not processed "StopExecutor" from the Driver
>  * Driver receives heartbeat from the Executor, since it cannot find the 
> "executorId" in its data structures, it responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" 
> and "SparkListenerBlockManagerAdded" is published on the "listenerBus"
>  * Executor starts processing the "StopExecutor" and exits
>  * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
> updates "AppStatusStore"
>  * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list 
> of executors which returns the dead executor as alive.
>  
> *Proposed Solution:*
> Maintain a Cache of recently removed executors on Driver. During the 
> registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
> recently removed executor, return None indicating the registration is ignored 
> since the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
> executor, return true indicating the driver knows about it, thereby 
> preventing reregistration.






[jira] [Updated] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.

2021-04-09 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-35011:
---
Description: 
*Note:* This is a follow-up on SPARK-34949, even after the heartbeat fix, 
driver reports dead executors as alive.

*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
so, when the executors were torn down due to 
"spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor pods 
being removed from K8s, however, under the "Executors" tab in SparkUI, I could 
see some executors listed as alive. 

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
 also returned a value greater than 1. 

 

*Cause:*
 * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the 
executorEndpoint.
 * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's 
internal data structures and publishes "SparkListenerExecutorRemoved" on the 
"listenerBus".
 * The Executor has still not processed "StopExecutor" from the Driver.
 * The Driver receives a heartbeat from the Executor; since it cannot find the 
"executorId" in its data structures, it responds with 
"HeartbeatResponse(reregisterBlockManager = true)".
 * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" and 
"SparkListenerBlockManagerAdded" is published on the "listenerBus".
 * The Executor starts processing the "StopExecutor" and exits.
 * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and 
updates "AppStatusStore".
 * "statusTracker.getExecutorInfos" consults "AppStatusStore" to get the list 
of executors, which returns the dead executor as alive.

 

*Proposed Solution:*

Maintain a cache of recently removed executors on the Driver. During 
registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a 
recently removed executor, return None, indicating the registration is ignored 
since the executor will be shutting down soon.

On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
executor, return true, indicating the driver already knows about it, thereby 
preventing reregistration.

  was:
*Note:* This is a follow-up on SPARK-34949, even after the heartbeat fix, 
driver reports dead executors as alive.



*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
so, when the executors were torn down due to 
"spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor pods 
being removed from K8s, however, under the "Executors" tab in SparkUI, I could 
see some executors listed as alive. 

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
 also returned a value greater than 1. 

 

*Cause:*

 
 *  "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
executorEndpoint
 * "CoarseGrainedSchedulerBackend" removes that executor from Driver's internal 
data structures and publishes "SparkListenerExecutorRemoved" on the 
"listenerBus".
 * Executor has still not processed "StopExecutor" from the Driver
 * Driver receives heartbeat from the Executor, since it cannot find the 
"executorId" in its data structures, it responds with 
"HeartbeatResponse(reregisterBlockManager = true)"
 * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" and 
"SparkListenerBlockManagerAdded" is published on the "listenerBus"
 * Executor starts processing the "StopExecutor" and exits
 * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
updates "AppStatusStore"
 * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list of 
executors which returns the dead executor as alive.

 

*Proposed Solution:*

Maintain a Cache of recently removed executors on Driver. During the 
registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
recently removed executor, return None indicating the registration is ignored 
since the executor will be shutting down soon.

On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
executor, return true indicating the driver knows about it, thereby preventing 
reregistration.


> Avoid Block Manager registrations when StopExecutor msg is in-flight.
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Sumeet
>Priority: Major
>  Labels: BlockManager, core
>
> *Note:* This is a follow-up on SPARK-34949, even 

[jira] [Commented] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.

2021-04-09 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318174#comment-17318174
 ] 

Sumeet commented on SPARK-35011:


I am working on this.

> Avoid Block Manager registrations when StopExecutor msg is in-flight.
> --
>
> Key: SPARK-35011
> URL: https://issues.apache.org/jira/browse/SPARK-35011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Sumeet
>Priority: Major
>  Labels: BlockManager, core
>
> *Note:* This is a follow-up on SPARK-34949, even after the heartbeat fix, 
> driver reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues async "StopExecutor" on 
> executorEndpoint
>  * "CoarseGrainedSchedulerBackend" removes that executor from Driver's 
> internal data structures and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus".
>  * Executor has still not processed "StopExecutor" from the Driver
>  * Driver receives heartbeat from the Executor, since it cannot find the 
> "executorId" in its data structures, it responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" 
> and "SparkListenerBlockManagerAdded" is published on the "listenerBus"
>  * Executor starts processing the "StopExecutor" and exits
>  * "AppStatusListener" picks the "SparkListenerBlockManagerAdded" event and 
> updates "AppStatusStore"
>  * "statusTracker.getExecutorInfos" refers "AppStatusStore" to get the list 
> of executors which returns the dead executor as alive.
>  
> *Proposed Solution:*
> Maintain a Cache of recently removed executors on Driver. During the 
> registration in BlockManagerMasterEndpoint if the BlockManager belongs to a 
> recently removed executor, return None indicating the registration is ignored 
> since the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
> executor, return true indicating the driver knows about it, thereby 
> preventing reregistration.






[jira] [Created] (SPARK-35011) Avoid Block Manager registrations when StopExecutor msg is in-flight.

2021-04-09 Thread Sumeet (Jira)
Sumeet created SPARK-35011:
--

 Summary: Avoid Block Manager registrations when StopExecutor msg 
is in-flight.
 Key: SPARK-35011
 URL: https://issues.apache.org/jira/browse/SPARK-35011
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1, 3.2.0
Reporter: Sumeet


*Note:* This is a follow-up on SPARK-34949, even after the heartbeat fix, 
driver reports dead executors as alive.



*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
so, when the executors were torn down due to 
"spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor pods 
being removed from K8s, however, under the "Executors" tab in SparkUI, I could 
see some executors listed as alive. 

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
 also returned a value greater than 1. 

 

*Cause:*
 * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the 
executorEndpoint.
 * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's 
internal data structures and publishes "SparkListenerExecutorRemoved" on the 
"listenerBus".
 * The Executor has still not processed "StopExecutor" from the Driver.
 * The Driver receives a heartbeat from the Executor; since it cannot find the 
"executorId" in its data structures, it responds with 
"HeartbeatResponse(reregisterBlockManager = true)".
 * "BlockManager" on the Executor reregisters with the "BlockManagerMaster" and 
"SparkListenerBlockManagerAdded" is published on the "listenerBus".
 * The Executor starts processing the "StopExecutor" and exits.
 * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and 
updates "AppStatusStore".
 * "statusTracker.getExecutorInfos" consults "AppStatusStore" to get the list 
of executors, which returns the dead executor as alive.

 

*Proposed Solution:*

Maintain a cache of recently removed executors on the Driver. During 
registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a 
recently removed executor, return None, indicating the registration is ignored 
since the executor will be shutting down soon.

On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
executor, return true, indicating the driver already knows about it, thereby 
preventing reregistration.
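The proposed driver-side bookkeeping can be sketched as a small TTL cache. This is an illustrative Python model; the class, method names, and 60-second TTL are invented for the sketch and are not Spark's actual implementation:

```python
import time

class RecentlyRemovedExecutors:
    """Minimal TTL-cache sketch of the proposed driver-side bookkeeping."""

    def __init__(self, ttl_seconds=60.0):
        self._ttl = ttl_seconds
        self._removed = {}  # executor_id -> removal timestamp

    def mark_removed(self, executor_id):
        self._removed[executor_id] = time.monotonic()

    def was_recently_removed(self, executor_id):
        ts = self._removed.get(executor_id)
        if ts is None:
            return False
        if time.monotonic() - ts > self._ttl:
            del self._removed[executor_id]  # entry expired, forget it
            return False
        return True

def register_block_manager(cache, executor_id):
    # Ignore registrations from executors that are already being torn down.
    if cache.was_recently_removed(executor_id):
        return None  # registration ignored
    return f"registered-{executor_id}"

def heartbeat(cache, executor_id, known_executors):
    # Pretend to know about recently removed executors so their
    # BlockManager does not reregister while shutdown is in flight.
    return executor_id in known_executors or cache.was_recently_removed(executor_id)
```

A real version would also need bounded size and an eviction policy tied to the driver's cleanup path, which the Jira discussion would settle.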






[jira] [Updated] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-04 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-34949:
---
Affects Version/s: 3.1.1

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
> Environment: Resource Manager: K8s
>Reporter: Sumeet
>Priority: Major
>  Labels: Executor, heartbeat
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself thus creating inconsistency
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting 
> down; it should check "executorShutdown" before reregistering.






[jira] [Updated] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-03 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-34949:
---
Priority: Major  (was: Minor)

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
> Environment: Resource Manager: K8s
>Reporter: Sumeet
>Priority: Major
>  Labels: Executor, heartbeat
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself thus creating inconsistency
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting 
> down; it should check "executorShutdown" before reregistering.






[jira] [Created] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-02 Thread Sumeet (Jira)
Sumeet created SPARK-34949:
--

 Summary: Executor.reportHeartBeat reregisters blockManager even 
when Executor is shutting down
 Key: SPARK-34949
 URL: https://issues.apache.org/jira/browse/SPARK-34949
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
 Environment: Resource Manager: K8s
Reporter: Sumeet


*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
so, when the executors were torn down due to 
"spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor pods 
being removed from K8s, however, under the "Executors" tab in SparkUI, I could 
see some executors listed as alive. 

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
 also returned a value greater than 1. 

 

*Cause:*
 * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an 
"executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
"listenerBus".
 * "CoarseGrainedExecutorBackend" starts the executor shutdown.
 * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and 
removes the executor from "executorLastSeen".
 * In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" now 
cannot find the "executorId" in "executorLastSeen" and hence responds with 
"HeartbeatResponse(reregisterBlockManager = true)".
 * The Executor then calls "env.blockManager.reregister()" and reregisters 
itself, creating an inconsistency.

 

*Proposed Solution:*

The "reportHeartBeat" method is not aware that the Executor is shutting down; 
it should check "executorShutdown" before reregistering.
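The proposed check can be sketched as follows. This is a toy Python model of the heartbeat path with invented names (`ExecutorSketch`, `report_heartbeat`) that mirror the description above, not Spark's real Executor code:

```python
import threading

class ExecutorSketch:
    """Toy model of the executor heartbeat path for illustration only."""

    def __init__(self):
        self.executor_shutdown = threading.Event()  # stands in for "executorShutdown"
        self.reregistrations = 0

    def reregister_block_manager(self):
        self.reregistrations += 1

    def report_heartbeat(self, response_reregister: bool):
        # Proposed fix: ignore the driver's reregister request once
        # shutdown has started, instead of blindly reregistering.
        if response_reregister and not self.executor_shutdown.is_set():
            self.reregister_block_manager()

ex = ExecutorSketch()
ex.report_heartbeat(response_reregister=True)   # normal case: reregisters
ex.executor_shutdown.set()                      # shutdown begins
ex.report_heartbeat(response_reregister=True)   # ignored during shutdown
print(ex.reregistrations)  # 1
```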






[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-02 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314194#comment-17314194
 ] 

Sumeet commented on SPARK-34949:


I am working on this.

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
> Environment: Resource Manager: K8s
>Reporter: Sumeet
>Priority: Minor
>  Labels: Executor, heartbeat
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s, however, under the "Executors" tab in SparkUI, I 
> could see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself thus creating inconsistency
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting 
> down; it should check "executorShutdown" before reregistering.






[jira] [Updated] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when a timestamp is present in the predicate

2020-08-13 Thread Sumeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet updated SPARK-32611:
---
Description: 
*How to reproduce this behavior?*
 * TZ="America/Los_Angeles" ./bin/spark-shell
 * sql("set spark.sql.hive.convertMetastoreOrc=true")
 * sql("set spark.sql.orc.impl=hive")
 * sql("create table t_spark(col timestamp) stored as orc;")
 * sql("insert into t_spark values (cast('2100-01-01 
01:33:33.123America/Los_Angeles' as timestamp));")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return empty results, which is incorrect.*
 * sql("set spark.sql.orc.impl=native")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*

 

With spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.impl=hive, the 
above query returns *correct results if pushdown filters are turned off*. 
 * sql("set spark.sql.orc.filterPushdown=false")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*

  was:
*How to reproduce this behavior?*
 * TZ="America/Los_Angeles" ./bin/spark-shell --conf 
spark.sql.catalogImplementation=hive
 * sql("set spark.sql.hive.convertMetastoreOrc=true")
 * sql("set spark.sql.orc.impl=hive")
 * sql("create table t_spark(col timestamp) stored as orc;")
 * sql("insert into t_spark values (cast('2100-01-01 
01:33:33.123America/Los_Angeles' as timestamp));")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return empty results, which is incorrect.*
 * sql("set spark.sql.orc.impl=native")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*

 

With spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.impl=hive, the 
above query returns *correct results if pushdown filters are turned off*. 
 * sql("set spark.sql.orc.filterPushdown=false")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*


> Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect 
> results when a timestamp is present in the predicate
> 
>
> Key: SPARK-32611
> URL: https://issues.apache.org/jira/browse/SPARK-32611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sumeet
>Priority: Major
>
> *How to reproduce this behavior?*
>  * TZ="America/Los_Angeles" ./bin/spark-shell
>  * sql("set spark.sql.hive.convertMetastoreOrc=true")
>  * sql("set spark.sql.orc.impl=hive")
>  * sql("create table t_spark(col timestamp) stored as orc;")
>  * sql("insert into t_spark values (cast('2100-01-01 
> 01:33:33.123America/Los_Angeles' as timestamp));")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return empty results, which is incorrect.*
>  * sql("set spark.sql.orc.impl=native")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*
>  
> With spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.impl=hive, 
> the above query returns *correct results if pushdown filters are turned 
> off*. 
>  * sql("set spark.sql.orc.filterPushdown=false")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*






[jira] [Created] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when a timestamp is present in the predicate

2020-08-13 Thread Sumeet (Jira)
Sumeet created SPARK-32611:
--

 Summary: Querying ORC table in Spark3 using 
spark.sql.orc.impl=hive produces incorrect results when a timestamp is present 
in the predicate
 Key: SPARK-32611
 URL: https://issues.apache.org/jira/browse/SPARK-32611
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.0.1
Reporter: Sumeet


*How to reproduce this behavior?*
 * TZ="America/Los_Angeles" ./bin/spark-shell --conf 
spark.sql.catalogImplementation=hive
 * sql("set spark.sql.hive.convertMetastoreOrc=true")
 * sql("set spark.sql.orc.impl=hive")
 * sql("create table t_spark(col timestamp) stored as orc;")
 * sql("insert into t_spark values (cast('2100-01-01 
01:33:33.123America/Los_Angeles' as timestamp));")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return empty results, which is incorrect.*
 * sql("set spark.sql.orc.impl=native")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*

 

With spark.sql.hive.convertMetastoreOrc=true and spark.sql.orc.impl=hive, the 
above query returns *correct results if pushdown filters are turned off*. 
 * sql("set spark.sql.orc.filterPushdown=false")
 * sql("select col, date_format(col, 'DD') from t_spark where col = 
cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
 *This will return 1 row, which is the expected output.*
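
The symptom, wrong results with pushdown on but correct results with pushdown
off, is characteristic of a lossy predicate-pushdown translation. The toy model
below is NOT Spark's actual ORC reader code (the real root cause is not shown
here); it only illustrates, with plain Python datetimes, how a reader-side
filter that loses sub-second precision can drop a row that the engine-side
filter would keep. Note the literal in the repro above has a .123 millisecond
component:

```python
from datetime import datetime

# One "row" in the file and the query's predicate literal,
# both with millisecond precision (as in the repro: ...01:33:33.123).
rows = [datetime(2100, 1, 1, 1, 33, 33, 123000)]
predicate = datetime(2100, 1, 1, 1, 33, 33, 123000)

def lossy_pushdown_scan(rows, predicate):
    # Hypothetical buggy pushdown: the reader compares at whole-second
    # precision, so the .123 ms row never matches and is filtered out
    # before the engine ever sees it.
    return [r for r in rows if r.replace(microsecond=0) == predicate]

def scan_without_pushdown(rows, predicate):
    # spark.sql.orc.filterPushdown=false analogue: read every row,
    # then apply the exact predicate in the engine.
    return [r for r in rows if r == predicate]

print(len(lossy_pushdown_scan(rows, predicate)))    # 0 rows: wrong
print(len(scan_without_pushdown(rows, predicate)))  # 1 row: expected
```

This also explains why `set spark.sql.orc.filterPushdown=false` is an
effective workaround: the exact filter still runs in the engine, so no
correctness is lost, only the scan does more work.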


