[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-30 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780929#comment-17780929
 ] 

Lantao Jin commented on SPARK-44124:


Thanks, I have created three sub-tasks so far. [~Junyu Chen], could you start by 
filing a PR for the first one?

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45721) Upgrade AWS SDK to v2 for Hadoop dependency

2023-10-30 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-45721:
--

 Summary: Upgrade AWS SDK to v2 for Hadoop dependency
 Key: SPARK-45721
 URL: https://issues.apache.org/jira/browse/SPARK-45721
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.5.0
Reporter: Lantao Jin


Hadoop plans to ship the SDK v2 upgrade in 3.4.0, as tracked in 
[HADOOP-18073|https://issues.apache.org/jira/browse/HADOOP-18073]. One of the 
Hadoop modules that Spark relies on is *hadoop-aws*, which provides the S3A 
connector that allows Spark to access data in S3 buckets. Because hadoop-aws 
currently depends on AWS SDK v1, we should also update the Hadoop version to 3.4.0.
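
For context, here is a minimal Scala sketch of the kind of Spark code that exercises 
the S3A connector shipped in hadoop-aws (and therefore the AWS SDK it pulls in). This 
is not code from this ticket: the bucket path is a placeholder, and the 
credentials-provider value is an assumption to double-check against the S3A 
documentation for the Hadoop release in use, since the SDK v2 migration affects which 
provider classes are available.
{code}
import org.apache.spark.sql.SparkSession

// Minimal sketch: reading Parquet data through the S3A connector from hadoop-aws.
// fs.s3a.aws.credentials.provider is a real S3A option; the provider class shown
// here is only an illustrative choice.
val spark = SparkSession.builder()
  .appName("s3a-read-sketch")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
          "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
  .getOrCreate()

// Placeholder bucket and path.
val df = spark.read.parquet("s3a://my-example-bucket/path/to/data")
df.show(10)
{code}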






[jira] [Created] (SPARK-45719) Upgrade AWS SDK to v2 for Kubernetes integration tests

2023-10-30 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-45719:
--

 Summary: Upgrade AWS SDK to v2 for Kubernetes integration tests
 Key: SPARK-45719
 URL: https://issues.apache.org/jira/browse/SPARK-45719
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes, Spark Core
Affects Versions: 3.5.0
Reporter: Lantao Jin


Sub-task of [SPARK-44124|https://issues.apache.org/jira/browse/SPARK-44124]. In 
this issue, we will upgrade the AWS SDK used by the credentials providers, the AWS 
clients, and the related Kubernetes integration tests.
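
For orientation, a hedged Scala sketch (illustrative only, not the actual Spark test 
code) of what moving a client and its credentials provider from AWS SDK v1 to v2 
looks like:
{code}
// AWS SDK v1 style: monolithic package, chained credentials provider.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val v1Client = AmazonS3ClientBuilder.standard()
  .withCredentials(new DefaultAWSCredentialsProviderChain())
  .withRegion("us-west-2")
  .build()

// AWS SDK v2 equivalent: new package names, builder-based clients and providers.
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client

val v2Client = S3Client.builder()
  .credentialsProvider(DefaultCredentialsProvider.create())
  .region(Region.US_WEST_2)
  .build()
{code}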






[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779407#comment-17779407
 ] 

Lantao Jin commented on SPARK-44124:


Is it possible to convert this JIRA from a sub-task into a top-level issue, so that 
more sub-tasks can be added under it? [~dongjoon]

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing






[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779405#comment-17779405
 ] 

Lantao Jin commented on SPARK-44124:


Added a design doc (more like a plan) to the description.

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing






[jira] [Updated] (SPARK-44124) Upgrade AWS SDK to v2

2023-10-25 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-44124:
---
Description: 
Here is a design doc:
https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Here is a design doc:
> https://docs.google.com/document/d/1nGWbGTqxuFBG2ftfYYXxzrkipINILfWCOwse36yg7Ig/edit?usp=sharing






[jira] [Commented] (SPARK-44124) Upgrade AWS SDK to v2

2023-06-29 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738865#comment-17738865
 ] 

Lantao Jin commented on SPARK-44124:


Hi [~dongjoon], this is Lantao from AWS. We have recently been working on upgrading 
Spark to AWS SDK v2, so I am going to take this issue and will upload a document 
next week.

> Upgrade AWS SDK to v2
> -
>
> Key: SPARK-44124
> URL: https://issues.apache.org/jira/browse/SPARK-44124
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34122:
--

 Summary: Remove duplicated branches in case when
 Key: SPARK-34122
 URL: https://issues.apache.org/jira/browse/SPARK-34122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Lantao Jin


CaseWhen expressions with duplicated branches can be deduplicated, for example:
{code}
SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
ELSE 2 END FROM testData WHERE key = 1 group by key
{code}
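
For illustration, a minimal sketch (assuming a SparkSession and a registered view 
named testData with an integer key column; this is not the optimizer rule itself) of 
the single-branch form such a rule would produce:
{code}
import org.apache.spark.sql.SparkSession

// The duplicated WHEN branches above carry no extra information, so the query is
// equivalent to this single-branch form.
val spark = SparkSession.builder().appName("casewhen-dedup-sketch").getOrCreate()

spark.sql(
  """SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END
    |FROM testData WHERE key = 1 GROUP BY key""".stripMargin
).explain(true)  // with a dedup rule, the optimized plan keeps a single WHEN branch
{code}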






[jira] [Resolved] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions

2021-01-12 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-34082.

Resolution: Invalid

Closing this because {{cannot resolve 'b' given input columns}} seems to be a correct 
error message: Filter should be resolved before Projection. I was confused by the 
QUALIFY syntax in our internal Spark version.

> Window expression with alias inside WHERE and HAVING clauses fail with 
> non-descriptive exceptions
> -
>
> Key: SPARK-34082
> URL: https://issues.apache.org/jira/browse/SPARK-34082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But 
> if the window expression with alias inside WHERE and HAVING clauses, Spark 
> does not handle this explicitly and will fail with non-descriptive exceptions.
> {code}
> SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1
> {code}
> {code}
> cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b]
> {code}
> {code}
> SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s
> FROM testData2
> GROUP BY a
> HAVING SUM(b) = 5 AND s = 1
> {code}
> {code}
> cannot resolve '`b`' given input columns: [testdata2.a, max(b)]
> {code}






[jira] [Closed] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions

2021-01-12 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin closed SPARK-34082.
--

> Window expression with alias inside WHERE and HAVING clauses fail with 
> non-descriptive exceptions
> -
>
> Key: SPARK-34082
> URL: https://issues.apache.org/jira/browse/SPARK-34082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But 
> if the window expression with alias inside WHERE and HAVING clauses, Spark 
> does not handle this explicitly and will fail with non-descriptive exceptions.
> {code}
> SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1
> {code}
> {code}
> cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b]
> {code}
> {code}
> SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s
> FROM testData2
> GROUP BY a
> HAVING SUM(b) = 5 AND s = 1
> {code}
> {code}
> cannot resolve '`b`' given input columns: [testdata2.a, max(b)]
> {code}






[jira] [Created] (SPARK-34082) Window expression with alias inside WHERE and HAVING clauses fail with non-descriptive exceptions

2021-01-11 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34082:
--

 Summary: Window expression with alias inside WHERE and HAVING 
clauses fail with non-descriptive exceptions
 Key: SPARK-34082
 URL: https://issues.apache.org/jira/browse/SPARK-34082
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.2.0, 3.1.1
Reporter: Lantao Jin


SPARK-24575 prohibits window expressions inside WHERE and HAVING clauses. But when a 
window expression is referenced through its alias inside a WHERE or HAVING clause, 
Spark does not handle this explicitly and fails with non-descriptive exceptions.
{code}
SELECT a, RANK() OVER(ORDER BY b) AS s FROM testData2 WHERE b = 2 AND s = 1
{code}
{code}
cannot resolve '`s`' given input columns: [testdata2.a, testdata2.b]
{code}
{code}
SELECT a, MAX(b), RANK() OVER(ORDER BY a) AS s
FROM testData2
GROUP BY a
HAVING SUM(b) = 5 AND s = 1
{code}
{code}
cannot resolve '`b`' given input columns: [testdata2.a, max(b)]
{code}
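
For readers hitting this, a workaround sketch (illustrative only, not part of the 
ticket): compute the window expression in a subquery and filter on its alias from the 
outer query, using the same testData2 table as above:
{code}
import org.apache.spark.sql.SparkSession

// The alias `s` becomes a regular column of the subquery, so the outer WHERE clause
// can filter on it without referencing a window expression.
val spark = SparkSession.builder().appName("window-alias-filter-sketch").getOrCreate()

spark.sql(
  """SELECT a, s FROM (
    |  SELECT a, RANK() OVER (ORDER BY b) AS s FROM testData2 WHERE b = 2
    |) t
    |WHERE s = 1""".stripMargin
).show()
{code}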







[jira] [Updated] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-34064:
---
Description: 
SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the problem that 
a broadcast job is not aborted when a broadcast timeout happens. Since the runId is a 
random UUID, when a SQL statement is cancelled, its broadcast sub-jobs are still not 
cancelled along with it.

 !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 


  was:
SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the problem 
that a broadcast job is not aborted when broadcast timeout happens. Since the 
runId is a random UUID, when a SQL statement is cancelled, these broadcast 
sub-jobs still not canceled as a whole.

 !Screen Shot 2021-01-11 at 12.03.13 PM.png | width=100%! 



> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 






[jira] [Updated] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-34064:
---
Description: 
SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the problem that 
a broadcast job is not aborted when a broadcast timeout happens. Since the runId is a 
random UUID, when a SQL statement is cancelled, its broadcast sub-jobs are still not 
cancelled along with it.

 !Screen Shot 2021-01-11 at 12.03.13 PM.png | width=100%! 


  was:
SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the problem 
that a broadcast job is not aborted when broadcast timeout happens. Since the 
runId is a random UUID, when a SQL statement is cancelled, these broadcast 
sub-jobs still not canceled as a whole.



> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png | width=100%! 






[jira] [Updated] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-34064:
---
Attachment: Screen Shot 2021-01-11 at 12.03.13 PM.png

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs still not canceled as a whole.






[jira] [Created] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-10 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34064:
--

 Summary: Broadcast job is not aborted even the SQL statement 
canceled
 Key: SPARK-34064
 URL: https://issues.apache.org/jira/browse/SPARK-34064
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.2.0, 3.1.1
Reporter: Lantao Jin
 Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png

SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the problem that 
a broadcast job is not aborted when a broadcast timeout happens. Since the runId is a 
random UUID, when a SQL statement is cancelled, its broadcast sub-jobs are still not 
cancelled along with it.
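
For background, a hedged sketch of the SparkContext job-group mechanism involved (the 
group id below is a placeholder, and this is not the fix itself): the broadcast 
sub-jobs run under BroadcastExchangeExec's own random runId group, so a 
statement-level cancellation like the one sketched here does not reach them.
{code}
import org.apache.spark.sql.SparkSession

// Jobs submitted under a job group can be cancelled together; broadcast sub-jobs are
// grouped under the separate random runId instead, so they are missed by this call.
val spark = SparkSession.builder().appName("job-group-cancel-sketch").getOrCreate()
val sc = spark.sparkContext

sc.setJobGroup("statement-123", "example SQL statement", interruptOnCancel = true)
// ... jobs launched on this thread now belong to group "statement-123" ...

sc.cancelJobGroup("statement-123")  // cancels only jobs tagged with this group id
{code}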







[jira] [Updated] (SPARK-34000) ExecutorAllocationListener threw an exception java.util.NoSuchElementException

2021-01-04 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-34000:
---
Affects Version/s: 3.0.1

> ExecutorAllocationListener threw an exception java.util.NoSuchElementException
> --
>
> Key: SPARK-34000
> URL: https://issues.apache.org/jira/browse/SPARK-34000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0, 3.2.0
>Reporter: Lantao Jin
>Priority: Major
>
> 21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 
> : Lost task 306.1 in stage 600.0 (TID 283610, 
> hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): 
> TaskKilled (another attempt succeeded)
> 21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 
> : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be 
> re-executed (either because the task failed with a shuffle data fetch 
> failure, so the
> previous stage needs to be re-run, or because a different copy of the task 
> has already succeeded).
> 21/01/04 03:00:32,259 INFO [task-result-getter-2] 
> cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all 
> completed, from pool default
> 21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] 
> thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 
> 50 rows from offsets [5378600, 5378650) with 
> 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
> 21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] 
> scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an 
> exception
> java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
> at scala.collection.MapLike.default(MapLike.scala:235)
> at scala.collection.MapLike.default$(MapLike.scala:234)
> at scala.collection.AbstractMap.default(Map.scala:63)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
> at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
> at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
> at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
> at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)






[jira] [Updated] (SPARK-34000) ExecutorAllocationListener threw an exception java.util.NoSuchElementException

2021-01-04 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-34000:
---
Affects Version/s: (was: 3.0.1)

> ExecutorAllocationListener threw an exception java.util.NoSuchElementException
> --
>
> Key: SPARK-34000
> URL: https://issues.apache.org/jira/browse/SPARK-34000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Lantao Jin
>Priority: Major
>
> 21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 
> : Lost task 306.1 in stage 600.0 (TID 283610, 
> hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): 
> TaskKilled (another attempt succeeded)
> 21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 
> : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be 
> re-executed (either because the task failed with a shuffle data fetch 
> failure, so the
> previous stage needs to be re-run, or because a different copy of the task 
> has already succeeded).
> 21/01/04 03:00:32,259 INFO [task-result-getter-2] 
> cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all 
> completed, from pool default
> 21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] 
> thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 
> 50 rows from offsets [5378600, 5378650) with 
> 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
> 21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] 
> scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an 
> exception
> java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
> at scala.collection.MapLike.default(MapLike.scala:235)
> at scala.collection.MapLike.default$(MapLike.scala:234)
> at scala.collection.AbstractMap.default(Map.scala:63)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
> at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
> at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
> at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
> at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)






[jira] [Created] (SPARK-34000) ExecutorAllocationListener threw an exception java.util.NoSuchElementException

2021-01-04 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34000:
--

 Summary: ExecutorAllocationListener threw an exception 
java.util.NoSuchElementException
 Key: SPARK-34000
 URL: https://issues.apache.org/jira/browse/SPARK-34000
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 3.1.0, 3.2.0
Reporter: Lantao Jin


21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : 
Lost task 306.1 in stage 600.0 (TID 283610, 
hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): 
TaskKilled (another attempt succeeded)
21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : 
Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be 
re-executed (either because the task failed with a shuffle data fetch failure, 
so the
previous stage needs to be re-run, or because a different copy of the task has 
already succeeded).
21/01/04 03:00:32,259 INFO [task-result-getter-2] 
cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all 
completed, from pool default
21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] 
thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 
rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] 
scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an 
exception
java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
at scala.collection.MapLike.default(MapLike.scala:235)
at scala.collection.MapLike.default$(MapLike.scala:234)
at scala.collection.AbstractMap.default(Map.scala:63)
at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
at 
org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
at 
org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
at 
org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
at 
org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
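
The listener dies because a plain HashMap apply() lookup throws once the stage key 
has already been removed. Below is a hedged sketch of the defensive-lookup pattern 
that avoids this failure mode (illustrative only; the map name is modeled on the 
listener's internal bookkeeping, but this is neither the actual 
ExecutorAllocationManager code nor its fix):
{code}
import scala.collection.mutable

// Sketch: prefer Option-returning lookups over apply() in listener callbacks, so a
// racy removal of the stage key does not kill the whole listener.
val stageAttemptToNumTasks = mutable.HashMap[(Int, Int), Int]()

def onTaskEnd(stageId: Int, stageAttemptId: Int): Unit = {
  stageAttemptToNumTasks.get((stageId, stageAttemptId)) match {
    case Some(numTasks) =>
      // Stage is still tracked: update bookkeeping as usual.
      println(s"stage $stageId.$stageAttemptId has $numTasks tasks")
    case None =>
      // Stage already cleaned up (e.g. on stage completion): ignore instead of throwing.
  }
}
{code}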






[jira] [Created] (SPARK-33014) Multiple bucket column not works in DataSourceV2 table

2020-09-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-33014:
--

 Summary: Multiple bucket column not works in DataSourceV2 table
 Key: SPARK-33014
 URL: https://issues.apache.org/jira/browse/SPARK-33014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Lantao Jin


Multiple bucket columns are not supported in the current DataSourceV2 (DSv2) implementation.






[jira] [Updated] (SPARK-32994) Heavy external accumulators may lead driver full GC problem

2020-09-25 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Summary: Heavy external accumulators may lead driver full GC problem  (was: 
External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
lead driver full GC problem)

> Heavy external accumulators may lead driver full GC problem
> ---
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-24 at 5.19.26 PM.png, Screen Shot 
> 2020-09-24 at 5.19.58 PM.png, Screen Shot 2020-09-25 at 11.32.51 AM.png, 
> Screen Shot 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 
> AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 
> From above heap dump, Delta uses a SetAccumulator to records touched files 
> names
> {code}
> // Accumulator to collect all the distinct touched files
> val touchedFilesAccum = new SetAccumulator[String]()
> spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)
> // UDFs to records touched files names and add them to the accumulator
> val recordTouchedFileName = udf { (fileName: String) => {
>   touchedFilesAccum.add(fileName)
>   1
> }}.asNondeterministic()
> {code}
> In a big query, each task may hold thousands of file names, and if a stage 
> contains dozens of thousands of tasks, DAGscheduler may hold millions of 
> `CompletionEvent`. And each `CompletionEvent` holds the thousands of file 
> names in its `accumUpdates`. All accumulator objects will use Spark listener 
> event to deliver to the event loop and even a full GC can not release memory.
> A PR will be submitted. With the patch, the memory problem was gone.
> Before the patch: A full GC doesn't help.
>  !Screen Shot 2020-09-24 at 5.19.58 PM.png|width=70%! 
> After the patch: No full GC and memory is not ramp up.
>  !Screen Shot 2020-09-24 at 5.19.26 PM.png|width=70%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 

From the above heap dump, we can see that Delta uses a SetAccumulator to record the 
names of the touched files:
{code}
// Accumulator to collect all the distinct touched files
val touchedFilesAccum = new SetAccumulator[String]()
spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)

// UDFs to records touched files names and add them to the accumulator
val recordTouchedFileName = udf { (fileName: String) => {
  touchedFilesAccum.add(fileName)
  1
}}.asNondeterministic()
{code}

In a big query, each task may hold thousands of file names, and if a stage contains 
tens of thousands of tasks, the DAGScheduler may hold millions of `CompletionEvent`s, 
each of which carries those thousands of file names in its `accumUpdates`. All 
accumulator objects are delivered to the event loop via Spark listener events, and 
even a full GC cannot release the memory.

A PR will be submitted. With the patch, the memory problem is gone.
Before the patch: a full GC doesn't help.
 !Screen Shot 2020-09-24 at 5.19.58 PM.png|width=70%! 
After the patch: no full GC, and memory does not ramp up.
 !Screen Shot 2020-09-24 at 5.19.26 PM.png|width=70%! 
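
To see why this overwhelms the driver, a rough back-of-envelope sketch (the numbers 
below are assumptions chosen only to illustrate the scale, not measurements from this 
ticket):
{code}
// Illustrative arithmetic only: 50,000 tasks x 2,000 file names per task x ~100 bytes
// per name is on the order of 10 GB of strings referenced from queued listener events,
// before counting retries or additional accumulators.
val tasks = 50000L
val fileNamesPerTask = 2000L
val bytesPerFileName = 100L
val approxBytes = tasks * fileNamesPerTask * bytesPerFileName
println(s"~${approxBytes / (1L << 30)} GiB of accumulator payload held on the driver")
{code}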


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 

>From above heap dump, Delta uses a SetAccumulator to records touched files 
>names
{code}
// Accumulator to collect all the distinct touched files
val touchedFilesAccum = new SetAccumulator[String]()
spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)

// UDFs to records touched files names and add them to the accumulator
val recordTouchedFileName = udf { (fileName: String) => {
  touchedFilesAccum.add(fileName)
  1
}}.asNondeterministic()
{code}

In a big query, each task may hold thousands of file names, and if a stage 
contains dozens of thousands of tasks, DAGscheduler may hold millions of 
`CompletionEvent`. And each `CompletionEvent` holds the thousands of file names 
in its `accumUpdates`. All accumulator objects will use Spark listener event to 
deliver to the event loop and even a full GC can not release memory.

A PR will be submitted. With the patch, the memory problem was gone.
Before the patch:
 !Screen Shot 2020-09-24 at 5.19.58 PM.png|width=70%! 
After the patch:
 !Screen Shot 2020-09-24 at 5.19.26 PM.png|width=70%! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-24 at 5.19.26 PM.png, Screen Shot 
> 2020-09-24 at 5.19.58 PM.png, Screen Shot 2020-09-25 at 11.32.51 AM.png, 
> Screen Shot 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 
> AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 
> From above heap dump, Delta uses a SetAccumulator to records touched files 
> names
> {code}
> // Accumulator to collect all the distinct touched files
> val touchedFilesAccum = new SetAccumulator[String]()
> spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)
> // UDFs to records touched files 

[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 

From the above heap dump, we can see that Delta uses a SetAccumulator to record the 
names of the touched files:
{code}
// Accumulator to collect all the distinct touched files
val touchedFilesAccum = new SetAccumulator[String]()
spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)

// UDFs to records touched files names and add them to the accumulator
val recordTouchedFileName = udf { (fileName: String) => {
  touchedFilesAccum.add(fileName)
  1
}}.asNondeterministic()
{code}

In a big query, each task may hold thousands of file names, and if a stage contains 
tens of thousands of tasks, the DAGScheduler may hold millions of `CompletionEvent`s, 
each of which carries those thousands of file names in its `accumUpdates`. All 
accumulator objects are delivered to the event loop via Spark listener events, and 
even a full GC cannot release the memory.

A PR will be submitted. With the patch, the memory problem is gone.
Before the patch:
 !Screen Shot 2020-09-24 at 5.19.58 PM.png|width=70%! 
After the patch:
 !Screen Shot 2020-09-24 at 5.19.26 PM.png|width=70%! 


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-24 at 5.19.26 PM.png, Screen Shot 
> 2020-09-24 at 5.19.58 PM.png, Screen Shot 2020-09-25 at 11.32.51 AM.png, 
> Screen Shot 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 
> AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 
> From above heap dump, Delta uses a SetAccumulator to records touched files 
> names
> {code}
> // Accumulator to collect all the distinct touched files
> val touchedFilesAccum = new SetAccumulator[String]()
> spark.sparkContext.register(touchedFilesAccum, TOUCHED_FILES_ACCUM_NAME)
> // UDFs to records touched files names and add them to the accumulator
> val recordTouchedFileName = udf { (fileName: String) => {
>   touchedFilesAccum.add(fileName)
>   1
> }}.asNondeterministic()
> {code}
> In a big query, each task may hold thousands of file names, and if a stage 
> contains dozens of thousands of tasks, DAGscheduler may hold millions of 
> `CompletionEvent`. And each `CompletionEvent` holds the thousands of file 
> names in its `accumUpdates`. All accumulator objects will use Spark listener 
> event to deliver to the event loop and even a full GC can not release memory.
> A PR will be submitted. With the patch, the memory problem was gone.
> Before the patch:
>  !Screen Shot 2020-09-24 at 5.19.58 PM.png|width=70%! 
> After the patch:
>  !Screen Shot 2020-09-24 at 5.19.26 PM.png|width=70%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Attachment: Screen Shot 2020-09-24 at 5.19.26 PM.png

> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-24 at 5.19.26 PM.png, Screen Shot 
> 2020-09-24 at 5.19.58 PM.png, Screen Shot 2020-09-25 at 11.32.51 AM.png, 
> Screen Shot 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 
> AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Attachment: Screen Shot 2020-09-24 at 5.19.58 PM.png

> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-24 at 5.19.58 PM.png, Screen Shot 
> 2020-09-25 at 11.32.51 AM.png, Screen Shot 2020-09-25 at 11.35.01 AM.png, 
> Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=70%! 


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=100%! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=70%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=70%! 
 !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=70%! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=70%! 
>  !Screen Shot 2020-09-25 at 11.35.01 AM.png|width=100%! 
>  !Screen Shot 2020-09-25 at 11.36.48 AM.png|width=100%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=100%! 


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png | width=100%! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png|width=100%! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver hit a very heavy 
full GC problem when users submitted a MERGE INTO query. The driver held over 100 GB 
of memory (depending on the configured max heap size) that could never be reclaimed 
by GC. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png! 


  was:
We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
problem (very heavy) when users submit a MERGE INTO query. The driver held over 
100GB memory (depends on how much the max heap size set) and can not be GC 
forever. By making a heap dump we found the root cause.




> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png! 






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Attachment: Screen Shot 2020-09-25 at 11.36.48 AM.png

> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Attachment: Screen Shot 2020-09-25 at 11.32.51 AM.png

> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Attachment: Screen Shot 2020-09-25 at 11.35.01 AM.png

> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake, recently we find our Spark driver faced full GC 
> problem (very heavy) when users submit a MERGE INTO query. The driver held 
> over 100GB memory (depends on how much the max heap size set) and can not be 
> GC forever. By making a heap dump we found the root cause.






[jira] [Updated] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32994:
---
Description: 
We use Spark + Delta Lake. Recently we found that our Spark driver ran into a very 
heavy full GC problem when users submitted a MERGE INTO query. The driver held over 
100 GB of memory (depending on the configured max heap size) and could never be 
garbage collected. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png | width=100%! 


  was:
We use Spark + Delta Lake. Recently we found that our Spark driver ran into a very 
heavy full GC problem when users submitted a MERGE INTO query. The driver held over 
100 GB of memory (depending on the configured max heap size) and could never be 
garbage collected. A heap dump revealed the root cause.
 !Screen Shot 2020-09-25 at 11.32.51 AM.png! 



> External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may 
> lead driver full GC problem
> -
>
> Key: SPARK-32994
> URL: https://issues.apache.org/jira/browse/SPARK-32994
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2020-09-25 at 11.32.51 AM.png, Screen Shot 
> 2020-09-25 at 11.35.01 AM.png, Screen Shot 2020-09-25 at 11.36.48 AM.png
>
>
> We use Spark + Delta Lake. Recently we found that our Spark driver ran into a very 
> heavy full GC problem when users submitted a MERGE INTO query. The driver held 
> over 100 GB of memory (depending on the configured max heap size) and could never 
> be garbage collected. A heap dump revealed the root cause.
>  !Screen Shot 2020-09-25 at 11.32.51 AM.png | width=100%! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32994) External accumulators (not start with InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem

2020-09-24 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32994:
--

 Summary: External accumulators (not start with 
InternalAccumulator.METRICS_PREFIX) may lead driver full GC problem
 Key: SPARK-32994
 URL: https://issues.apache.org/jira/browse/SPARK-32994
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.0.1, 2.4.7, 3.1.0
Reporter: Lantao Jin


We use Spark + Delta Lake. Recently we found that our Spark driver ran into a very 
heavy full GC problem when users submitted a MERGE INTO query. The driver held over 
100 GB of memory (depending on the configured max heap size) and could never be 
garbage collected. A heap dump revealed the root cause.
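
For context, a minimal sketch (job and accumulator names are hypothetical) of an 
external accumulator, i.e. one whose name does not start with 
InternalAccumulator.METRICS_PREFIX; its per-task updates are sent back to and retained 
by the driver:
{code}
// Hedged sketch: "processedRows" is an external (user-named) accumulator, so its
// name does not carry the internal metrics prefix and its updates are kept on the driver.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("external-accumulator-sketch").getOrCreate()
val sc = spark.sparkContext

val processedRows = sc.longAccumulator("processedRows")
sc.parallelize(1 to 1000000).foreach(_ => processedRows.add(1))
println(processedRows.value) // 1000000
{code}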





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32715) Broadcast block pieces may memory leak

2020-08-31 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32715:
---
Affects Version/s: 2.4.6
   3.0.0

> Broadcast block pieces may memory leak
> --
>
> Key: SPARK-32715
> URL: https://issues.apache.org/jira/browse/SPARK-32715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> We use the Spark thrift-server as a long-running service. A bad query submitted a 
> heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed 
> the bad query, but the driver's memory usage stayed high and full GCs kept occurring 
> very frequently. By investigating the GC dump and logs, we found that the broadcast 
> blocks may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
> 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
> 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
> 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> num #instances #bytes class name
> --
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32715) Broadcast block pieces may memory leak

2020-08-27 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32715:
---
Description: 
We use the Spark thrift-server as a long-running service. A bad query submitted a 
heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed 
the bad query, but the driver's memory usage stayed high and full GCs kept occurring 
very frequently. By investigating the GC dump and logs, we found that the broadcast 
blocks may leak memory.

2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
116G->112G(170G), 184.9121920 secs]
[Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]

num #instances #bytes class name
--
1: 676531691 72035438432 [B
2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
3: 99551 12018117568 [Ljava.lang.Object;
4: 26570 4349629040 [I
5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
6: 1708819 256299456 [C
7: 2338 179615208 [J
8: 1703669 54517408 java.lang.String
9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
10: 177396 25545024 java.net.URI
...

  was:
We use the Spark thrift-server as a long-running service. A bad query submitted a 
heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed 
the bad query, but the driver's memory usage stayed high and full GCs kept occurring 
very frequently. By investigating the GC dump and logs, we found that the broadcast 
blocks may leak memory.

2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
116G->112G(170G), 184.9121920 secs]
[Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]

num #instances #bytes class name
--
1: 676531691 72035438432 [B
2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
3: 99551 12018117568 [Ljava.lang.Object;
4: 26570 4349629040 [I
5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
6: 1708819 256299456 [C
7: 2338 179615208 [J
8: 1703669 54517408 java.lang.String
9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
10: 177396 25545024 java.net.URI
...


> Broadcast block pieces may memory leak
> --
>
> Key: SPARK-32715
> URL: https://issues.apache.org/jira/browse/SPARK-32715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> We use the Spark thrift-server as a long-running service. A bad query submitted a 
> heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed 
> the bad query, but the driver's memory usage stayed high and full GCs kept occurring 
> very frequently. By investigating the GC dump and logs, we found that the broadcast 
> blocks may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
> 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
> 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
> 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> num #instances #bytes class name
> --
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32715) Broadcast block pieces may memory leak

2020-08-27 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32715:
--

 Summary: Broadcast block pieces may memory leak
 Key: SPARK-32715
 URL: https://issues.apache.org/jira/browse/SPARK-32715
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Lantao Jin


We use the Spark thrift-server as a long-running service. A bad query submitted a 
heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed 
the bad query, but the driver's memory usage stayed high and full GCs kept occurring 
very frequently. By investigating the GC dump and logs, we found that the broadcast 
blocks may leak memory.

2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
116G->112G(170G), 184.9121920 secs]
[Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]

num #instances #bytes class name
--
1: 676531691 72035438432 [B
2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
3: 99551 12018117568 [Ljava.lang.Object;
4: 26570 4349629040 [I
5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
6: 1708819 256299456 [C
7: 2338 179615208 [J
8: 1703669 54517408 java.lang.String
9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
10: 177396 25545024 java.net.URI
...
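
Not the fix for the leak reported here, but for reference, a minimal sketch (variable 
names are hypothetical) of how a broadcast variable's blocks are normally released 
once it is no longer needed:
{code}
// Hedged sketch: explicitly unpersist and destroy a broadcast variable so that its
// block pieces can be removed from the executors and the driver.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-cleanup-sketch").getOrCreate()
val lookup = spark.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))
// ... use lookup.value inside tasks ...
lookup.unpersist(blocking = true) // drop cached copies on the executors
lookup.destroy()                  // remove all persisted blocks, including the driver's
{code}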



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181553#comment-17181553
 ] 

Lantao Jin commented on SPARK-32672:


Changed to Critical; Blocker is reserved for committers.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}
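
As a stopgap only (based on the reporter's observation above that the corruption 
disappears when cache compression is disabled), a hedged workaround sketch for a 
spark-shell session where {{spark}} is the active SparkSession:
{code}
// Hedged workaround sketch, not a fix: disable in-memory columnar compression before
// caching, trading extra memory for correct results.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
{code}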



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32672:
---
Priority: Critical  (was: Blocker)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results can 
> change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try to 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing

2020-08-19 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180365#comment-17180365
 ] 

Lantao Jin commented on SPARK-32638:


Yes. This problem exists in 3.0 and master.

The problem occurs through the following steps:
In ResolveReference.apply(), when resolving the Project, its children are all resolved:
'Project ['t.kpi_04]
+- SubqueryAlias t
   +- Union
  :- Project [a#44 AS kpi_04#46]
  :  +- SubqueryAlias test
  : +- LocalRelation , [a#44]
  +- Project [(a#44 + a#44) AS kpi_04#47]
 +- SubqueryAlias test
+- LocalRelation , [a#44]

-> 
Project [kpi_04#46]
+- SubqueryAlias t
   +- Union
  :- Project [a#44 AS kpi_04#46]
  :  +- SubqueryAlias test
  : +- LocalRelation , [a#44]
  +- Project [(a#44 + a#44) AS kpi_04#47]
 +- SubqueryAlias test
+- LocalRelation , [a#44]

After the Project is resolved, its child Union's children are changed by 
WidenSetOperationTypes.

In the next iteration, Project won't be resolved again.
!Project [kpi_04#46]
+- SubqueryAlias t
   +- Union
  :- Project [cast(kpi_04#46 as decimal(22,1)) AS kpi_04#48]
  :  +- Project [a#44 AS kpi_04#46]
  : +- SubqueryAlias test
  :+- LocalRelation , [a#44]
  +- Project [kpi_04#47]
 +- Project [CheckOverflow((promote_precision(cast(a#44 as 
decimal(22,1))) + promote_precision(cast(a#44 as decimal(22,1, 
DecimalType(22,1), true) AS kpi_04#47]
+- SubqueryAlias test
   +- LocalRelation , [a#44]

> WidenSetOperationTypes in subquery  attribute  missing
> --
>
> Key: SPARK-32638
> URL: https://issues.apache.org/jira/browse/SPARK-32638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.0
>Reporter: Guojian Li
>Priority: Major
>
> I am migrating SQL from MySQL to Spark SQL and met a very strange case. Below 
> is code to reproduce the exception:
>  
> {code:java}
> val spark = SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .getOrCreate()
> spark.sparkContext.setLogLevel("TRACE")
> val DecimalType = DataTypes.createDecimalType(20, 2)
> val schema = StructType(List(
>  StructField("a", DecimalType, true)
> ))
> val dataList = new util.ArrayList[Row]()
> val df=spark.createDataFrame(dataList,schema)
> df.printSchema()
> df.createTempView("test")
> val sql=
>  """
>  |SELECT t.kpi_04 FROM
>  |(
>  | SELECT a as `kpi_04` FROM test
>  | UNION ALL
>  | SELECT a+a as `kpi_04` FROM test
>  |) t
>  |
>  """.stripMargin
> spark.sql(sql)
> {code}
>  
> Exception Message:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. 
> Attribute(s) with the same name appear in the operation: kpi_04. Please check 
> if the right attribute(s) are used.;;
> !Project [kpi_04#2]
> +- SubqueryAlias t
>  +- Union
>  :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>  : +- Project [a#0 AS kpi_04#2]
>  : +- SubqueryAlias test
>  : +- LocalRelation , [a#0]
>  +- Project [kpi_04#3]
>  +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
>  +- SubqueryAlias test
>  +- LocalRelation , [a#0]{code}
>  
>  
> Based on the trace log, it seems that WidenSetOperationTypes adds a new outer 
> Project layer, which causes the parent query to lose its reference to the subquery. 
>  
>  
> {code:java}
>  
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes ===
> !'Project [kpi_04#2] !Project [kpi_04#2]
> !+- 'SubqueryAlias t +- SubqueryAlias t
> ! +- 'Union +- Union
> ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS 
> kpi_04#4]
> ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2]
> ! : +- LocalRelation , [a#0] : +- SubqueryAlias test
> ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3] : +- LocalRelation , [a#0]
> ! +- SubqueryAlias test +- Project [kpi_04#3]
> ! +- LocalRelation , [a#0] +- Project 
> [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
> ! +- SubqueryAlias test
> ! +- LocalRelation , [a#0]
> {code}
>  
> In the source code (WidenSetOperationTypes.scala) this is intended behavior, 
> but it possibly misses this edge case. 
> I hope someone can help me fix it. 
>  
>  
> {code:java}
> if (targetTypes.nonEmpty) {
>  // Add an extra Project if the targetTypes are different from the original 
> types.
>  children.map(widenTypes(_, targetTypes))
> } 

[jira] [Updated] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing

2020-08-19 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32638:
---
Affects Version/s: 3.0.0

> WidenSetOperationTypes in subquery  attribute  missing
> --
>
> Key: SPARK-32638
> URL: https://issues.apache.org/jira/browse/SPARK-32638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 3.0.0
>Reporter: Guojian Li
>Priority: Major
>
> I am migrating SQL from MySQL to Spark SQL and met a very strange case. Below 
> is code to reproduce the exception:
>  
> {code:java}
> val spark = SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .getOrCreate()
> spark.sparkContext.setLogLevel("TRACE")
> val DecimalType = DataTypes.createDecimalType(20, 2)
> val schema = StructType(List(
>  StructField("a", DecimalType, true)
> ))
> val dataList = new util.ArrayList[Row]()
> val df=spark.createDataFrame(dataList,schema)
> df.printSchema()
> df.createTempView("test")
> val sql=
>  """
>  |SELECT t.kpi_04 FROM
>  |(
>  | SELECT a as `kpi_04` FROM test
>  | UNION ALL
>  | SELECT a+a as `kpi_04` FROM test
>  |) t
>  |
>  """.stripMargin
> spark.sql(sql)
> {code}
>  
> Exception Message:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. 
> Attribute(s) with the same name appear in the operation: kpi_04. Please check 
> if the right attribute(s) are used.;;
> !Project [kpi_04#2]
> +- SubqueryAlias t
>  +- Union
>  :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>  : +- Project [a#0 AS kpi_04#2]
>  : +- SubqueryAlias test
>  : +- LocalRelation , [a#0]
>  +- Project [kpi_04#3]
>  +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
>  +- SubqueryAlias test
>  +- LocalRelation , [a#0]{code}
>  
>  
> Based on the trace log, it seems that WidenSetOperationTypes adds a new outer 
> Project layer, which causes the parent query to lose its reference to the subquery. 
>  
>  
> {code:java}
>  
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes ===
> !'Project [kpi_04#2] !Project [kpi_04#2]
> !+- 'SubqueryAlias t +- SubqueryAlias t
> ! +- 'Union +- Union
> ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS 
> kpi_04#4]
> ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2]
> ! : +- LocalRelation , [a#0] : +- SubqueryAlias test
> ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3] : +- LocalRelation , [a#0]
> ! +- SubqueryAlias test +- Project [kpi_04#3]
> ! +- LocalRelation , [a#0] +- Project 
> [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
> ! +- SubqueryAlias test
> ! +- LocalRelation , [a#0]
> {code}
>  
> In the source code (WidenSetOperationTypes.scala) this is intended behavior, 
> but it possibly misses this edge case. 
> I hope someone can help me fix it. 
>  
>  
> {code:java}
> if (targetTypes.nonEmpty) {
>  // Add an extra Project if the targetTypes are different from the original 
> types.
>  children.map(widenTypes(_, targetTypes))
> } else {
>  // Unable to find a target type to widen, then just return the original set.
>  children
> }{code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32598) Not able to see driver logs in spark history server in standalone mode

2020-08-13 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177429#comment-17177429
 ] 

Lantao Jin commented on SPARK-32598:


[~sriramgr] A pull request is welcome. Please target the master branch. Thanks.

> Not able to see driver logs in spark history server in standalone mode
> --
>
> Key: SPARK-32598
> URL: https://issues.apache.org/jira/browse/SPARK-32598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Sriram Ganesh
>Priority: Major
> Attachments: image-2020-08-12-11-50-01-899.png
>
>
> Driver logs do not appear in the history server in Spark standalone mode. I 
> checked the Spark event logs and they are not there. Is this by design, or can I 
> fix it by creating a patch? I could not find any proper documentation 
> regarding this.
>  
> !image-2020-08-12-11-50-01-899.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32598) See driver logs in Spark history server in standalone mode

2020-08-12 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176186#comment-17176186
 ] 

Lantao Jin commented on SPARK-32598:


Does this problem exist in Spark 3.0? I think branch-2.3 is no longer maintained.

> See driver logs in Spark history server in standalone mode
> --
>
> Key: SPARK-32598
> URL: https://issues.apache.org/jira/browse/SPARK-32598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sriram Ganesh
>Priority: Major
> Attachments: image-2020-08-12-11-50-01-899.png
>
>
> Driver logs do not appear in the history server in Spark standalone mode. I 
> checked the Spark event logs and they are not there. Is this by design, or can I 
> fix it by creating a patch? I could not find any proper documentation 
> regarding this.
>  
> !image-2020-08-12-11-50-01-899.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-11 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175981#comment-17175981
 ] 

Lantao Jin edited comment on SPARK-32582 at 8/12/20, 4:31 AM:
--

{quote}
 I remember I investigated this issue and the Hadoop API itself lists in batch. 
A streaming way of listing isn't possible.
{quote}

Yes, we could list the file status of just one partition if it is a partitioned table. For a 
non-partitioned table, it still lists all files. We assume that having too many files in a 
non-partitioned table is a bad design in a data warehouse.

{quote}
We can add one more mode "INFER_WITH_SAMPLE".
{quote}

I am not sure it would be helpful, since there is no Hadoop API to list only some of 
the files in a folder.
But if you mean listing all files and then picking some of them, yes, it might help 
with schema merging (see the sketch below). However, I thought the performance 
bottleneck is listing files in a table with a massive number of partitions.
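
If sampling were applied after the listing, a minimal sketch might look like this; the 
helper name and sample size are hypothetical, and the listing cost itself is unchanged:
{code}
// Hypothetical helper: read footers from only a few of the already-listed files
// instead of every file. This does not avoid the cost of listing the files.
import org.apache.hadoop.fs.FileStatus
import scala.util.Random

def sampleForSchemaInference(files: Seq[FileStatus], sampleSize: Int = 10): Seq[FileStatus] =
  if (files.length <= sampleSize) files else Random.shuffle(files).take(sampleSize)
{code}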


was (Author: cltlfcjin):
{quote}
 I remember I investigated this issue and the Hadoop API itself lists in batch. 
A streaming way of listing isn't possible.
{quote}

Yes, we could list the file status of just one partition if it is a partitioned table. For a 
non-partitioned table, it still lists all files. We assume that having too many files in a 
non-partitioned table is a bad design in a data warehouse.

{quote}
We can add one more mode "INFER_WITH_SAMPLE".
{quote}

I am not sure it would be helpful, since there is no Hadoop API to list only some of 
the files in a folder.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When infer schema is enabled, it tries to list all the files in the table, 
> although only one of the files is used to read the schema information. 
> Performance is impacted by listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88];,
>  all the files in the table are input, however only one of the file's schema 
> is used to infer schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-11 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175981#comment-17175981
 ] 

Lantao Jin commented on SPARK-32582:


{quote}
 I remember I investigated this issue and the Hadoop API itself lists in batch. 
A streaming way of listing isn't possible.
{quote}

Yes, we could list the file status of just one partition if it is a partitioned table. For a 
non-partitioned table, it still lists all files. We assume that having too many files in a 
non-partitioned table is a bad design in a data warehouse.

{quote}
We can add one more mode "INFER_WITH_SAMPLE".
{quote}

I am not sure it would be helpful, since there is no Hadoop API to list only some of 
the files in a folder.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When infer schema is enabled, it tries to list all the files in the table, 
> although only one of the files is used to read the schema information. 
> Performance is impacted by listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88];,
>  all the files in the table are input, however only one of the file's schema 
> is used to infer schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175203#comment-17175203
 ] 

Lantao Jin commented on SPARK-32582:


Maybe we could offer a new interface that breaks out after one iteration when 
mergeSchema is false. I am not sure.
{code}
  def inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  f: (FileIndex) => Seq[FileStatus]): Option[StructType]
{code}

Do you already have a fix? A PR is welcome.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When infer schema is enabled, it tries to list all the files in the table, 
> although only one of the files is used to read the schema information. 
> Performance is impacted by listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88];,
>  all the files in the table are input, however only one of the file's schema 
> is used to infer schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175197#comment-17175197
 ] 

Lantao Jin commented on SPARK-32582:


I see. The implementation of the {{inferSchema}} method depends on the underlying 
file format. Even for ORC, we still need all the files, since the given ORC files can 
have different schemas and we want a merged schema.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When infer schema is enabled, it tries to list all the files in the table, 
> although only one of the files is used to read the schema information. 
> Performance is impacted by listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88];,
>  all the files in the table are input, however only one of the file's schema 
> is used to infer schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175143#comment-17175143
 ] 

Lantao Jin commented on SPARK-32582:


{code}
files.toIterator.map(file => readSchema(file.getPath, conf, 
ignoreCorruptFiles)).collectFirst
{code}
{{collectFirst()}} breaks out as soon as its iterator finds a matching value.
{code}
  def collectFirst[B](pf: PartialFunction[A, B]): Option[B] = {
for (x <- self.toIterator) { // make sure to use an iterator or `seq`
  if (pf isDefinedAt x)
return Some(pf(x))
}
None
  }
{code}
So in most cases, it reads only one file.
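
A small standalone demonstration (not Spark code) of that early-exit behaviour:
{code}
// collectFirst stops evaluating as soon as the partial function matches,
// so only the first few elements of the iterator are ever touched.
val visited = scala.collection.mutable.ArrayBuffer.empty[Int]
val first = (1 to 10).iterator
  .map { i => visited += i; i }
  .collectFirst { case i if i >= 3 => i }

println(first)   // Some(3)
println(visited) // ArrayBuffer(1, 2, 3)
{code}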

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When infer schema is enabled, it tries to list all the files in the table, 
> although only one of the files is used to read the schema information. 
> Performance is impacted by listing all the files in the table when the 
> number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88];,
>  all the files in the table are input, however only one of the file's schema 
> is used to infer schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-09 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174060#comment-17174060
 ] 

Lantao Jin commented on SPARK-32536:


{quote}
I found that I mistook the issue condition. I reproduced it by deleting the HDFS 
partition directory of the target table, but not dropping the partition, and then I got 
the error.
{quote}
Did you reproduce it with the official Spark 3.0?
It looks like there is no `Hive.cleanUpOneDirectoryForReplace()` method in Hive 
2.3.7, which is used in Spark 3.0, either.

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when executing an insert overwrite table statement with dynamic partitions:
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location exists 
> before deleting it,
> while Hive can successfully execute the same SQL.
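
For illustration only (this is not the actual Hive/Spark code path involved), a hedged 
sketch of guarding the cleanup with an existence check so that a missing partition 
directory is not treated as an error:
{code}
// Hedged sketch: skip the cleanup when the old partition directory no longer exists,
// instead of failing with FileNotFoundException.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanUpIfExists(dir: Path, conf: Configuration): Unit = {
  val fs: FileSystem = dir.getFileSystem(conf)
  if (fs.exists(dir)) {
    fs.listStatus(dir).foreach(status => fs.delete(status.getPath, true))
  }
}
{code}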



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-05 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171393#comment-17171393
 ] 

Lantao Jin commented on SPARK-32536:


{{org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace}} seems to have been 
added in Hive 4.0. Could you reproduce it with the official Spark 3.0 (download from 
https://www.apache.org/dyn/closer.lua/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz)?

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when executing an insert overwrite table statement with dynamic partitions:
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location exists 
> before deleting it,
> while Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-05 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171385#comment-17171385
 ] 

Lantao Jin commented on SPARK-32536:


Thanks for reporting this. Which Hive version do you use? I can't find the 
method {{org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace}} in 
your stack trace in either hive-exec-1.2 or hive-exec-2.3.7.

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when executing an insert overwrite table statement with dynamic partitions:
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location exists 
> before deleting it,
> while Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32537) Add a hint-specific suite for CTE for test coverage

2020-08-05 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32537:
---
Priority: Major  (was: Minor)

> Add a hint-specific suite for CTE for test coverage
> ---
>
> Key: SPARK-32537
> URL: https://issues.apache.org/jira/browse/SPARK-32537
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> This ticket is to address the below comments to help us understand the test 
> coverage of SQL HINT for CTE.
> https://github.com/apache/spark/pull/29062#discussion_r463247491
> https://github.com/apache/spark/pull/29062#discussion_r463248167



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32537) Add a hint-specific suite for CTE for test coverage

2020-08-05 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32537:
--

 Summary: Add a hint-specific suite for CTE for test coverage
 Key: SPARK-32537
 URL: https://issues.apache.org/jira/browse/SPARK-32537
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Lantao Jin


This ticket is to address the below comments to help us understand the test 
coverage of SQL HINT for CTE.
https://github.com/apache/spark/pull/29062#discussion_r463247491
https://github.com/apache/spark/pull/29062#discussion_r463248167



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause

2020-08-05 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-32535.

Resolution: Duplicate

> Query with broadcast hints fail when query has a WITH clause
> 
>
> Key: SPARK-32535
> URL: https://issues.apache.org/jira/browse/SPARK-32535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arvind Krishnan
>Priority: Major
>
> If a query has a WITH clause and a query hint (like `BROADCAST`), the query 
> fails
> In the below code sample, executing `sql2` fails, but `sql1` passes.
> {code:java}
> import spark.implicits._
> val df = List(
>   ("1", "B", "C"),
>   ("A", "2", "C"),
>   ("A", "B", "3")
> ).toDF("COL_A", "COL_B", "COL_C")
> df.createOrReplaceTempView("table1")
> val df1 = List(
>   ("A", "2", "3"),
>   ("1", "B", "3"),
>   ("1", "2", "C")
> ).toDF("COL_A", "COL_B", "COL_C")
> df1.createOrReplaceTempView("table2")
> val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join 
> table2 b on a.COL_A = b.COL_A"
> val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner 
> join table2 b on a.COL_A = b.COL_A) select X.COL_A from X"
> val df2 = spark.sql(sql2)
> println(s"Row Count ${df2.count()}")
> println("Rows... ")
> df2.show(false)
> {code}
>  
> I tried executing this sample program with Spark 2.4.0, and both SQL 
> statements worked



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause

2020-08-05 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171300#comment-17171300
 ] 

Lantao Jin commented on SPARK-32535:


I think this issue has been fixed by SPARK-32237. I cannot reproduce it in master.

> Query with broadcast hints fail when query has a WITH clause
> 
>
> Key: SPARK-32535
> URL: https://issues.apache.org/jira/browse/SPARK-32535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Arvind Krishnan
>Priority: Major
>
> If a query has a WITH clause and a query hint (like `BROADCAST`), the query 
> fails
> In the below code sample, executing `sql2` fails, but `sql1` passes.
> {code:java}
> import spark.implicits._
> val df = List(
>   ("1", "B", "C"),
>   ("A", "2", "C"),
>   ("A", "B", "3")
> ).toDF("COL_A", "COL_B", "COL_C")
> df.createOrReplaceTempView("table1")
> val df1 = List(
>   ("A", "2", "3"),
>   ("1", "B", "3"),
>   ("1", "2", "C")
> ).toDF("COL_A", "COL_B", "COL_C")
> df1.createOrReplaceTempView("table2")
> val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join 
> table2 b on a.COL_A = b.COL_A"
> val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner 
> join table2 b on a.COL_A = b.COL_A) select X.COL_A from X"
> val df2 = spark.sql(sql2)
> println(s"Row Count ${df2.count()}")
> println("Rows... ")
> df2.show(false)
> {code}
>  
> I tried executing this sample program with Spark 2.4.0, and both SQL 
> statements worked



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32362:
---
Description: 
{code}
QueryTest.sameRows(result.toSeq, df.collect().toSeq)
{code}
Even when the results are different, the test does not fail.
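
A minimal sketch of asserting on the return value instead of discarding it, assuming 
{{sameRows}} returns an Option carrying a mismatch description and that {{result}} and 
{{df}} are the values from the snippet above:
{code}
// Hedged sketch: fail the test when sameRows reports a mismatch rather than
// ignoring its return value.
import org.apache.spark.sql.QueryTest
import org.scalatest.Assertions.fail

QueryTest.sameRows(result.toSeq, df.collect().toSeq).foreach { mismatch =>
  fail(s"Adaptive execution returned different results:\n$mismatch")
}
{code}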

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> QueryTest.sameRows(result.toSeq, df.collect().toSeq)
> {code}
> Even when the results are different, the test does not fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32362:
---
Summary: AdaptiveQueryExecSuite misses verifying AE results  (was: 
AdaptiveQueryExecSuite has problem)

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32362) AdaptiveQueryExecSuite has problem

2020-07-19 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32362:
--

 Summary: AdaptiveQueryExecSuite has problem
 Key: SPARK-32362
 URL: https://issues.apache.org/jira/browse/SPARK-32362
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160859#comment-17160859
 ] 

Lantao Jin commented on SPARK-32347:


Duplicate of SPARK-32237.

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code in Spark 
> 2.4.3 and in 3.0.0.
> The SQL parser raises a misleading error message in 3.0.0, although 
> everything seems to be OK with the SQL statement and it works fine in Spark 
> 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32283) Multiple Kryo registrators can't be used anymore

2020-07-15 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158018#comment-17158018
 ] 

Lantao Jin commented on SPARK-32283:


Thanks for reporting this. Will file a patch.

> Multiple Kryo registrators can't be used anymore
> 
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lorenz Bühmann
>Priority: Minor
>
> This is a regression in Spark 3.0, as it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo 
> registrators via Spark config option spark.kryo.registrator . 
> In Spark 3.0 the code to parse Kryo config options has been refactored into 
> Scala class 
> [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
>  The code to parse the registrators is in [Line 
> 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
> .version("0.5.0")
> .stringConf
> .toSequence
> .createOptional
> {code}
>  to split the comma-separated list.
> In previous Spark 2.x it was done differently directly in [KryoSerializer 
> Line 
> 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
> .split(',').map(_.trim)
> .filter(!_.isEmpty)
> {code}
> Hope this helps.
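
For context, a minimal sketch of the user-side configuration that regressed (the 
registrator class names are illustrative placeholders):
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Comma-separated list that Spark 2.x split, but Spark 3.0.0 passes through as-is.
  .set("spark.kryo.registrator",
    "com.example.FirstKryoRegistrator,com.example.SecondKryoRegistrator")
{code}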



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32237) Cannot resolve column when put hint in the views of common table expression

2020-07-09 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155085#comment-17155085
 ] 

Lantao Jin edited comment on SPARK-32237 at 7/10/20, 4:21 AM:
--

Thanks for reporting this. I am going to fix it.


was (Author: cltlfcjin):
Thanks to report this. I am going to fix that.

> Cannot resolve column when put hint in the views of common table expression
> ---
>
> Key: SPARK-32237
> URL: https://issues.apache.org/jira/browse/SPARK-32237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Hadoop-2.7.7
> Hive-2.3.6
> Spark-3.0.0
>Reporter: Kernel Force
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Suppose we have a table:
> {code:sql}
> CREATE TABLE DEMO_DATA (
>   ID VARCHAR(10),
>   NAME VARCHAR(10),
>   BATCH VARCHAR(10),
>   TEAM VARCHAR(1)
> ) STORED AS PARQUET;
> {code}
> and some data in it:
> {code:sql}
> 0: jdbc:hive2://HOSTNAME:1> SELECT T.* FROM DEMO_DATA T;
> +---+-+-+-+
> | t.id  | t.name  |   t.batch   | t.team  |
> +---+-+-+-+
> | 1 | mike| 2020-07-08  | A   |
> | 2 | john| 2020-07-07  | B   |
> | 3 | rose| 2020-07-06  | B   |
> | |
> +---+-+-+-+
> {code}
> If I put query hint in va or vb and run it in spark-shell:
> {code:sql}
> sql("""
> WITH VA AS
>  (SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
> FROM DEMO_DATA T WHERE T.TEAM = 'A'),
> VB AS
>  (SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM
> FROM VA T)
> SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
>   FROM VB T
> """).show
> {code}
> In Spark-2.4.4 it works fine.
>  But in Spark-3.0.0, it throws AnalysisException with Unrecognized hint 
> warning:
> {code:scala}
> 20/07/09 13:51:14 WARN analysis.HintErrorLogger: Unrecognized hint: 
> REPARTITION(3)
> org.apache.spark.sql.AnalysisException: cannot resolve '`T.ID`' given input 
> columns: [T.BATCH, T.ID, T.NAME, T.TEAM]; line 8 pos 7;
> 'Project ['T.ID, 'T.NAME, 'T.BATCH, 'T.TEAM]
> +- SubqueryAlias T
>+- SubqueryAlias VB
>   +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>  +- SubqueryAlias T
> +- SubqueryAlias VA
>+- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>   +- Filter (TEAM#3 = A)
>  +- SubqueryAlias T
> +- SubqueryAlias spark_catalog.default.demo_data
>+- Relation[ID#0,NAME#1,BATCH#2,TEAM#3] parquet
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:106)
>   at 
> 

[jira] [Commented] (SPARK-32237) Cannot resolve column when put hint in the views of common table expression

2020-07-09 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155085#comment-17155085
 ] 

Lantao Jin commented on SPARK-32237:


Thanks for reporting this. I am going to fix that.

> Cannot resolve column when put hint in the views of common table expression
> ---
>
> Key: SPARK-32237
> URL: https://issues.apache.org/jira/browse/SPARK-32237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Hadoop-2.7.7
> Hive-2.3.6
> Spark-3.0.0
>Reporter: Kernel Force
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Suppose we have a table:
> {code:sql}
> CREATE TABLE DEMO_DATA (
>   ID VARCHAR(10),
>   NAME VARCHAR(10),
>   BATCH VARCHAR(10),
>   TEAM VARCHAR(1)
> ) STORED AS PARQUET;
> {code}
> and some data in it:
> {code:sql}
> 0: jdbc:hive2://HOSTNAME:1> SELECT T.* FROM DEMO_DATA T;
> +---+-+-+-+
> | t.id  | t.name  |   t.batch   | t.team  |
> +---+-+-+-+
> | 1 | mike| 2020-07-08  | A   |
> | 2 | john| 2020-07-07  | B   |
> | 3 | rose| 2020-07-06  | B   |
> | |
> +---+-+-+-+
> {code}
> If I put query hint in va or vb and run it in spark-shell:
> {code:sql}
> sql("""
> WITH VA AS
>  (SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
> FROM DEMO_DATA T WHERE T.TEAM = 'A'),
> VB AS
>  (SELECT /*+ REPARTITION(3) */ T.ID, T.NAME, T.BATCH, T.TEAM
> FROM VA T)
> SELECT T.ID, T.NAME, T.BATCH, T.TEAM 
>   FROM VB T
> """).show
> {code}
> In Spark-2.4.4 it works fine.
>  But in Spark-3.0.0, it throws AnalysisException with Unrecognized hint 
> warning:
> {code:scala}
> 20/07/09 13:51:14 WARN analysis.HintErrorLogger: Unrecognized hint: 
> REPARTITION(3)
> org.apache.spark.sql.AnalysisException: cannot resolve '`T.ID`' given input 
> columns: [T.BATCH, T.ID, T.NAME, T.TEAM]; line 8 pos 7;
> 'Project ['T.ID, 'T.NAME, 'T.BATCH, 'T.TEAM]
> +- SubqueryAlias T
>+- SubqueryAlias VB
>   +- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>  +- SubqueryAlias T
> +- SubqueryAlias VA
>+- Project [ID#0, NAME#1, BATCH#2, TEAM#3]
>   +- Filter (TEAM#3 = A)
>  +- SubqueryAlias T
> +- SubqueryAlias spark_catalog.default.demo_data
>+- Relation[ID#0,NAME#1,BATCH#2,TEAM#3] parquet
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:129)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:134)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:139)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:140)
>   at 
> 

[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2020-07-07 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738
 ] 

Lantao Jin edited comment on SPARK-29038 at 7/7/20, 1:14 PM:
-

Hi [~AidenZhang], our focus for MV in recent months has been on two areas. One is 
optimization of the rewrite algorithm, such as forbidding count-distinct post 
aggregation and avoiding unnecessary rewrites when doing relation replacement. The 
other is bug fixes in MV refresh: we use a Spark listener to deliver the metastore 
events that trigger a refresh. Some parts depend on third-party systems, so maybe 
only the interfaces can be made available in community Spark. I haven't done the 
partial/incremental refresh since it's not a blocker for us. I am not sure whether 
the community is still interested in this feature, but we are moving the existing 
implementation to Spark 3.0 now.


was (Author: cltlfcjin):
Hi [~AidenZhang], my focusings of MV in recent months are two parts. One is the 
rewrite algothim optimization. Such as forbidding count distict post 
aggregation, avoid unnecessary rewrite when do relation replacement. Another is 
bugfix in MV refresh. Use a Spark listener to deliver the metastore events to 
refresh. Some parts depends on third part system. So maybe only interfaces are 
available in community Spark. I don't do the partial/incremental refresh since 
it's not a blocker for us. I am not sure the community are still interested the 
feature, but we are moving existing implementation to Spark3.0 now.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2020-07-07 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738
 ] 

Lantao Jin commented on SPARK-29038:


Hi [~AidenZhang], my focus for MV in recent months has been on two areas. One is 
optimization of the rewrite algorithm, such as forbidding count-distinct post 
aggregation and avoiding unnecessary rewrites when doing relation replacement. The 
other is bug fixes in MV refresh: we use a Spark listener to deliver the metastore 
events that trigger a refresh. Some parts depend on third-party systems, so maybe 
only the interfaces can be made available in community Spark. I haven't done the 
partial/incremental refresh since it's not a blocker for us. I am not sure whether 
the community is still interested in this feature, but we are moving the existing 
implementation to Spark 3.0 now.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32201) More general skew join pattern matching

2020-07-06 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32201:
---
Summary: More general skew join pattern matching  (was: More general skew 
Join pattern matching)

> More general skew join pattern matching
> ---
>
> Key: SPARK-32201
> URL: https://issues.apache.org/jira/browse/SPARK-32201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently the AQE skew join handling logic is very specific.
> It can only handle a pattern like this:
> {code}
> SMJ
>   Sort
>     Shuffle
>   Sort
>     Shuffle
> {code}
> We propose a more general skew join pattern matching approach.
> With this patch, we can handle three-table joins, joins with aggregation, and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32201) More general skew Join pattern matching

2020-07-06 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32201:
--

 Summary: More general skew Join pattern matching
 Key: SPARK-32201
 URL: https://issues.apache.org/jira/browse/SPARK-32201
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


Currently the AQE skew join handling logic is very specific.
It can only handle a pattern like this:
{code}
SMJ
  Sort
    Shuffle
  Sort
    Shuffle
{code}

We propose a more general skew join pattern matching approach.
With this patch, we can handle three-table joins, joins with aggregation, and so on.
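
As an illustration, a hedged sketch (table and column names are hypothetical) of a 
query shape the proposed matching should cover, a three-table join followed by an 
aggregation, where skew can appear under more than one sort-merge join:
{code:scala}
val wide = spark.table("t1")
  .join(spark.table("t2"), "key")   // first SMJ, potentially skewed on "key"
  .join(spark.table("t3"), "key")   // second SMJ stacked on the first
  .groupBy("key")
  .count()
{code}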



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32147) Spark: PartitionBy changing the columns value

2020-07-01 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149403#comment-17149403
 ] 

Lantao Jin commented on SPARK-32147:


Setting spark.sql.sources.partitionColumnTypeInference.enabled to false will print 
the right values.
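
A short sketch of that workaround in spark-shell (the path matches the example 
quoted below):
{code:scala}
// Keep partition values as strings instead of inferring numeric types.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("tmp_parquet").show(false)
{code}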

> Spark: PartitionBy changing the columns value 
> --
>
> Key: SPARK-32147
> URL: https://issues.apache.org/jira/browse/SPARK-32147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0
>Reporter: Shankar Koirala
>Priority: Major
>  Labels: spark
>
> While saving a DataFrame as Parquet or CSV, partitionBy columns whose values 
> contain numbers ending with 'f' or 'd' have their values changed.
> Below is the example 
> {code:java}
> scala> val df = Seq(
>  | ("9q", 1),
>  | ("3k", 2),
>  | ("6f", 3),
>  | ("7f", 4),
>  | ("7d", 5)
>  | ).toDF("value", "id")
> df: org.apache.spark.sql.DataFrame = [value: string, id: int]
> scala> df.show(false)
> +-+---+
> |value|id |
> +-+---+
> |  9q | 1 |
> |  3k | 2 |
> |  6f | 3 |
> |  7f | 4 |
> |  7d | 5 |
> +-+---+
> scala> 
> df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet("tmp_parquet")
> scala> spark.read.parquet("tmp_parquet").show(false)
> +---+-+
> |id |value|
> +---+-+
> |5  | 7.0 |
> |3  | 6.0 |
> |2  | 3k  |
> |4  | 7.0 |
> |1  | 9q  |
> +---+-+
> {code}
> The same happens with other formats too. Is this a bug or is it expected behavior?
> Taken from 
> [SO|https://stackoverflow.com/questions/62671684/spark-incorrectly-intepret-partition-name-ending-with-d-or-f-as-number-when]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
In handling a skewed SortMergeJoin, when the matching partitions from the left side 
and the right side are both skewed and are split into too many parts, the plan 
generation may take a very long time.

Even when falling back to a normal SMJ, the query cannot succeed either, so we 
should fail this query fast.

In the logs below we can see that it took over an hour to generate the plan in AQE 
when handling a skewed join that produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.

In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In handling a skewed SortMergeJoin, when the matching partitions from the left 
> side and the right side are both skewed and are split into too many parts, the 
> plan generation may take a very long time.
> Even when falling back to a normal SMJ, the query cannot succeed either, so we 
> should fail this query fast.
> In the logs below we can see that it took over an hour to generate the plan in 
> AQE when handling a skewed join that produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.

In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.


 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In handling skewed SortMergeJoin, when matching partitions from the left side 
> and the right side both have skew, the plan generation may take a very long 
> time.
> Even fallback to normal SMJ, the query cannot success either. So we should 
> fast fail this query.
> In below logs we can see that it took over 1 hour to generate the plan in AQE 
> when handle a skewed join which produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149084#comment-17149084
 ] 

Lantao Jin commented on SPARK-32143:


A PR will be submitted soon.
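
Until the fast-fail check is in place, a hedged mitigation sketch is to raise the 
AQE skew thresholds so far fewer splits are produced; the values below are 
illustrative assumptions, not recommendations:
{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// Larger factor/threshold => fewer partitions are classified as skewed.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "20")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "1g")
// Larger advisory size => each skewed partition is split into fewer parts.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512m")
{code}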

> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In handling skewed SortMergeJoin, when matching partitions from the left side 
> and the right side both have skew, the the plan generation may take a very 
> long time.
> Even fallback to normal SMJ, the query cannot success either. So we should 
> fast fail this query.
>  In below logs we can see that it took over 1 hour to generate the plan in 
> AQE when handle a skewed join which produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32143:
---
Description: 
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time.

Even fallback to normal SMJ, the query cannot success either. So we should fast 
fail this query.


 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}

  was:
In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time. Actually, this query cannot success.
 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}


> Fast fail when the AQE skew join produce too many splits
> 
>
> Key: SPARK-32143
> URL: https://issues.apache.org/jira/browse/SPARK-32143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In handling skewed SortMergeJoin, when matching partitions from the left side 
> and the right side both have skew, the the plan generation may take a very 
> long time.
> Even fallback to normal SMJ, the query cannot success either. So we should 
> fast fail this query.
>  In below logs we can see that it took over 1 hour to generate the plan in 
> AQE when handle a skewed join which produced too many splits.
> {quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
> Thread-821384] adaptive.OptimizeSkewedJoin:54 :
>  20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, 
> split it into *39150* parts.
>  20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
> split it into *17022* parts.
>  20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, 
> split it into 17 parts.
> ...
>  20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
> adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32143) Fast fail when the AQE skew join produce too many splits

2020-06-30 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32143:
--

 Summary: Fast fail when the AQE skew join produce too many splits
 Key: SPARK-32143
 URL: https://issues.apache.org/jira/browse/SPARK-32143
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


In handling skewed SortMergeJoin, when matching partitions from the left side 
and the right side both have skew, the the plan generation may take a very long 
time. Actually, this query cannot success.
 In below logs we can see that it took over 1 hour to generate the plan in AQE 
when handle a skewed join which produced too many splits.
{quote}20/06/30 *12:31:26,271* INFO [HiveServer2-Background-Pool: 
Thread-821384] adaptive.OptimizeSkewedJoin:54 :
 20/06/30 12:31:26,299 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Left side partition 1 (3 TB) is skewed, split 
it into *39150* parts.
 20/06/30 12:31:26,315 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 1 (11 TB) is skewed, 
split it into *17022* parts.
 20/06/30 12:32:24,952 INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.OptimizeSkewedJoin:54 : Right side partition 8 (1 GB) is skewed, split 
it into 17 parts.

...
 20/06/30 *13:27:25,158* INFO [HiveServer2-Background-Pool: Thread-821384] 
adaptive.AdaptiveSparkPlanExec:54 : Final plan: CollectLimit 1000
{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32129) Support AQE skew join with Union

2020-06-29 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32129:
---
Description: 
Currently, the AQE skew join only supports a two-table join such as
{code}
SMJ
:-Sort
:+-Shuffle
+-Sort
 +-Shuffle
{code}
But if the plan contains a Union, the skew join handling does not work:
{code}
Union
:-SMJ
:   :-Sort
:   : +-Shuffle
:   +-Sort
: +-Shuffle
+-SMJ
:   :-Sort
:   : +-Shuffle
:   +-Sort
 +-Shuffle
{code}


  was:
Current, the AQE skew join only supports two tables join such as
{code}
SMJ
:-Sort
:+-Shuffle
+-Sort
 +-Shuffle
{code}
But if the plan contains a Union, the skew join handling not work:
{code}
Union
:-SMJ
:   :-Sort
:   : +-Shuffle
:  +-Sort
: +-Shuffle
+-SMJ
:   :-Sort
:   : +-Shuffle
:  +-Sort
 +-Shuffle
{code}



> Support AQE skew join with Union
> 
>
> Key: SPARK-32129
> URL: https://issues.apache.org/jira/browse/SPARK-32129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently, the AQE skew join only supports a two-table join such as
> {code}
> SMJ
> :-Sort
> :+-Shuffle
> +-Sort
>  +-Shuffle
> {code}
> But if the plan contains a Union, the skew join handling does not work:
> {code}
> Union
> :-SMJ
> :   :-Sort
> :   : +-Shuffle
> :   +-Sort
> : +-Shuffle
> +-SMJ
> :   :-Sort
> :   : +-Shuffle
> :   +-Sort
>  +-Shuffle
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32129) Support AQE skew join with Union

2020-06-29 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32129:
---
Description: 
Currently, the AQE skew join only supports a two-table join such as
{code}
SMJ
:-Sort
:+-Shuffle
+-Sort
 +-Shuffle
{code}
But if the plan contains a Union, the skew join handling does not work:
{code}
Union
:-SMJ
:   :-Sort
:   :+-Shuffle
:   +-Sort
:+-Shuffle
+-SMJ
:   :-Sort
:   :+-Shuffle
:   +-Sort
 +-Shuffle
{code}


  was:
Current, the AQE skew join only supports two tables join such as
{code}
SMJ
:-Sort
:+-Shuffle
+-Sort
 +-Shuffle
{code}
But if the plan contains a Union, the skew join handling not work:
{code}
Union
:-SMJ
:   :-Sort
:   : +-Shuffle
:   +-Sort
: +-Shuffle
+-SMJ
:   :-Sort
:   : +-Shuffle
:   +-Sort
 +-Shuffle
{code}



> Support AQE skew join with Union
> 
>
> Key: SPARK-32129
> URL: https://issues.apache.org/jira/browse/SPARK-32129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Currently, the AQE skew join only supports a two-table join such as
> {code}
> SMJ
> :-Sort
> :+-Shuffle
> +-Sort
>  +-Shuffle
> {code}
> But if the plan contains a Union, the skew join handling does not work:
> {code}
> Union
> :-SMJ
> :   :-Sort
> :   :+-Shuffle
> :   +-Sort
> :+-Shuffle
> +-SMJ
> :   :-Sort
> :   :+-Shuffle
> :   +-Sort
>  +-Shuffle
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32129) Support AQE skew join with Union

2020-06-29 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32129:
--

 Summary: Support AQE skew join with Union
 Key: SPARK-32129
 URL: https://issues.apache.org/jira/browse/SPARK-32129
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


Currently, the AQE skew join only supports a two-table join such as
{code}
SMJ
:-Sort
:+-Shuffle
+-Sort
 +-Shuffle
{code}
But if the plan contains a Union, the skew join handling does not work:
{code}
Union
:-SMJ
:   :-Sort
:   : +-Shuffle
:  +-Sort
: +-Shuffle
+-SMJ
:   :-Sort
:   : +-Shuffle
:  +-Sort
 +-Shuffle
{code}
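
A hedged sketch (tables are hypothetical and assumed to share "key" and "value" 
columns) of a query that produces the Union-over-two-sort-merge-joins shape above:
{code:scala}
val first  = spark.table("t1").join(spark.table("t2"), "key").select("key", "value")
val second = spark.table("t3").join(spark.table("t4"), "key").select("key", "value")
// Union of two SMJs: today neither child join is considered for skew handling.
val unioned = first.union(second)
{code}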




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog

2020-06-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32118:
--

 Summary: Use fine-grained read write lock for each database in 
HiveExternalCatalog
 Key: SPARK-32118
 URL: https://issues.apache.org/jira/browse/SPARK-32118
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


In HiveExternalCatalog, all metastore operations are synchronized on the same 
object lock. In a heavy-traffic Spark Thrift Server or Spark driver, users' 
queries may be stuck behind any long-running operation. For example, if a user is 
accessing a table that contains a massive number of partitions, the 
loadDynamicPartitions() operation holds the object lock for a long time and all 
other queries block waiting for the lock.
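
A minimal sketch of the idea, assuming one ReentrantReadWriteLock per database kept 
in a concurrent map (names are illustrative, not the actual patch):
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantReadWriteLock

object DbLocks {
  private val locks = new ConcurrentHashMap[String, ReentrantReadWriteLock]()

  // Only operations on the same database contend; other databases proceed freely.
  def withWriteLock[T](db: String)(body: => T): T = {
    val lock = locks.computeIfAbsent(db, _ => new ReentrantReadWriteLock()).writeLock()
    lock.lock()
    try body finally lock.unlock()
  }
}
{code}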



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-32117.

Resolution: Won't Fix

> Thread spark-listener-group-streams is cpu costing
> --
>
> Key: SPARK-32117
> URL: https://issues.apache.org/jira/browse/SPARK-32117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> In a busy driver (OLAP), the thread spark-listener-group-streams consumes 
> significant CPU even in a non-streaming application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147278#comment-17147278
 ] 

Lantao Jin commented on SPARK-32117:


I think it might be fixed by SPARK-29423

> Thread spark-listener-group-streams is cpu costing
> --
>
> Key: SPARK-32117
> URL: https://issues.apache.org/jira/browse/SPARK-32117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> In a busy driver (OLAP), the thread spark-listener-group-streams consumes 
> significant CPU even in a non-streaming application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32117:
--

 Summary: Thread spark-listener-group-streams is cpu costing
 Key: SPARK-32117
 URL: https://issues.apache.org/jira/browse/SPARK-32117
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Lantao Jin


In a busy driver (OLAP), the thread spark-listener-group-streams consumes 
significant CPU even in a non-streaming application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32108) Silent mode of spark-sql is broken

2020-06-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147264#comment-17147264
 ] 

Lantao Jin commented on SPARK-32108:


[~maxgekk] I think it works. The INFO logs are only printed while spark-sql is starting up.

> Silent mode of spark-sql is broken
> --
>
> Key: SPARK-32108
> URL: https://issues.apache.org/jira/browse/SPARK-32108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. I downloaded the recent Spark 3.0 release from 
> http://spark.apache.org/downloads.html
> 2. Run bin/spark-sql -S; it prints a lot of INFO
> {code}
> ➜  ~ ./spark-3.0/bin/spark-sql -S
> 20/06/26 20:43:38 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.hive.conf.HiveConf).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 20/06/26 20:43:39 INFO SharedState: spark.sql.warehouse.dir is not set, but 
> hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the 
> value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
> 20/06/26 20:43:39 INFO SharedState: Warehouse path is '/user/hive/warehouse'.
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created local directory: 
> /var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a/_tmp_space.db
> 20/06/26 20:43:39 INFO SparkContext: Running Spark version 3.0.0
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO ResourceUtils: Resources for spark.driver:
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO SparkContext: Submitted application: 
> SparkSQL::192.168.1.78
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(maximgekk); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(maximgekk); groups with modify permissions: Set()
> 20/06/26 20:43:39 INFO Utils: Successfully started service 'sparkDriver' on 
> port 59414.
> 20/06/26 20:43:39 INFO SparkEnv: Registering MapOutputTracker
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMaster
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 20/06/26 20:43:39 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/blockmgr-c1d041ad-dd46-4d11-bbd0-e8ba27d3bf69
> 20/06/26 20:43:39 INFO MemoryStore: MemoryStore started with capacity 408.9 
> MiB
> 20/06/26 20:43:39 INFO SparkEnv: Registering OutputCommitCoordinator
> 20/06/26 20:43:40 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 20/06/26 20:43:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://192.168.1.78:4040
> 20/06/26 20:43:40 INFO Executor: Starting executor ID driver on host 
> 192.168.1.78
> 20/06/26 20:43:40 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59415.
> 20/06/26 20:43:40 INFO NettyBlockTransferService: Server created on 
> 192.168.1.78:59415
> 20/06/26 20:43:40 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.1.78:59415 with 408.9 MiB RAM, BlockManagerId(driver, 192.168.1.78, 
> 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)

[jira] [Comment Edited] (SPARK-32063) Spark native temporary table

2020-06-24 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143575#comment-17143575
 ] 

Lantao Jin edited comment on SPARK-32063 at 6/24/20, 6:53 AM:
--

[~viirya] For 1, even though RDD cache or table cache can improve performance, I 
still think they have totally different scopes. Besides, we can also cache a 
temporary table in memory to get a further performance improvement. In production 
usage, I found that our data engineers and data scientists do not always remember 
to uncache cached tables or views. This situation becomes worse in the Spark 
Thrift Server (which shares a Spark driver). 

For 2, we found that when Adaptive Query Execution is enabled, complex views 
easily get stuck in the optimization step. Caching such a view doesn't help.

For 3, the scenario is our migration case: moving SQL from Teradata to Spark. 
Without temporary tables, TD users have to create permanent tables and drop them 
at the end of a script as an alternative to TD volatile tables; if the JDBC 
session is closed or the script fails before cleaning up, no mechanism guarantees 
that the intermediate data is dropped. If they use Spark temporary views, much of 
the logic doesn't work well. For example, they want to execute UPDATE/DELETE 
operations on intermediate tables, but we cannot convert a temporary view to a 
Delta table or Hudi table ...


was (Author: cltlfcjin):
For 1, even RDD cache or table cache can improve performance, but I still think 
they have totally different scopes. Besides, we also can cache a temporary 
table to memory to get more performance improvement. In production usage, I 
found our data engineers and data scientists do not always remember to uncached 
cached tables or views. This situation became worse in the Spark thrift-server 
(sharing Spark driver). 

For 2, we found when Adaptive Query Execution enabled, complex views are easily 
stuck in the optimization step. Cache this view couldn't help.

For 3, the scenario is in our migration case, move SQL from Teradata to Spark. 
Without the temporary table, TD users have to create permanent tables and drop 
them at the end of a script as an alternate of TD volatile table, if JDBC 
session closed or script failed before cleaning up, no mechanism guarantee to 
drop the intermediate data. If they use Spark temporary view, many logic 
couldn't work well. For example, they want to execute UPDATE/DELETE op on 
intermediate tables but we cannot convert a temporary view to Delta table or 
Hudi table ...

> Spark native temporary table
> 
>
> Key: SPARK-32063
> URL: https://issues.apache.org/jira/browse/SPARK-32063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose lifetime 
> is limited to the current session.
> In Spark, there is no temporary table; the DDL “CREATE TEMPORARY TABLE AS 
> SELECT” creates a temporary view. A temporary view is totally different 
> from a temporary table. 
> A temporary view is just a VIEW. It doesn't materialize data in storage, so 
> it has the following shortcomings:
>  # A view will not give improved performance. Materializing intermediate data in 
> temporary tables for a complex query will accelerate queries, especially in an 
> ETL pipeline.
>  # A view that calls other views can cause severe performance issues. Executing 
> a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines or 
> data warehouse applications, having no database prefix is not convenient, and 
> they need some tables which are only used in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32063) Spark native temporary table

2020-06-24 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143575#comment-17143575
 ] 

Lantao Jin commented on SPARK-32063:


For 1, even though RDD cache or table cache can improve performance, I still think 
they have totally different scopes. Besides, we can also cache a temporary 
table in memory to get a further performance improvement. In production usage, I 
found our data engineers and data scientists do not always remember to uncache 
cached tables or views. This situation gets worse in the Spark thrift-server 
(which shares one Spark driver). 

For 2, we found that when Adaptive Query Execution is enabled, complex views easily 
get stuck in the optimization step. Caching such a view doesn't help.

For 3, the scenario is our migration case: moving SQL from Teradata to Spark. 
Without temporary tables, TD users have to create permanent tables and drop 
them at the end of a script as an alternative to TD volatile tables; if the JDBC 
session closes or the script fails before cleanup, no mechanism guarantees that 
the intermediate data is dropped. If they use Spark temporary views, much of the 
logic doesn't work well. For example, they want to execute UPDATE/DELETE operations 
on intermediate tables, but we cannot convert a temporary view to a Delta table or 
a Hudi table ...
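
To make the third point concrete, here is a minimal sketch of the migration pattern, 
assuming a running SparkSession {{spark}} and a hypothetical source table {{orders}}; 
the CREATE TEMPORARY TABLE statement is the syntax proposed in the design doc, not an 
existing Spark feature:

{code}
// Today: a temporary view is only a named query plan, so row-level DML on it is impossible.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW stage_orders AS SELECT * FROM orders WHERE dt = '2020-06-24'")
// spark.sql("DELETE FROM stage_orders WHERE status = 'CANCELLED'")   // fails: a view is not a writable table

// Proposed: a session-scoped temporary table would materialize the data and be dropped
// automatically when the session ends, matching the Teradata volatile-table pattern.
// spark.sql("CREATE TEMPORARY TABLE stage_orders USING parquet AS SELECT * FROM orders WHERE dt = '2020-06-24'")
// spark.sql("DELETE FROM stage_orders WHERE status = 'CANCELLED'")
{code}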

> Spark native temporary table
> 
>
> Key: SPARK-32063
> URL: https://issues.apache.org/jira/browse/SPARK-32063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose lifetime is 
> limited to the current session.
> In Spark, there is no temporary table: the DDL “CREATE TEMPORARY TABLE AS 
> SELECT” creates a temporary view. A temporary view is totally different 
> from a temporary table. 
> A temporary view is just a VIEW. It doesn’t materialize data in storage, so 
> it has the following shortcomings:
>  # A view does not improve performance. Materializing the intermediate data of 
> a complex query in temporary tables accelerates queries, especially in an 
> ETL pipeline.
>  # A view that references other views can cause severe performance issues; 
> executing a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines or 
> data warehouse applications, the lack of a database prefix is inconvenient; 
> such applications need tables that are used only in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32065) Supporting analyze temporary table

2020-06-22 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32065:
--

 Summary: Supporting analyze temporary table
 Key: SPARK-32065
 URL: https://issues.apache.org/jira/browse/SPARK-32065
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lantao Jin


Supporting analyze temporary table
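
A hedged sketch of the intended usage, assuming a running SparkSession {{spark}}, a 
hypothetical permanent table {{sales}} with columns {{dt}} and {{amount}}, and that the 
temporary-table DDL from SPARK-32064 reuses the existing ANALYZE TABLE syntax:

{code}
// Existing syntax on a permanent table, which this sub-task would extend to temporary tables:
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS dt, amount")

// Hypothetical, once native temporary tables exist (SPARK-32063 / SPARK-32064):
// spark.sql("CREATE TEMPORARY TABLE tmp_sales AS SELECT * FROM sales WHERE dt = '2020-06-22'")
// spark.sql("ANALYZE TABLE tmp_sales COMPUTE STATISTICS")
{code}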



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32066) Supporting create temporary table LIKE

2020-06-22 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32066:
--

 Summary: Supporting create temporary table LIKE
 Key: SPARK-32066
 URL: https://issues.apache.org/jira/browse/SPARK-32066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lantao Jin


Supporting create temporary table LIKE
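
A hedged sketch, assuming a running SparkSession {{spark}} and a hypothetical source 
table {{sales}}; the TEMPORARY variant below is the proposed extension, not existing 
syntax:

{code}
// Existing CREATE TABLE ... LIKE on a permanent table:
spark.sql("CREATE TABLE sales_copy LIKE sales")

// Proposed extension in this sub-task:
// spark.sql("CREATE TEMPORARY TABLE tmp_sales LIKE sales")
{code}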



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32064) Supporting create temporary table

2020-06-22 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32064:
--

 Summary: Supporting create temporary table
 Key: SPARK-32064
 URL: https://issues.apache.org/jira/browse/SPARK-32064
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lantao Jin


The basic code to implement the Spark native temporary table. See SPARK-32063



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32063) Spark native temporary table

2020-06-22 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32063:
--

 Summary: Spark native temporary table
 Key: SPARK-32063
 URL: https://issues.apache.org/jira/browse/SPARK-32063
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lantao Jin


Many databases and data warehouse SQL engines support temporary tables. A 
temporary table, as its name implies, is a short-lived table whose lifetime is 
limited to the current session.

In Spark, there is no temporary table: the DDL “CREATE TEMPORARY TABLE AS 
SELECT” creates a temporary view. A temporary view is totally different from a 
temporary table. 

A temporary view is just a VIEW. It doesn’t materialize data in storage, so it 
has the following shortcomings:
 # A view does not improve performance. Materializing the intermediate data of a 
complex query in temporary tables accelerates queries, especially in an 
ETL pipeline.
 # A view that references other views can cause severe performance issues; 
executing a very complex view may even fail in Spark. 
 # A temporary view has no database namespace. In some complex ETL pipelines or 
data warehouse applications, the lack of a database prefix is inconvenient; 
such applications need tables that are used only in the current session.

 

More details are described in [Design 
Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]
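
For the first shortcoming, a hedged sketch of the intended ETL pattern, assuming a 
SparkSession {{spark}} and hypothetical tables {{events}} and {{daily_report}}; the 
CREATE TEMPORARY TABLE statement is the proposed syntax from the design doc:

{code}
// Today the intermediate step is only a view, so its filter is re-evaluated by every consumer:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW cleaned_events AS SELECT * FROM events WHERE is_valid = true")

// Proposed: materialize it once as a session-scoped temporary table, then reuse it in later stages.
// spark.sql("CREATE TEMPORARY TABLE cleaned_events USING parquet AS SELECT * FROM events WHERE is_valid = true")
spark.sql("INSERT INTO daily_report SELECT dt, count(*) FROM cleaned_events GROUP BY dt")
{code}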



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31904) Char and varchar partition columns throw MetaException

2020-06-04 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-31904:
---
Comment: was deleted

(was: [https://github.com/apache/spark/pull/28724])

> Char and varchar partition columns throw MetaException
> --
>
> Key: SPARK-31904
> URL: https://issues.apache.org/jira/browse/SPARK-31904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
> CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
> SELECT * FROM t2 WHERE b = 'A';
> {code}
> The above SQL throws a MetaException:
> {quote}
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
>   ... 114 more
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string, or integral types)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
>   at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
>   at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>   at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
>   ... 119 more
> {quote}



--
This message was sent by Atlassian Jira

[jira] [Commented] (SPARK-31904) Char and varchar partition columns throw MetaException

2020-06-04 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125627#comment-17125627
 ] 

Lantao Jin commented on SPARK-31904:


[https://github.com/apache/spark/pull/28724]

> Char and varchar partition columns throw MetaException
> --
>
> Key: SPARK-31904
> URL: https://issues.apache.org/jira/browse/SPARK-31904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
> CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
> SELECT * FROM t2 WHERE b = 'A';
> {code}
> The above SQL throws a MetaException:
> {quote}
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
>   ... 114 more
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string, or integral types)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
>   at 
> org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
>   at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
>   at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
>   at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
>   ... 119 more
> {quote}



--
This message was sent by Atlassian 

[jira] [Updated] (SPARK-31904) Char and varchar partition columns throw MetaException

2020-06-04 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-31904:
---
Description: 
{code}
CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
SELECT * FROM t2 WHERE b = 'A';
{code}

The above SQL throws a MetaException:

{quote}
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
... 114 more
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string, or integral types)
at 
org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
at 
org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
at 
org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
at 
org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
at 
org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
at 
org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
at 
org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
at 
org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
at 
org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
at 
org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
... 119 more
{quote}
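
A quick diagnostic sketch (assuming a Hive metastore-backed catalog and the t1/t2 
tables from the reproduction above): the table itself is healthy, and the failure 
appears only when a predicate on the char/varchar partition column is pushed down to 
the metastore, whose filter path accepts only string or integral partition-key types, 
as the MetaException says.

{code}
// Listing partitions without a pushed-down filter succeeds:
spark.sql("SHOW PARTITIONS t2").show(false)

// Filtering on the varchar partition column triggers getPartitionsByFilter and the MetaException above:
spark.sql("SELECT * FROM t2 WHERE b = 'A'").show()
{code}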

> Char and varchar partition columns throw MetaException
> --
>
> Key: SPARK-31904
> URL: https://issues.apache.org/jira/browse/SPARK-31904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
> CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
> SELECT * FROM t2 WHERE b = 

[jira] [Created] (SPARK-31904) Char and varchar partition columns throw MetaException

2020-06-04 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31904:
--

 Summary: Char and varchar partition columns throw MetaException
 Key: SPARK-31904
 URL: https://issues.apache.org/jira/browse/SPARK-31904
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094215#comment-17094215
 ] 

Lantao Jin commented on SPARK-31591:


https://github.com/apache/spark/pull/28385

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production, we find that many shuffle files could be located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> Utils.createDirectory() uses the default parameter "spark"
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094214#comment-17094214
 ] 

Lantao Jin commented on SPARK-31591:


[~Ankitraj] I have already filed a PR.

> namePrefix could be null in Utils.createDirectory
> -
>
> Key: SPARK-31591
> URL: https://issues.apache.org/jira/browse/SPARK-31591
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In our production, we find that many shuffle files could be located in
> /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a
> Utils.createDirectory() uses the default parameter "spark"
> {code}
>   def createDirectory(root: String, namePrefix: String = "spark"): File = {
> {code}
> But in some cases, the actual namePrefix is null. If the method is called 
> with null, then the default value would not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31591) namePrefix could be null in Utils.createDirectory

2020-04-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31591:
--

 Summary: namePrefix could be null in Utils.createDirectory
 Key: SPARK-31591
 URL: https://issues.apache.org/jira/browse/SPARK-31591
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Lantao Jin


In our production environment, we found that many shuffle files were located under 
paths like
/hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a

Utils.createDirectory() uses the default parameter "spark":
{code}
  def createDirectory(root: String, namePrefix: String = "spark"): File = {
{code}
But in some cases the caller explicitly passes null as namePrefix, and the default 
parameter value is not applied when null is passed explicitly.
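
A simplified sketch of the issue and one defensive fix, not necessarily the patch in 
the PR linked above; the real Utils.createDirectory also retries on failure, which is 
omitted here:

{code}
import java.io.File
import java.util.UUID

// A Scala default parameter only applies when the argument is omitted, not when null is passed.
def createDirectory(root: String, namePrefix: String = "spark"): File = {
  // Defensive guard: fall back to "spark" when a caller explicitly passes null.
  val prefix = Option(namePrefix).getOrElse("spark")
  val dir = new File(root, s"$prefix-${UUID.randomUUID()}")
  dir.mkdirs()
  dir
}

createDirectory("/tmp")          // creates "/tmp/spark-<uuid>"
createDirectory("/tmp", null)    // without the guard this would create "/tmp/null-<uuid>"
{code}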



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31154) Expose basic write metrics for InsertIntoDataSourceCommand

2020-03-14 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-31154:
---
Description: 
Spark provides the `InsertableRelation` interface and `InsertIntoDataSourceCommand` 
to delegate insert processing to a data source. Unlike `DataWritingCommand`, the 
metrics map in InsertIntoDataSourceCommand is empty and is never updated, so we 
cannot get "number of written files" or "number of output rows" from its metrics.

For example, if a table is a Spark Parquet table, we can get the write metrics with:
{code}
val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
val numFiles = df.queryExecution.sparkPlan.metrics("numFiles").value
{code}
But if it is a Delta table, we cannot.

  was:
Spark provides interface `InsertableRelation` and the 
`InsertIntoDataSourceCommand` to delegate the inserting processing to a data 
source. Unlike `DataWritingCommand`, the metrics in InsertIntoDataSourceCommand 
is empty and has no chance to update. So we cannot get "number of written 
files" or "number of output rows" from its metrics.

For example, if a table is a Spark parquet table. We can get the writing 
metrics by:
{code}
val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
df.executionP
{code}
But if it is a Delta table, we cannot.


> Expose basic write metrics for InsertIntoDataSourceCommand
> --
>
> Key: SPARK-31154
> URL: https://issues.apache.org/jira/browse/SPARK-31154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Spark provides the `InsertableRelation` interface and `InsertIntoDataSourceCommand` 
> to delegate insert processing to a data source. Unlike `DataWritingCommand`, the 
> metrics map in InsertIntoDataSourceCommand is empty and is never updated, so we 
> cannot get "number of written files" or "number of output rows" from its 
> metrics.
> For example, if a table is a Spark Parquet table, we can get the write 
> metrics with:
> {code}
> val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
> val numFiles = df.queryExecution.sparkPlan.metrics("numFiles").value
> {code}
> But if it is a Delta table, we cannot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31154) Expose basic write metrics for InsertIntoDataSourceCommand

2020-03-14 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-31154:
---
Description: 
Spark provides interface `InsertableRelation` and the 
`InsertIntoDataSourceCommand` to delegate the inserting processing to a data 
source. Unlike `DataWritingCommand`, the metrics in InsertIntoDataSourceCommand 
is empty and has no chance to update. So we cannot get "number of written 
files" or "number of output rows" from its metrics.

For example, if a table is a Spark parquet table. We can get the writing 
metrics by:
{code}
val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
df.executionP
{code}
But if it is a Delta table, we cannot.

  was:Spark provides interface `InsertableRelation` and the 
`InsertIntoDataSourceCommand` to delegate the inserting processing to a data 
source. Unlike `DataWritingCommand`, the metrics in InsertIntoDataSourceCommand 
is empty and has no chance to update. So we cannot get "number of written 
files" or "number of output rows" from its metrics.


> Expose basic write metrics for InsertIntoDataSourceCommand
> --
>
> Key: SPARK-31154
> URL: https://issues.apache.org/jira/browse/SPARK-31154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Spark provides interface `InsertableRelation` and the 
> `InsertIntoDataSourceCommand` to delegate the inserting processing to a data 
> source. Unlike `DataWritingCommand`, the metrics in 
> InsertIntoDataSourceCommand is empty and has no chance to update. So we 
> cannot get "number of written files" or "number of output rows" from its 
> metrics.
> For example, if a table is a Spark parquet table. We can get the writing 
> metrics by:
> {code}
> val df = sql("INSERT INTO TABLE test_table SELECT 1, 'a'")
> df.executionP
> {code}
> But if it is a Delta table, we cannot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31154) Expose basic write metrics for InsertIntoDataSourceCommand

2020-03-14 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31154:
--

 Summary: Expose basic write metrics for InsertIntoDataSourceCommand
 Key: SPARK-31154
 URL: https://issues.apache.org/jira/browse/SPARK-31154
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


Spark provides the `InsertableRelation` interface and `InsertIntoDataSourceCommand` 
to delegate insert processing to a data source. Unlike `DataWritingCommand`, the 
metrics map in InsertIntoDataSourceCommand is empty and is never updated, so we 
cannot get "number of written files" or "number of output rows" from its metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31068) IllegalArgumentException in BroadcastExchangeExec

2020-03-05 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31068:
--

 Summary: IllegalArgumentException in BroadcastExchangeExec
 Key: SPARK-31068
 URL: https://issues.apache.org/jira/browse/SPARK-31068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


{code}
Caused by: org.apache.spark.SparkException: Failed to materialize query stage: 
BroadcastQueryStage 0
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], 
input[1, bigint, true], input[2, int, true]))
   +- *(1) Project [guid#138126, session_skey#138127L, seqnum#138132]
  +- *(1) Filter isnotnull(session_start_dt#138129) && 
(session_start_dt#138129 = 2020-01-01)) && isnotnull(seqnum#138132)) && 
isnotnull(session_skey#138127L)) && isnotnull(guid#138126))
 +- *(1) FileScan parquet p_soj_cl_t.clav_events[guid#138126, 
session_skey#138127L, session_start_dt#138129, seqnum#138132] DataFilters: 
[isnotnull(session_start_dt#138129), (session_start_dt#138129 = 2020-01-01), 
isnotnull(seqnum#138..., Format: Parquet, Location: 
TahoeLogFileIndex[hdfs://hermes-rno/workspaces/P_SOJ_CL_T/clav_events], 
PartitionFilters: [], PushedFilters: [IsNotNull(session_start_dt), 
EqualTo(session_start_dt,2020-01-01), IsNotNull(seqnum), IsNotNull(..., 
ReadSchema: 
struct, 
SelectedBucketsCount: 1000 out of 1000, UsedIndexes: []

at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:230)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:225)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.generateFinalPlan(AdaptiveSparkPlanExec.scala:225)
... 48 more
Caused by: java.lang.IllegalArgumentException: Initial capacity 670166426 
exceeds maximum capacity of 536870912
at org.apache.spark.unsafe.map.BytesToBytesMap.(BytesToBytesMap.java:196)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.(BytesToBytesMap.java:219)
at 
org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:340)
at 
org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:123)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:964)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:952)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:207)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:128)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:206)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:172)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
... 3 more
{code}
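
The failure comes from the broadcast build side: the requested initial capacity 
(670,166,426) already exceeds BytesToBytesMap's maximum capacity (536,870,912), so the 
broadcast hash relation cannot be built. A possible workaround sketch, assuming the 
join does not actually need to be broadcast, is to fall back to a shuffle-based join:

{code}
// Disable automatic broadcast joins (or lower the threshold) for the affected query:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}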



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30785) Create table like should keep tracksPartitionsInCatalog same with source table

2020-02-10 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30785:
--

 Summary: Create table like should keep tracksPartitionsInCatalog 
same with source table
 Key: SPARK-30785
 URL: https://issues.apache.org/jira/browse/SPARK-30785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


A table generated by CREATE TABLE LIKE (CTL) on a partitioned table is itself a 
partitioned table. But when running ALTER TABLE ADD PARTITION on it, Spark throws 
AnalysisException: ALTER TABLE ADD PARTITION is not allowed. That's because the 
default value of {{tracksPartitionsInCatalog}} from CTL is always {{false}}.
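
A minimal reproduction sketch, assuming a SparkSession {{spark}} with Hive support 
enabled; table names are illustrative:

{code}
spark.sql("CREATE TABLE src (id INT) PARTITIONED BY (dt STRING) STORED AS PARQUET")
spark.sql("CREATE TABLE dst LIKE src")
// Fails before the fix: dst is created with tracksPartitionsInCatalog = false,
// so Spark rejects partition management on it with the AnalysisException above.
spark.sql("ALTER TABLE dst ADD PARTITION (dt = '2020-02-10')")
{code}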



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30494) Duplicates cached RDD when create or replace an existing view

2020-01-12 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013976#comment-17013976
 ] 

Lantao Jin commented on SPARK-30494:


I will file a PR soon.

> Duplicates cached RDD when create or replace an existing view
> -
>
> Key: SPARK-30494
> URL: https://issues.apache.org/jira/browse/SPARK-30494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We can reproduce this with the commands below:
> {code}
> beeline> create or replace temporary view temp1 as select 1
> beeline> cache table temp1
> beeline> create or replace temporary view temp1 as select 1, 2
> beeline> cache table temp1
> {code}
> The cached RDD for the plan "select 1" stays in memory forever until the session 
> closes. This cached data cannot be used since the view temp1 has been replaced 
> by another plan. It's a memory leak.
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
> 2")).isDefined)
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
> 1")).isDefined)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30494) Duplicates cached RDD when create or replace an existing view

2020-01-12 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30494:
--

 Summary: Duplicates cached RDD when create or replace an existing 
view
 Key: SPARK-30494
 URL: https://issues.apache.org/jira/browse/SPARK-30494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


We can reproduce this with the commands below:
{code}
beeline> create or replace temporary view temp1 as select 1
beeline> cache table temp1
beeline> create or replace temporary view temp1 as select 1, 2
beeline> cache table temp1
{code}

The cached RDD for the plan "select 1" stays in memory forever until the session 
closes. This cached data cannot be used since the view temp1 has been replaced 
by another plan. It's a memory leak.

assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
2")).isDefined)
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
1")).isDefined)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30494) Duplicates cached RDD when create or replace an existing view

2020-01-12 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-30494:
---
Description: 
We can reproduce this with the commands below:
{code}
beeline> create or replace temporary view temp1 as select 1
beeline> cache table temp1
beeline> create or replace temporary view temp1 as select 1, 2
beeline> cache table temp1
{code}

The cached RDD for the plan "select 1" stays in memory forever until the session 
closes. This cached data cannot be used since the view temp1 has been replaced 
by another plan. It's a memory leak.

assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
2")).isDefined)
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
1")).isDefined)

  was:
We can reproduce by below commands:
{code}
beeline> create or replace temporary view temp1 as select 1
beeline> cache table tempView
beeline> create or replace temporary view temp1 as select 1, 2
beeline> cache table tempView

The cached RDD for plan "select 1" stays in memory forever until the session 
close. This cached data cannot be used since the view temp1 has been replaced 
by another plan. It's a memory leak.

assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
2")).isDefined)
assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
1")).isDefined)


> Duplicates cached RDD when create or replace an existing view
> -
>
> Key: SPARK-30494
> URL: https://issues.apache.org/jira/browse/SPARK-30494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We can reproduce this with the commands below:
> {code}
> beeline> create or replace temporary view temp1 as select 1
> beeline> cache table temp1
> beeline> create or replace temporary view temp1 as select 1, 2
> beeline> cache table temp1
> {code}
> The cached RDD for the plan "select 1" stays in memory forever until the session 
> closes. This cached data cannot be used since the view temp1 has been replaced 
> by another plan. It's a memory leak.
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
> 2")).isDefined)
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
> 1")).isDefined)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


