[ https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143575#comment-17143575 ]

Lantao Jin edited comment on SPARK-32063 at 6/24/20, 6:53 AM:
--------------------------------------------------------------

[~viirya] For 1, even though an RDD cache or table cache can improve 
performance, I still think they have totally different scopes. Besides, we can 
also cache a temporary table in memory for a further performance improvement. 
In production usage, I found that our data engineers and data scientists do 
not always remember to uncache cached tables or views. This situation becomes 
worse in the Spark thrift-server (which shares one Spark driver). 

For 2, we found that with Adaptive Query Execution enabled, complex views 
easily get stuck in the optimization step. Caching such a view doesn't help.

For 3, the scenario comes from our migration case, moving SQL from Teradata to 
Spark. Without temporary tables, TD users have to create permanent tables and 
drop them at the end of a script as a substitute for TD volatile tables; if 
the JDBC session closes or the script fails before cleanup, no mechanism 
guarantees that the intermediate data is dropped. If they use Spark temporary 
views instead, much of their logic doesn't work. For example, they want to 
execute UPDATE/DELETE operations on intermediate tables, but we cannot convert 
a temporary view to a Delta table or Hudi table ...
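The session-lifetime guarantee discussed in point 3 can be sketched with 
SQLite's TEMP tables, which are scoped to a single connection. This is an 
illustration only, not Spark or Teradata; the function name and in-script 
paths are hypothetical:

```python
import os
import sqlite3
import tempfile

def temp_table_survives_reconnect() -> bool:
    """Create a TEMP table, close the connection ("session"), reconnect,
    and report whether the table still exists."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    conn = sqlite3.connect(path)
    # A TEMP table lives only for this connection, so cleanup is guaranteed
    # even if the script fails before an explicit DROP is reached.
    conn.execute("CREATE TEMP TABLE stage AS SELECT 1 AS id")
    assert conn.execute("SELECT count(*) FROM stage").fetchone()[0] == 1
    conn.close()  # session ends; no explicit DROP needed

    conn = sqlite3.connect(path)  # a new session against the same database
    try:
        conn.execute("SELECT count(*) FROM stage")
        return True  # table somehow persisted across sessions
    except sqlite3.OperationalError:
        return False  # temp table is gone, as expected
    finally:
        conn.close()
```

This is the behavior a Spark-native temporary table would provide: the 
intermediate data disappears with the session, with no cleanup script to 
forget.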



> Spark native temporary table
> ----------------------------
>
>                 Key: SPARK-32063
>                 URL: https://issues.apache.org/jira/browse/SPARK-32063
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose lifetime 
> is limited to the current session.
> In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS 
> SELECT” creates a temporary view instead. A temporary view is totally 
> different from a temporary table. 
> A temporary view is just a VIEW. It doesn’t materialize data in storage, so 
> it has the following shortcomings:
>  # A view does not improve performance. Materializing intermediate data in 
> temporary tables for a complex query would accelerate queries, especially in 
> an ETL pipeline.
>  # A view that calls other views can cause severe performance issues. 
> Executing a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines 
> or data warehouse applications, working without a database prefix is not 
> convenient. They need some tables which are only used in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]


