[jira] [Commented] (SPARK-29035) unpersist() ignoring cache/persist()

2019-09-16 Thread Jose Silva (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930297#comment-16930297 ]

Jose Silva commented on SPARK-29035:


[~hyukjin.kwon]

What do you mean by "full reproducer"?

> unpersist() ignoring cache/persist()
> 
>
> Key: SPARK-29035
> URL: https://issues.apache.org/jira/browse/SPARK-29035
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.3
> Environment: Amazon EMR - Spark 2.4.3
> Reporter: Jose Silva
> Priority: Major
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> Calling {{unpersist()}}, even after the {{DataFrame}} is no longer used,
> removes all InMemoryTableScan nodes from the DAG.
> Here's a simplified version of the code I'm using:
> {code}
> df = spark.read(...).where(...).cache()
> df_a = union(df.select(...), df.select(...), df.select(...))
> df_b = df.select(...)
> df_c = df.select(...)
> df_d = df.select(...)
> df.unpersist()
> join(df_a, df_b, df_c, df_d).write()
> {code}
> I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with
> and without the {{unpersist()}} call.
> I call {{unpersist()}} to prevent an OOM during the join. From what I
> understand, even though all the DataFrames derive from df, unpersisting df
> after the selects shouldn't cause the cache call to be ignored, right?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29035) unpersist() ignoring cache/persist()

2019-09-10 Thread Jose Silva (Jira)
Jose Silva created SPARK-29035:
--

 Summary: unpersist() ignoring cache/persist()
 Key: SPARK-29035
 URL: https://issues.apache.org/jira/browse/SPARK-29035
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.3
 Environment: Amazon EMR - Spark 2.4.3
Reporter: Jose Silva


Calling unpersist(), even after the DataFrame is no longer used, removes all
InMemoryTableScan nodes from the DAG.

 

Here's a simplified version of the code I'm using:

df = spark.read(...).where(...).cache()
df_a = union(df.select(...), df.select(...), df.select(...))
df_b = df.select(...)
df_c = df.select(...)
df_d = df.select(...)
df.unpersist()
join(df_a, df_b, df_c, df_d).write()

I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and
without the unpersist() call.

I call unpersist() to prevent an OOM during the join. From what I understand,
even though all the DataFrames derive from df, unpersisting df after the
selects shouldn't cause the cache call to be ignored, right?
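
For anyone who wants to try this without the original data, a self-contained version of the snippet above might look like the sketch below. It is not the production job. The synthetic input from spark.range, the column names k, b, c and d, and the helper names count_cached_scans, explain_text and reproduce are all assumptions made for illustration. It builds the same shape of query (cache, union of three selects, three more selects, optional unpersist, join) and counts how many InMemoryTableScan operators show up in the plan that explain() prints.

```python
import io
from contextlib import redirect_stdout


def count_cached_scans(plan_text):
    """Count InMemoryTableScan operators in a printed physical plan."""
    return plan_text.count("InMemoryTableScan")


def explain_text(df):
    """Capture the text that DataFrame.explain() prints to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def reproduce(spark, call_unpersist):
    """Rebuild the query shape from the report on synthetic data.

    `spark` is an existing SparkSession. The data and column names are
    made up; only the sequence of calls mirrors the report.
    """
    df = (
        spark.range(1000)
        .withColumnRenamed("id", "k")
        .where("k % 2 = 0")
        .cache()
    )

    # Union of three projections of the cached DataFrame.
    df_a = df.select("k").union(df.select("k")).union(df.select("k"))

    # Three further projections of the same cached DataFrame.
    df_b = df.selectExpr("k", "k + 1 AS b")
    df_c = df.selectExpr("k", "k + 2 AS c")
    df_d = df.selectExpr("k", "k + 3 AS d")

    # As in the report, unpersist happens before the join is planned.
    if call_unpersist:
        df.unpersist()

    joined = df_a.join(df_b, "k").join(df_c, "k").join(df_d, "k")
    plan = explain_text(joined)

    # Leave no cached entry behind for the next call.
    df.unpersist()
    return plan
```

In a pyspark shell, where spark already exists, count_cached_scans(reproduce(spark, False)) and count_cached_scans(reproduce(spark, True)) can be compared directly. A difference between the two counts would correspond to the two DAGs in the album.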


