Guillem Palou created SPARK-20598:
-------------------------------------

             Summary: Iterative checkpoints do not get removed from HDFS
                 Key: SPARK-20598
                 URL: https://issues.apache.org/jira/browse/SPARK-20598
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, YARN
    Affects Versions: 2.1.0
            Reporter: Guillem Palou


I am running a PySpark application that makes use of {{DataFrame.checkpoint()}} 
because the query plan grows with each iteration and Spark eventually needs 
exponential time to compute it; without checkpointing I had to stop the job. 
Using {{checkpoint}} allowed the application to proceed with the computation, 
but I noticed that the HDFS cluster was filling up with RDD files. Spark is 
running on YARN in client mode.
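
For reference, {{checkpoint()}} requires a checkpoint directory to be set on the 
SparkContext. A minimal sketch of the setup assumed by the example below (the 
app name and {{$CHECKPOINT_PATH}} are illustrative placeholders; the cleaner 
flag is the one discussed at the end):

{code}
import gc

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder \
    .appName("checkpoint-cleanup-repro") \
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true") \
    .getOrCreate()
sc = spark.sparkContext

# checkpoint() writes its RDD files under this directory (an HDFS path on YARN)
sc.setCheckpointDir("$CHECKPOINT_PATH")
{code}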

I managed to reproduce the problem with the toy example below:

{code}
df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()

for i in range(4):
    # either of the following two lines reproduces the problem
    # (the join variant is left commented out)
    df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
    # df = df.join(df, on='a').cache().checkpoint()

    # forcing GC on both the Python and the JVM side does not seem
    # to trigger cleanup of the old checkpoint files
    gc.collect()
    sc._jvm.System.gc()
{code}

After running the code and calling {{sc.stop()}}, I can still see the 
checkpointed RDDs in HDFS:
{quote}
guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
{quote}

The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set to 
{{true}}. I would expect Spark to clean up all checkpointed RDDs that are no 
longer reachable from the application.
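
Until this gets fixed, a possible interim workaround (a sketch, not part of the 
application code above; it assumes the data under {{$CHECKPOINT_PATH}} is no 
longer needed) is to delete the leftover checkpoint files by hand through the 
Hadoop FileSystem API that py4j exposes:

{code}
# Workaround sketch: recursively delete the checkpoint directory once the
# application no longer needs the checkpointed data. $CHECKPOINT_PATH is
# the same placeholder used above.
hadoop_path = sc._jvm.org.apache.hadoop.fs.Path("$CHECKPOINT_PATH")
fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())
fs.delete(hadoop_path, True)  # True = recursive, removes the rdd-* directories
{code}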


