Guillem Palou created SPARK-20598:
-------------------------------------

             Summary: Iterative checkpoints do not get removed from HDFS
                 Key: SPARK-20598
                 URL: https://issues.apache.org/jira/browse/SPARK-20598
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core, YARN
    Affects Versions: 2.1.0
            Reporter: Guillem Palou
I am running a PySpark application that makes use of {{DataFrame.checkpoint()}} because Spark needed exponential time to compute the query plan and I eventually had to stop it. Using {{checkpoint}} allowed the application to proceed with the computation, but I noticed that the HDFS cluster was filling up with RDD files. Spark is running on YARN in client mode. I managed to reproduce the problem with the toy example below:

{code}
import gc

import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
for i in range(4):
    # either of the following 2 lines will produce the error
    df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
    df = df.join(df, on='a').cache().checkpoint()
    # the following two lines do not seem to have an effect
    gc.collect()
    sc._jvm.System.gc()
{code}

After running the code and calling {{sc.stop()}}, I can still see the checkpointed RDDs in HDFS:

{quote}
guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
{quote}

The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set to {{true}}, so I would expect Spark to clean up the checkpoint files of all RDDs that can no longer be accessed.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
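For context on why the repro calls {{gc.collect()}}: with {{spark.cleaner.referenceTracking.cleanCheckpoints}} enabled, Spark's ContextCleaner tracks driver-side objects via weak references and only deletes a checkpoint's files once the corresponding object has been garbage collected. The following is a minimal, Spark-free Python sketch of that weak-reference mechanism; {{FakeRDD}} is a hypothetical stand-in, not a Spark class:

```python
import gc
import weakref

class FakeRDD:
    """Hypothetical stand-in for a driver-side RDD reference."""
    pass

rdd = FakeRDD()
ref = weakref.ref(rdd)   # cleaner-style weak reference: does not keep rdd alive

alive_before = ref() is not None   # still strongly referenced by `rdd`

del rdd        # drop the last strong reference, as reassigning `df` does
gc.collect()   # force a collection, as the repro attempts

alive_after = ref() is not None    # now collectible; a cleaner could act
```

If the weak reference still resolved after collection, the cleaner would have no signal to remove the files, which matches the symptom reported above where checkpoint directories accumulate.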