Re: how to clean shuffle write each iteration
In ALS, I guess each iteration's RDDs are referenced by the next iteration's RDDs, so none of the shuffle data can be deleted until the ALS job finishes. I think checkpointing could solve my problem. Are you familiar with checkpointing?

On March 3, 2015, at 4:18 PM, nitin [via Apache Spark User List] ml-node+s1001560n21889...@n3.nabble.com wrote:

> Shuffle write will be cleaned if it is not referenced by any object directly/indirectly. There is a garbage collector written inside Spark which periodically checks for weak references to RDDs/shuffle write/broadcast and deletes them.

-- 
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-clean-shuffle-write-each-iteration-tp21886p21890.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
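To make the lineage argument concrete, here is a minimal toy model in plain Python. It is not Spark code: `ToyRDD`, `live_shuffles`, and `peak_live_shuffles` are hypothetical names invented for this sketch, and "checkpoint" here only means "drop the link to earlier iterations", which is roughly what `RDD.checkpoint()` (together with `SparkContext.setCheckpointDir`) does to an RDD's lineage once the checkpoint has been written. Whether the ALS implementation in a given Spark version lets you checkpoint between iterations is something to verify against that version's documentation.

```python
class ToyRDD:
    """Toy dataset: produced by one shuffle and linked to its parent."""

    def __init__(self, shuffle_id, parent=None):
        self.shuffle_id = shuffle_id  # shuffle that produced this dataset
        self.parent = parent          # previous iteration's dataset

    def checkpoint(self):
        # Assumption of this toy model: once the data is saved elsewhere,
        # neither the producing shuffle nor the parent chain is needed.
        self.shuffle_id = None
        self.parent = None


def live_shuffles(rdd):
    """Return the shuffle ids still reachable from `rdd` through parents."""
    ids = []
    while rdd is not None:
        if rdd.shuffle_id is not None:
            ids.append(rdd.shuffle_id)
        rdd = rdd.parent
    return ids


def peak_live_shuffles(iterations, checkpoint_interval=None):
    """Run a chained loop and report the largest number of live shuffles."""
    current = None
    peak = 0
    for i in range(1, iterations + 1):
        current = ToyRDD(shuffle_id=i, parent=current)
        peak = max(peak, len(live_shuffles(current)))
        if checkpoint_interval and i % checkpoint_interval == 0:
            current.checkpoint()
    return peak


if __name__ == "__main__":
    print(peak_live_shuffles(30))                         # no checkpointing
    print(peak_live_shuffles(30, checkpoint_interval=5))  # checkpoint every 5
```

In this model, 30 chained iterations keep all 30 shuffles reachable, while cutting the chain every 5 iterations keeps at most 5 reachable at any time. The real disk savings in Spark would also depend on when the cleaner actually runs.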
Re: how to clean shuffle write each iteration
Shuffle write is cleaned up once it is no longer referenced, directly or indirectly, by any object. Spark has a built-in garbage collector that periodically checks weak references to RDDs, shuffle writes, and broadcasts, and deletes the ones that are no longer reachable.

-- 
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-clean-shuffle-write-each-iteration-tp21886p21889.html

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
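The weak-reference idea can be illustrated with Python's standard `weakref` module. This is only a sketch of the general technique, not Spark's implementation: `ShuffleStore`, `ToyDataset`, `register_for_cleanup`, and `demo` are hypothetical names, and the "files" are just dictionary entries.

```python
import gc
import weakref


class ShuffleStore:
    """Toy stand-in for shuffle files on disk, keyed by shuffle id."""

    def __init__(self):
        self.files = {}

    def write(self, shuffle_id, size_tb):
        self.files[shuffle_id] = size_tb

    def remove(self, shuffle_id):
        self.files.pop(shuffle_id, None)


class ToyDataset:
    """Toy dataset that owns exactly one shuffle output."""

    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id


def register_for_cleanup(dataset, store):
    # weakref.finalize holds only a weak reference to `dataset`; the
    # callback runs after the dataset object has been garbage collected.
    weakref.finalize(dataset, store.remove, dataset.shuffle_id)


def demo():
    store = ShuffleStore()
    dataset = ToyDataset(shuffle_id=0)
    store.write(0, 1.0)
    register_for_cleanup(dataset, store)
    before = dict(store.files)  # shuffle output still present
    dataset = None              # drop the last strong reference
    gc.collect()                # make sure collection has happened
    after = dict(store.files)   # finalizer has removed the entry
    return before, after


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that cleanup is driven by reachability: as long as any object still holds the dataset, its shuffle output stays in the store.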
how to clean shuffle write each iteration
I'm using Spark ALS with the number of iterations set to 30. In each iteration, the tasks produce nearly 1 TB of shuffle write. To my surprise, this shuffle data is not cleaned up until the whole job finishes, which means I need 30 TB of disk to store it. I think that after each iteration we could delete the shuffle data written by the earlier iterations, right? How can I do this?

-- 
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-clean-shuffle-write-each-iteration-tp21886.html