Hi all,

I am applying MLlib LDA for topic modelling. I am setting up the LDA parameters as follows:
    lda.setOptimizer(optimizer)
      .setK(params.k)
      .setMaxIterations(params.maxIterations)
      .setDocConcentration(params.docConcentration)
      .setTopicConcentration(params.topicConcentration)
      .setCheckpointInterval(params.checkpointInterval)

    if (params.checkpointDir.nonEmpty) {
      sc.setCheckpointDir(params.checkpointDir.get)
    }

I am running the LDA algorithm on my local Mac, on a corpus of 800,000 English text documents (9 GB in total). The machine has 8 cores, 16 GB of RAM, and a 500 GB hard disk. Here is my Spark configuration:

    val conf = new SparkConf().setMaster("local[6]").setAppName("LDAExample")
    val sc = new SparkContext(conf)

When I run LDA with a large number of iterations (100), i.e. by calling

    val ldaModel = lda.run(corpus)

the algorithm starts creating shuffle files on my disk until there is no space left. I launch the program with spark-submit as follows:

    spark-submit --driver-memory 14G \
      --class com.heystaks.spark.ml.topicmodelling.LDAExample \
      ./target/scala-2.10/lda-assembly-1.0.jar path/to/corpus/file \
      --k 100 --maxIterations 100 \
      --checkpointDir /Users/ramialbatal/checkpoints --checkpointInterval 1

Here 'k' is the number of topics to extract. When the number of iterations and topics is small, everything is fine; but with a large iteration count such as 100, no matter what value I pass for --checkpointInterval, the outcome is the same: the disk fills up after about 25 iterations. Everything seems to run correctly and the checkpoint files are created on disk, but the shuffle files are never removed.

I am using Spark and MLlib 1.5.0, on OS X Yosemite 10.10.5.

Any help is highly appreciated.

Thanks
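P.S. In case it helps to reproduce, here is a minimal, self-contained sketch of my setup against the mllib 1.5 RDD API. The tiny hard-coded corpus and the small k are placeholders only; the real run feeds the 800,000 vectorized documents with k = 100:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    object LDARepro {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[6]").setAppName("LDAExample")
        val sc = new SparkContext(conf)

        // Checkpoint directory, set the same way as in my real job.
        sc.setCheckpointDir("/Users/ramialbatal/checkpoints")

        // Tiny stand-in corpus of (docId, term-count vector) pairs;
        // the real corpus is 800,000 documents vectorized the same way.
        val corpus = sc.parallelize(Seq(
          (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
          (1L, Vectors.dense(0.0, 1.0, 3.0, 1.0)),
          (2L, Vectors.dense(4.0, 0.0, 1.0, 2.0))
        ))

        val lda = new LDA()
          .setOptimizer(new EMLDAOptimizer())  // the EM optimizer is where I see the shuffle build-up
          .setK(3)                             // 100 in the real run
          .setMaxIterations(100)
          .setCheckpointInterval(1)

        // With the EM optimizer, run() returns a DistributedLDAModel.
        val ldaModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]
        println(s"Log likelihood: ${ldaModel.logLikelihood}")

        sc.stop()
      }
    }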