Hi all,

I am applying MLlib LDA for topic modelling. I am setting up the LDA
parameters as follows:

// Configure the LDA instance: optimizer, number of topics, iteration count,
// Dirichlet priors, and how often to checkpoint.
lda.setOptimizer(optimizer)
  .setK(params.k)
  .setMaxIterations(params.maxIterations)
  .setDocConcentration(params.docConcentration)
  .setTopicConcentration(params.topicConcentration)
  .setCheckpointInterval(params.checkpointInterval)

// Register the checkpoint directory on the SparkContext if one was supplied.
if (params.checkpointDir.nonEmpty) {
  sc.setCheckpointDir(params.checkpointDir.get)
}
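
For reference, the optimizer value passed to setOptimizer above is picked from a
command-line option roughly like this (a sketch modelled on the MLlib LDAExample;
the params.optimizer field name and the online mini-batch fraction are
placeholders, not my exact code):

import org.apache.spark.mllib.clustering.{EMLDAOptimizer, OnlineLDAOptimizer}

// "em" is the graph-based optimizer that actually uses the checkpoint settings;
// the online optimizer's mini-batch fraction below is just a placeholder value.
val optimizer = params.optimizer.toLowerCase match {
  case "em"     => new EMLDAOptimizer
  case "online" => new OnlineLDAOptimizer().setMiniBatchFraction(0.05)
  case other    => throw new IllegalArgumentException(s"Unknown optimizer: $other")
}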


I am running the LDA algorithm on my local Mac OS X machine, on a corpus of
800,000 English text documents (9GB in total). The machine has 8 cores,
16GB of RAM, and a 500GB hard disk.

Here is my Spark configuration:

val conf = new SparkConf().setMaster("local[6]").setAppName("LDAExample")
val sc = new SparkContext(conf)


When I run LDA with a large number of iterations (100), i.e. by calling
val ldaModel = lda.run(corpus), the algorithm starts to create shuffle files
on my disk, to the point where it fills up until there is no space left.
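
For context, the corpus passed to lda.run is an RDD[(Long, Vector)] of
(document ID, term-count vector) pairs, which is what MLlib 1.5 expects. The
snippet below is only a rough sketch of that shape; the tiny hard-coded
vocabulary and the params.input name stand in for my real preprocessing:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Placeholder vocabulary; in reality it is built from the corpus itself.
val vocab: Map[String, Int] = Map("spark" -> 0, "lda" -> 1, "topic" -> 2)

// One (docId, termCountVector) pair per document.
val corpus: RDD[(Long, Vector)] = sc.textFile(params.input)
  .map { line =>
    val termIds = line.toLowerCase.split("\\s+").filter(vocab.contains).map(vocab)
    val counts = termIds.groupBy(identity).map { case (id, occ) => (id, occ.length.toDouble) }
    Vectors.sparse(vocab.size, counts.toSeq)
  }
  .zipWithIndex()
  .map { case (vec, docId) => (docId, vec) }

val ldaModel = lda.run(corpus)   // the call that triggers the iterations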

I am using spark-submit to run my program as follows:

spark-submit --driver-memory 14G \
  --class com.heystaks.spark.ml.topicmodelling.LDAExample \
  ./target/scala-2.10/lda-assembly-1.0.jar path/to/corpus/file \
  --k 100 --maxIterations 100 \
  --checkpointDir /Users/ramialbatal/checkpoints --checkpointInterval 1
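
For context, those command-line options end up in a small params case class
along these lines (a sketch in the spirit of the MLlib LDAExample; the field
names mirror the setters above and the default values are placeholders):

// Sketch: how the spark-submit arguments map onto the params object used above.
case class Params(
    input: String = "",                    // positional argument: path to the corpus file
    k: Int = 20,                           // --k: number of topics
    maxIterations: Int = 10,               // --maxIterations
    docConcentration: Double = -1,         // -1 lets MLlib choose its default prior
    topicConcentration: Double = -1,
    optimizer: String = "em",              // see the optimizer sketch above
    checkpointDir: Option[String] = None,  // --checkpointDir
    checkpointInterval: Int = 10)          // --checkpointInterval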


Here 'k' is the number of topics to extract. When the number of iterations
and topics is small, everything is fine, but with a large number of iterations
such as 100, no matter what value I pass for --checkpointInterval the
phenomenon is the same: the disk fills up after about 25 iterations.

Everything seems to run correctly and the checkpoint files are created on
my disk, but the shuffle files are never removed.

I am using Spark and MLlib 1.5.0, and my machine runs Mac OS X Yosemite 10.10.5.

Any help is highly appreciated. Thanks


