Hi all,

*PROBLEM:*
I'm using Spark 1.5.0's distributed LDA for topic modelling. After about 20 iterations, the disk space is exhausted and the application breaks down.

*DETAILS:*
I'm using 4 m3.2xlarge machines (each with 30G memory and 2x80G of disk) as data nodes. I monitored disk usage, and it looks like the temporary data generated in each iteration of the EM algorithm is never cleaned up. If I increase the disk space, I can run more iterations. Since I want to make sure the algorithm converges, I set the maximum iterations to 100 and checkpointInterval to 5. Increasing disk space is not scalable for me, as the data grows dramatically.

*QUESTIONS:*
1) Have you ever encountered this problem?
2) If so, how did you solve it?
3) How can I check convergence? (I used to print out perplexity to check convergence; I don't know whether it's possible to print such info here.)

*ADDITIONAL INFO:*

*CORPUS*
-- size: 3.1G, bzip2 compressed
-- number of docs: 1.1 million

*LDA PARAMETERS*
-- number of topics: 500
-- prior: symmetric Dirichlet distribution

*HADOOP CONFIGURATION:*
dfs.data.dir=/mnt/var/lib/hadoop/dfs,/mnt1/var/lib/hadoop/dfs
dfs.name.dir=/mnt/var/lib/hadoop/dfs-name,/mnt1/var/lib/hadoop/dfs-name
yarn.nodemanager.local-dirs=/mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir

*DATA NODE LOST AND ERROR:*
After the disk fills up, the data node becomes unhealthy and is soon lost. Here is the error:
1/1 local-dirs are bad: /mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir

*OTHERS:*
I looked at online LDA, but it only produces the inferred topics; I also need the per-document topic distribution. Here are a few links related to this problem that I found online, but none of them answers my question:
https://issues.apache.org/jira/browse/SPARK-5560
http://stackoverflow.com/questions/32838903/spark-mllib-checkpointing-not-removing-shuffle-files-from-local-disk

Thanks very much,
Xuan
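P.S. In case it helps, here is roughly how I'm invoking LDA, as a minimal sketch runnable in spark-shell (the checkpoint directory path and the two-document toy corpus are placeholders standing in for my real setup):

  import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.rdd.RDD

  // checkpointInterval only takes effect once a checkpoint dir is set
  // (the path below is a placeholder, not my real one)
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoint")

  // toy stand-in for my real corpus of (docId, termCountVector) pairs
  val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
    (0L, Vectors.dense(1.0, 2.0, 0.0)),
    (1L, Vectors.dense(0.0, 1.0, 3.0))
  ))

  val lda = new LDA()
    .setOptimizer("em")          // distributed (EM-based) LDA
    .setK(500)                   // 500 topics in my real run
    .setMaxIterations(100)
    .setCheckpointInterval(5)
    .setDocConcentration(-1.0)   // -1 = default symmetric Dirichlet prior
    .setTopicConcentration(-1.0)

  // the EM optimizer yields a DistributedLDAModel
  val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]

  // this is what I need (and why online LDA doesn't fit my case):
  // the per-document topic distributions
  val docTopics: RDD[(Long, Vector)] = model.topicDistributions

  // the only convergence-related signal I've found on the trained model
  println(s"log likelihood: ${model.logLikelihood}")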
I'm using spark 1.5.0 distributedLDA to do topic modelling. It looks like after 20 iterations, the whole disk space is exhausted and the application broke down. *DETAILS:* I'm using 4 m3.2xlarge (each has 30G memory and 2x80G disk space) machines as data nodes. I monitored the disk space usage and it looks like the temporary data generated in each iteration in em algorithm is not cleaned up. If I increase the disk space, I could manage to run more iterations. Since I want to make sure that the algorithm converges, I set the maximum iterations to be 100 and checkInterval to be 5. Increasing the disk space for me is not so scalable as the data increases dramatically. *QUESTIONS:* 1) Have you ever encountered this problem? 2) if so, how do you solve this problem? 3) how to check convergence? (I used to print out perplexity to check convergency, I don't know if it's possible to print out such info) *ADDITIONAL INFO:* *CORPUS* -- size 3.1G bzip2 compressed -- num of docs 1.1million *LDA PARAMETERS* -- num of topics: 500 -- prior: symmetric Dirichlet distribution *HADOOP CONFIGURATION:* dfs.data.dir=/mnt/var/lib/hadoop/dfs,/mnt1/var/lib/hadoop/dfs dfs.name.dir=/mnt/var/lib/hadoop/dfs-name,/mnt1/var/lib/hadoop/dfs-name yarn.nodemanager.local-dirs= /mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir *DATA NODE LOST AND ERROR:* After the disk space is full, the data node becomes unhealthy and gets lost soon. Here is the simple error. 1/1 local-dirs are bad: /mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir *OTHERS:* I checked out onlineLDA. But it only generates inferred topics. I also need doc-topic distribution. Here is a few links related to this problem I found online, but none answers my question. https://issues.apache.org/jira/browse/SPARK-5560 http://stackoverflow.com/questions/32838903/spark-mllib-checkpointing-not-removing-shuffle-files-from-local-disk Thanks very much Xuan