Hi all,

*PROBLEM:*

I'm using Spark 1.5.0's distributed LDA (the EM optimizer) for topic
modelling. After about 20 iterations, the disk space on the data nodes is
exhausted and the application breaks down.

*DETAILS:*

I'm using 4 m3.2xlarge machines (each with 30G of memory and 2x80G of disk)
as data nodes. I monitored the disk usage, and it looks like the temporary
data generated in each iteration of the EM algorithm is never cleaned up.
If I add disk space, I can run more iterations. Since I want to make sure
the algorithm converges, I set the maximum number of iterations to 100 and
checkpointInterval to 5. Adding disk space is not a scalable fix for me, as
the data is growing dramatically.
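
For reference, here is a minimal sketch of how I set this up (the app name,
the checkpoint directory, and the corpus-loading step are placeholders for
my actual code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("lda-topics"))

    // placeholder load: in reality 1.1 million docs as (docId, termCounts)
    val corpus: RDD[(Long, Vector)] =
      sc.objectFile[(Long, Vector)]("hdfs:///data/corpus-vectors")
    corpus.cache()

    // checkpointing should truncate the lineage that EM builds up each
    // iteration (the directory here is a placeholder)
    sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")

    val lda = new LDA()
      .setK(500)                 // number of topics
      .setMaxIterations(100)
      .setCheckpointInterval(5)  // checkpoint every 5 iterations
      .setOptimizer("em")        // distributed LDA

    val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]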

*QUESTIONS:*

1) Have you ever encountered this problem?
2) If so, how did you solve it?
3) How do you check convergence? (I used to print out perplexity to check
convergence, but I don't know whether that kind of information can be
printed out here; see the sketch after this list for what I have in mind.)
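
For 3), what I have in mind is something like the following sketch, reusing
`lda` and `corpus` from the setup above (it assumes
perplexity = exp(-logLikelihood / tokenCount)):

    // rerun with increasing maxIterations and watch log-likelihood /
    // perplexity flatten out; DistributedLDAModel exposes logLikelihood
    val distModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]
    val tokenCount = corpus.map(_._2.toArray.sum).sum()
    val perplexity = math.exp(-distModel.logLikelihood / tokenCount)
    println(s"logLikelihood=${distModel.logLikelihood} perplexity=$perplexity")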

*ADDITIONAL INFO:*

*CORPUS*

-- size: 3.1G, bzip2 compressed
-- number of docs: 1.1 million

*LDA PARAMETERS*

-- number of topics: 500
-- prior: symmetric Dirichlet distribution

*HADOOP CONFIGURATION:*

dfs.data.dir=/mnt/var/lib/hadoop/dfs,/mnt1/var/lib/hadoop/dfs
dfs.name.dir=/mnt/var/lib/hadoop/dfs-name,/mnt1/var/lib/hadoop/dfs-name
yarn.nodemanager.local-dirs=/mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir

*DATA NODE LOST AND ERROR:*

After the disk space fills up, the data node becomes unhealthy and is soon
lost. Here is the error:

1/1 local-dirs are bad:
/mnt/var/lib/hadoop/tmp/nm-local-dir,/mnt1/var/lib/hadoop/tmp/nm-local-dir

*OTHERS:*

I checked out online LDA, but it only gives me the inferred topics; I also
need the per-document topic distributions.
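
For context, this is what I depend on from the EM model (a sketch, using
`distModel` from above):

    // RDD of (docId, topic mixture) pairs for the training documents
    val docTopics: RDD[(Long, Vector)] = distModel.topicDistributions
    docTopics.take(3).foreach { case (docId, dist) =>
      println(s"doc $docId -> ${dist.toArray.take(10).mkString(", ")}")
    }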

Here are a few links related to this problem that I found online, but none
of them answers my question.

https://issues.apache.org/jira/browse/SPARK-5560

http://stackoverflow.com/questions/32838903/spark-mllib-checkpointing-not-removing-shuffle-files-from-local-disk

Thanks very much

Xuan
