Hi,

I am running Mahout's implementation of LDA on a Hadoop cluster of *8 nodes*,
each with *4 cores and 8GB of RAM* (8 A3 nodes on Azure, actually).

My dataset is composed of *3.5M documents* (~ 7GB, ~130 words/doc) which
are converted to a matrix of ~2.4 GB.
The size of the *vocabulary is ~150k terms*.
The training is done with *500 topics*.
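
To give a sense of scale, here is a rough back-of-the-envelope sketch of the problem size. It assumes (I have not verified this against the Mahout source) that the topic-term model is held as a dense matrix of 8-byte doubles, and that per-iteration work is bounded above by tokens x topics. The variable names are my own.

```python
# Figures taken from the description above.
N_DOCS = 3_500_000
WORDS_PER_DOC = 130        # approximate average
VOCAB_SIZE = 150_000       # approximate
N_TOPICS = 500

# Assumption: dense topic-term matrix of 8-byte doubles.
BYTES_PER_ENTRY = 8
model_entries = N_TOPICS * VOCAB_SIZE
model_bytes = model_entries * BYTES_PER_ENTRY
model_mib = model_bytes / 1024**2

# Assumption: upper bound on work per iteration, one update per
# (token, topic) pair. Sparse document vectors would do less.
total_tokens = N_DOCS * WORDS_PER_DOC
updates_per_iteration = total_tokens * N_TOPICS

print(f"model entries: {model_entries:,}")
print(f"model size: {model_mib:.0f} MiB")
print(f"token-topic updates per iteration: {updates_per_iteration:,}")
```

If those assumptions hold, the model alone would take a noticeable share of a 3GB container, which may be worth checking alongside GC activity in the map tasks.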

Each node uses 2 containers, with 3GB and 2 cores each.
The max split size is 32MB, which creates 76 splits.
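
The arithmetic behind those numbers can be sketched as follows. This is only an estimate: it ignores the container taken by the YARN ApplicationMaster, and the ~2.4 GB matrix size is approximate, so the computed split count lands near, not exactly on, the 76 splits reported.

```python
import math

# Figures taken from the description above.
MATRIX_BYTES = 2.4 * 1024**3    # ~2.4 GB input matrix (approximate)
SPLIT_BYTES = 32 * 1024**2      # 32 MB max split size
REPORTED_SPLITS = 76            # what the job actually created
NODES = 8
CONTAINERS_PER_NODE = 2
N_DOCS = 3_500_000

# Estimated number of splits from the sizes alone.
approx_splits = math.ceil(MATRIX_BYTES / SPLIT_BYTES)

# Map tasks that can run at the same time (ignoring the ApplicationMaster).
containers = NODES * CONTAINERS_PER_NODE

# Number of sequential "waves" of map tasks needed per iteration.
waves = math.ceil(REPORTED_SPLITS / containers)

# Average number of documents each map task has to process.
docs_per_split = N_DOCS // REPORTED_SPLITS

print(approx_splits, containers, waves, docs_per_split)
```

So each iteration runs its maps in about 5 waves over 16 containers, with roughly 46,000 documents per map task.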


Despite trying multiple configs (# containers, # mappers, # reducers, split
size), each iteration takes roughly *20 hours*, which is probably far above
the expected time.

Each map takes several hours to complete.

Based on this information (I can provide more), what is roughly the
expected time per iteration? (I could not find any benchmarks.)
If it can be dramatically improved, what can be modified or monitored to
understand and improve performance? (I use HDInsight on Linux with Ambari.)
The commands are available below.


In advance, thanks a lot for your help.

Best,
Bernard


Here are my commands:

mahout seqdirectory -i docs -o out/sequenced

mahout seq2sparse -i out/sequenced \
    -o out/sparseVectors \
    --namedVector --minSupport 100

mahout rowid \
    -i out/sparseVectors/tf-vectors/ \
    -o out/matrix

mahout cvb -i out/matrix/matrix \
    -dict out/sparseVectors/dictionary.file-0 -k 500 -x 10 -dt out/cvb/do_out \
    -o out/cvb/to_out -ow -mt out/temp

mahout vectordump \
    -i out/cvb/to_out \
    --dictionary out/sparseVectors/dictionary.file-0 --dictionaryType sequencefile \
    --vectorSize 10 \
    -sort out/cvb/to_out
