Hi,

We're considering the Spark MLlib (v >= 1.5) LDA implementation for topic
modelling. We plan to train the model on a data set of about 12M documents
with a vocabulary of 200k-300k terms. Documents are relatively short,
typically fewer than 10 words, though some run to a few tens of words. The
model would be retrained periodically by a batch process, while predictions
would be served by a long-running application process in which we plan to
embed MLlib.
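
For concreteness, here's a minimal sketch of the intended flow using the
RDD-based API (corpus preparation elided; the topic count, iteration count
and mini-batch fraction are placeholder values, and picking the online
optimizer is just an assumption on my part):

import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object LdaSketch {
  // corpus: (document ID, term-count vector over the ~200-300k term vocabulary)
  def train(corpus: RDD[(Long, Vector)], numTopics: Int): LocalLDAModel = {
    val lda = new LDA()
      .setK(numTopics)
      .setMaxIterations(100)  // placeholder
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))  // placeholder
    // the online optimizer produces a LocalLDAModel directly
    lda.run(corpus).asInstanceOf[LocalLDAModel]
  }

  // in the long-running application: query per-document topic distributions
  def predict(model: LocalLDAModel, docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
    model.topicDistributions(docs)
}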

Is the MLlib LDA implementation considered well-suited to this kind of use
case?

I did some prototyping based on the code samples on the "MLlib - Clustering"
documentation page and noticed that the topics matrix values vary quite a
bit across training runs, even with exactly the same input data set. I
observed similar behaviour during prediction.
Is this due to the probabilistic nature of the LDA algorithm?
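If the variation comes from random initialization, I'd expect fixing the
seed to make training runs reproducible, along these lines (a sketch only;
I haven't verified this end to end):

val lda = new LDA()
  .setK(numTopics)
  .setSeed(1L)  // fixed seed; initialization becomes deterministic
val model = lda.run(corpus)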

Any caveats to be aware of with the LDA implementation?

For reference, my prototype code can be found here:
https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala


thanks,
marko
