With a vocabulary of 4M words and a vector size of 400, you need 4M * 400 = 1.6B floats just to store the word vectors, which is about 6.4GB at 4 bytes per float, and training keeps a second array of the same size (syn1) alongside it. The current implementation stores the model on the driver node, so I don't think it would work at that scale. You might try increasing minCount to shrink the vocabulary, and reducing the vector size. I'm interested in the trade-off between model size and model quality, so if you have run any experiments, please let me know. Thanks! -Xiangrui
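For reference, a minimal sketch of the two knobs I mean, assuming a spark-shell SparkContext `sc` and a hypothetical whitespace-tokenized corpus path; the values are illustrative, not recommendations:

    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    // Hypothetical input: one sentence per line, whitespace-tokenized.
    val corpus: RDD[Seq[String]] =
      sc.textFile("hdfs:///path/to/corpus.txt").map(_.split(" ").toSeq)

    val word2vec = new Word2Vec()
      .setVectorSize(100) // driver memory scales linearly with this
      .setMinCount(25)    // drop rare words to shrink the vocabulary (default 5)

    val model = word2vec.fit(corpus)
    // Rough size of the stored vectors: vocabSize * vectorSize * 4 bytes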
On Wed, May 13, 2015 at 11:17 AM, Shilad Sen <s...@macalester.edu> wrote:
> Hi all,
>
> I'm experimenting with Spark's Word2Vec implementation on a relatively
> large corpus (5B words, 4M-word vocabulary, 400-dimensional vectors). Has
> anybody had success running it at this scale?
>
> Thanks in advance for your guidance!
>
> -Shilad
>
> --
> Shilad W. Sen
> Associate Professor
> Mathematics, Statistics, and Computer Science Dept.
> Macalester College
> s...@macalester.edu
> http://www.shilad.com
> https://www.linkedin.com/in/shilad
> 651-696-6273