You might also want to look at the Yahoo! Engineering post; it looks like they are experimenting with similar efforts in large-scale word2vec:
http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, May 19, 2015 1:25 PM
To: Shilad Sen
Cc: user
Subject: Re: Word2Vec with billion-word corpora

With vocabulary size 4M and vector size 400, you need 400 * 4M = 1.6B floats to store the model. That is 6.4GB for the vectors alone. We store the model on the driver node in the current implementation, so I don't think it would work. You might try increasing minCount to decrease the vocabulary size, and reducing the vector size. I'm interested in learning the trade-off between model size and model quality. If you have done some experiments, please let me know. Thanks!

-Xiangrui

On Wed, May 13, 2015 at 11:17 AM, Shilad Sen <s...@macalester.edu> wrote:
> Hi all,
>
> I'm experimenting with Spark's Word2Vec implementation on a
> relatively large corpus (5B words, vocabulary size 4M, 400-dimensional
> vectors). Has anybody had success running it at this scale?
>
> Thanks in advance for your guidance!
>
> -Shilad
>
> --
> Shilad W. Sen
> Associate Professor
> Mathematics, Statistics, and Computer Science Dept.
> Macalester College
> s...@macalester.edu
> http://www.shilad.com
> https://www.linkedin.com/in/shilad
> 651-696-6273

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
---------------------------------------------------------------------
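The back-of-the-envelope estimate in the reply can be sketched as a small helper. This is a rough sketch, not part of Spark itself: `word2vec_model_bytes` is a hypothetical name, and it counts only the output word vectors as 4-byte floats (training also keeps a second weight matrix of the same shape, roughly doubling the peak footprint, and JVM object overhead adds more on top).

```python
def word2vec_model_bytes(vocab_size, vector_size, bytes_per_float=4):
    """Lower-bound memory estimate for a word2vec model's vectors.

    vocab_size * vector_size floats, at bytes_per_float bytes each.
    Spark MLlib collects the trained vectors to the driver, so this
    approximates the driver-side footprint of the finished model.
    """
    return vocab_size * vector_size * bytes_per_float

# The configuration discussed in the thread: 4M vocabulary, 400-dim vectors.
size = word2vec_model_bytes(4_000_000, 400)
print(f"{size / 1e9:.1f} GB")  # → 6.4 GB
```

Plugging in a smaller vocabulary (e.g. a higher minCount cutting the vocabulary to 1M words) or a smaller vector size shrinks the estimate proportionally, which is the trade-off against model quality the reply asks about.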