You might also want to look at this Yahoo! engineering post; it looks like they
are experimenting with similar efforts in large-scale word2vec:

http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio



-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Tuesday, May 19, 2015 1:25 PM
To: Shilad Sen
Cc: user
Subject: Re: Word2Vec with billion-word corpora

With vocabulary size 4M and vector size 400, you need 400 * 4M = 1.6B floats
to store the model. At 4 bytes per float, that is about 6.4GB. We store the
model on the driver node in the current implementation, so I don't think it
would work. You might try increasing minCount to decrease the vocabulary size
and reducing the vector size. I'm interested in learning the trade-off between
model size and model quality. If you have done some experiments, please let
me know. Thanks!
-Xiangrui
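
[Editor's note: for concreteness, a minimal sketch of the kind of tuning
Xiangrui suggests, using MLlib's Word2Vec API. The input path and the
specific parameter values below are hypothetical, picked only to illustrate
how shrinking vocabulary and vector size shrinks the driver-side model.]

    import org.apache.spark.mllib.feature.Word2Vec

    // Assumes an existing SparkContext `sc`; the corpus path is hypothetical.
    val corpus = sc.textFile("hdfs:///corpora/wiki-5b.txt")
      .map(_.split(" ").toSeq)

    // Driver memory for the model is roughly vocabSize * vectorSize * 4 bytes:
    // 4M words * 400 dims * 4B ~= 6.4GB; 1M words * 200 dims * 4B ~= 0.8GB.
    val word2vec = new Word2Vec()
      .setVectorSize(200)   // smaller vectors than the original 400
      .setMinCount(50)      // drop rare words to shrink the 4M vocabulary
      .setNumPartitions(64) // spread training across the cluster

    val model = word2vec.fit(corpus)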

On Wed, May 13, 2015 at 11:17 AM, Shilad Sen <s...@macalester.edu> wrote:
> Hi all,
>
> I'm experimenting with Spark's Word2Vec implementation on a
> relatively large corpus (5B words, vocabulary size 4M, 400-dimensional
> vectors). Has anybody had success running it at this scale?
>
> Thanks in advance for your guidance!
>
> -Shilad
>
> --
> Shilad W. Sen
> Associate Professor
> Mathematics, Statistics, and Computer Science Dept.
> Macalester College
> s...@macalester.edu
> http://www.shilad.com
> https://www.linkedin.com/in/shilad
> 651-696-6273

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org