With a vocabulary of 4M words and a vector size of 400, you need 4M * 400 = 1.6B floats just to store the word vectors, which is about 6.4GB at 4 bytes per float, and training keeps a second array of the same size (syn1) alongside it. The current implementation stores the model on the driver node, so I don't think it would work at that scale. You might try increasing minCount to shrink the vocabulary, and reducing the vector size. I'm interested in the trade-off between model size and model quality, so if you have run any experiments, please let me know. Thanks! -Xiangrui
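For reference, a minimal sketch of the two knobs I mean, assuming a spark-shell SparkContext `sc` and a hypothetical whitespace-tokenized corpus path; the values are illustrative, not recommendations:

    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD

    // Hypothetical input: one sentence per line, whitespace-tokenized.
    val corpus: RDD[Seq[String]] =
      sc.textFile("hdfs:///path/to/corpus.txt").map(_.split(" ").toSeq)

    val word2vec = new Word2Vec()
      .setVectorSize(100) // driver memory scales linearly with this
      .setMinCount(25)    // drop rare words to shrink the vocabulary (default 5)

    val model = word2vec.fit(corpus)
    // Rough size of the stored vectors: vocabSize * vectorSize * 4 bytes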
On Wed, May 13, 2015 at 11:17 AM, Shilad Sen <s...@macalester.edu> wrote:
> Hi all,
>
> I'm experimenting with Spark's Word2Vec implementation on a relatively
> large corpus (5B words, 4M-word vocabulary, 400-dimensional vectors). Has
> anybody had success running it at this scale?
>
> Thanks in advance for your guidance!
>
> -Shilad
>
> --
> Shilad W. Sen
> Associate Professor
> Mathematics, Statistics, and Computer Science Dept.
> Macalester College
> s...@macalester.edu
> http://www.shilad.com
> https://www.linkedin.com/in/shilad
> 651-696-6273