Oops, just kidding, this method is not in the current release. However, it is
included in the latest commit on git if you want to do a build.
> On Jan 6, 2015, at 2:56 PM, Ganon Pierce wrote:
>
> Two billion words is a very large vocabulary… You can try solving this issue
> by setting the
Two billion words is a very large vocabulary… You can try solving this issue by
setting the number of times words must occur in order to be included in the
vocabulary using setMinCount; this will prevent common misspellings, websites,
and other things from being included and may improve th
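For reference, a minimal sketch of what that call might look like against the
MLlib Word2Vec API, assuming a build that already includes setMinCount (the
threshold of 5 and the corpus path are just placeholders):

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    import org.apache.spark.rdd.RDD

    // One Seq[String] of tokens per line of the corpus; sc is an existing SparkContext.
    val corpus: RDD[Seq[String]] = sc.textFile("hdfs:///path/to/corpus")
      .map(_.split(" ").toSeq)

    val word2vec = new Word2Vec()
      .setVectorSize(100)  // dimensionality of the learned vectors
      .setMinCount(5)      // drop words seen fewer than 5 times, shrinking vocabSize

    val model: Word2VecModel = word2vec.fit(corpus)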
Thanks Zhan. I'm also confused by the jstack output: why does the driver get
stuck at "org.apache.spark.SparkContext.clean"?
On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang wrote:
> I think it is overflow. The training data is quite big. The algorithm's
> scalability highly depends on the vocabSize.
I think it is overflow. The training data is quite big. The algorithm's
scalability highly depends on the vocabSize. Even without overflow, there are
still other bottlenecks, for example syn0Global and syn1Global, each of which
has vocabSize * vectorSize elements.
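To put rough numbers on that, a back-of-the-envelope sketch, taking the ~2B
vocabulary figure mentioned above at face value and assuming the default vector
size of 100 with 4-byte Float entries:

    // Rough size of each of the two weight arrays (syn0Global, syn1Global).
    // The figures below are assumptions for illustration, not measurements from this job.
    val vocabSize  = 2000000000L  // ~2B words, per the estimate above
    val vectorSize = 100L         // MLlib Word2Vec default vector size
    val elements   = vocabSize * vectorSize      // 2e11 entries per array
    val gigabytes  = elements * 4L / (1L << 30)  // Float = 4 bytes
    println(s"per array: $elements elements, about $gigabytes GiB")
    // Note: 2e11 also exceeds the maximum JVM array length (Int.MaxValue),
    // so an Array[Float] of that size could not even be allocated.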
Thanks.
Zhan Zhang
On Jan 5
Hi Xiangrui,
Our dataset is about 80 GB (10B lines).
In the driver's log, we found this:
INFO Word2Vec: trainWordsCount = -1610413239
It seems there is an integer overflow?
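As a quick illustration of how a 32-bit counter can end up negative (a sketch
only; the 1.5B increments below are arbitrary, chosen just to push the sum past
Int.MaxValue):

    // trainWordsCount appears to be accumulated in an Int; once the running total
    // passes Int.MaxValue (2,147,483,647) it wraps around and can turn negative.
    var trainWordsCount = 0
    trainWordsCount += 1500000000
    trainWordsCount += 1500000000
    println(trainWordsCount)  // prints -1294967296 instead of 3000000000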
On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
> How big is your dataset, and what is the vocabulary size? -X
How big is your dataset, and what is the vocabulary size? -Xiangrui
On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> Hi,
>
> When we run MLlib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
> usage. Here is the jstack output:
>
> "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
Hi,
When we run MLlib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
usage. Here is the jstack output:
"main" prio=10 tid=0x40112800 nid=0x46f2 runnable
[0x4162e000]
java.lang.Thread.State: RUNNABLE
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Object