Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Oops, just kidding, this method is not in the current release. However, it is included in the latest commit on git if you want to do a build.

> On Jan 6, 2015, at 2:56 PM, Ganon Pierce wrote:
>
> Two billion words is a very large vocabulary… You can try solving this issue
> by setting the

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Two billion words is a very large vocabulary… You can try solving this issue by setting the number of times a word must occur in order to be included in the vocabulary using setMinCount. This will prevent common misspellings, websites, and other things from being included and may improve th
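A minimal sketch in plain Scala (not the MLlib API, which per the follow-up is only in a post-release git build) of what a minCount cutoff does to a vocabulary; the toy corpus and threshold are made-up illustrations:

```scala
object MinCountSketch {
  def main(args: Array[String]): Unit = {
    // Toy corpus: "sprak" is a rare misspelling we want pruned
    val tokens = Seq("spark", "spark", "spark", "driver", "driver", "sprak")
    val minCount = 2

    // Keep only words that occur at least minCount times
    val vocab = tokens
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minCount => word }
      .toSet

    println(vocab) // "sprak" is dropped; only "spark" and "driver" survive
  }
}
```

Pruning rare tokens shrinks vocabSize directly, which is what drives the memory and overflow problems discussed downthread.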

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Thanks Zhan, I'm also confused about the jstack output: why does the driver get stuck at "org.apache.spark.SparkContext.clean"?

> On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang wrote:
>
> I think it is overflow. The training data is quite big. The algorithm's
> scalability depends heavily on the vocabSize.

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is overflow. The training data is quite big. The algorithm's scalability depends heavily on the vocabSize. Even without overflow, there are still other bottlenecks, for example syn0Global and syn1Global: each of them has vocabSize * vectorSize elements. Thanks. Zhan Zhang On Jan 5
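To make that bottleneck concrete, a back-of-the-envelope sketch: with the ~2 billion-word vocabulary mentioned upthread and an assumed vectorSize of 100 (both illustrative figures, not from the source), each of syn0Global and syn1Global would need vocabSize * vectorSize floats, which exceeds a JVM array's Int length limit and would take hundreds of GiB regardless:

```scala
object VocabMemorySketch {
  def main(args: Array[String]): Unit = {
    val vocabSize  = 2000000000L // ~2 billion words, per the thread
    val vectorSize = 100L        // assumed vector dimensionality

    val floats = vocabSize * vectorSize  // elements per array
    val gib    = floats * 4 / (1L << 30) // 4 bytes per Float

    // A JVM array cannot even hold this many elements (length is an Int)
    println(floats > Int.MaxValue)  // true
    println(s"~$gib GiB per array") // ~745 GiB each for syn0Global and syn1Global
  }
}
```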

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Hi Xiangrui, Our dataset is about 80GB (10B lines). In the driver's log, we found this: *INFO Word2Vec: trainWordsCount = -1610413239* It seems that there is an integer overflow?

> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
>
> How big is your dataset, and what is the vocabulary size? -X
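The negative value is the signature of a 32-bit wraparound: the logged figure is exactly what a true count of about 2.68 billion words becomes when truncated to an Int (the 2,684,554,057 figure below is back-computed from the log line, not stated in the source):

```scala
object OverflowSketch {
  def main(args: Array[String]): Unit = {
    // Back-computed candidate true count: -1610413239 + 2^32 = 2684554057
    val trueCount: Long = 2684554057L

    // Narrowing to a 32-bit Int wraps past Int.MaxValue into negatives
    println(trueCount.toInt) // -1610413239, matching the driver log
  }
}
```

Keeping such counters as Long rather than Int avoids the wraparound.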

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui

> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
>
> Hi,
>
> When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
> usage. Here is the jstack output:
>
> "main" prio=10 tid=0x40112800 nid=0x46f2 runnable

Driver hangs on running mllib word2vec

2015-01-04 Thread Eric Zhen
Hi,

When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output:

"main" prio=10 tid=0x40112800 nid=0x46f2 runnable [0x4162e000]
   java.lang.Thread.State: RUNNABLE
     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(Object