Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Oops, just kidding, this method is not in the current release. However, it is 
included in the latest commit on git if you want to do a build.


> On Jan 6, 2015, at 2:56 PM, Ganon Pierce wrote:
> 
> Two billion words is a very large vocabulary… You can try solving this issue
> by setting the minimum number of times a word must occur in order to be
> included in the vocabulary using setMinCount. This will prevent common
> misspellings, websites, and other noise from being included, and may improve
> the overall quality of your model.
> 
>  
>> On Jan 6, 2015, at 12:59 AM, Eric Zhen wrote:
>> 
>> Thanks Zhan, I'm also confused about the jstack output. Why does the driver
>> get stuck at "org.apache.spark.SparkContext.clean"?
>> 
>> On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang wrote:
>> I think it is overflow. The training data is quite big. The algorithm's
>> scalability highly depends on the vocabSize. Even without overflow, there
>> are still other bottlenecks, for example, syn0Global and syn1Global, each of
>> which has vocabSize * vectorSize elements.
>> 
>> Thanks.
>> 
>> Zhan Zhang
>> 
>> 
>> 
>> On Jan 5, 2015, at 7:47 PM, Eric Zhen wrote:
>> 
>>> Hi Xiangrui,
>>> 
>>> Our dataset is about 80 GB (10B lines).
>>> 
>>> In the driver's log, we found this:
>>> 
>>> INFO Word2Vec: trainWordsCount = -1610413239
>>> 
>>> It seems that there is an integer overflow?
>>> 
>>> 
>>> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
>>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>>> 
>>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
>>> > Hi,
>>> >
>>> > When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
>>> > CPU usage. Here is the jstack output:
>>> >
>>> > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
>>> > [0x4162e000]
>>> >java.lang.Thread.State: RUNNABLE
>>> > at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>>> > at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>>> > at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>>> > at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>>> > at
>>> > java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>>> > at
>>> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>>> > at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>>> > at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> > at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> > at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> > at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> > at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> > at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> > at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> > at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> > at
>>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>>> > at
>>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>>> > at
>>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>>> > at
>>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>>> > at
>>> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>>> > at
>>> > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>>> > at
>>> > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>>> > at
>>> > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>>> > at
>>> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>>> > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>>> > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>>> > at
>>> > org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>>> > at 
>>> > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> > at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>>> > at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>>> > at com.baidu.inf.WordCount.main(WordCount.scala)
>>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Nat

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Two billion words is a very large vocabulary… You can try solving this issue by
setting the minimum number of times a word must occur in order to be included
in the vocabulary using setMinCount. This will prevent common misspellings,
websites, and other noise from being included, and may improve the overall
quality of your model.
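For anyone who wants to try it, here is a minimal sketch against a build of
current master (setMinCount is not in the released 1.1/1.2 MLlib, per the
correction above; the cutoff of 50 and the way the input RDD is built are
made-up for illustration):

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

// input: an RDD of tokenized sentences, e.g. sc.textFile(path).map(_.split(" ").toSeq)
def train(input: RDD[Seq[String]]): Word2VecModel = {
  new Word2Vec()
    .setMinCount(50)      // hypothetical cutoff: drop words seen fewer than 50 times
    .setVectorSize(100)   // MLlib's default embedding dimension
    .fit(input)
}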

 
> On Jan 6, 2015, at 12:59 AM, Eric Zhen wrote:
> 
> Thanks Zhan, I'm also confused about the jstack output. Why does the driver
> get stuck at "org.apache.spark.SparkContext.clean"?
> 
> On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang wrote:
> I think it is overflow. The training data is quite big. The algorithm's
> scalability highly depends on the vocabSize. Even without overflow, there are
> still other bottlenecks, for example, syn0Global and syn1Global, each of which
> has vocabSize * vectorSize elements.
> 
> Thanks.
> 
> Zhan Zhang
> 
> 
> 
> On Jan 5, 2015, at 7:47 PM, Eric Zhen wrote:
> 
>> Hi Xiangrui,
>> 
>> Our dataset is about 80 GB (10B lines).
>> 
>> In the driver's log, we found this:
>> 
>> INFO Word2Vec: trainWordsCount = -1610413239
>> 
>> It seems that there is an integer overflow?
>> 
>> 
>> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>> 
>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
>> > Hi,
>> >
>> > When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
>> > CPU usage. Here is the jstack output:
>> >
>> > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
>> > [0x4162e000]
>> >java.lang.Thread.State: RUNNABLE
>> > at
>> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>> > at
>> > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>> > at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>> > at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>> > at
>> > java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>> > at
>> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>> > at
>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>> > at
>> > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>> > at
>> > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>> > at
>> > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>> > at
>> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>> > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>> > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>> > at
>> > org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> > at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>> > at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>> > at com.baidu.inf.WordCount.main(WordCount.scala)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > at
>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > at java.lang.reflect.Method.invoke(Method.java:597)
>> > at
>> > org.apache.spark.deploy.SparkSubmit$.launch(Spar

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Thanks Zhan, I'm also confused about the jstack output. Why does the driver get
stuck at "org.apache.spark.SparkContext.clean"?
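One plausible reading of that stack, offered as an illustration rather than a
confirmed diagnosis: SparkContext.clean invokes the ClosureCleaner, which
serializes the closure once just to prove it is serializable (the trace shows
ensureSerializable feeding writeFloats). If the closure has captured the huge
syn0Global/syn1Global arrays, that single check pushes gigabytes through Java
serialization, so the driver looks hung while it is really just serializing. A
self-contained sketch of that cost (the array size below is made up):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Stand-in for a closure that has captured a large weight array.
case class FakeClosure(weights: Array[Float]) extends Serializable

val big = FakeClosure(new Array[Float](100000000))   // ~400 MB of floats
val out = new ObjectOutputStream(new ByteArrayOutputStream())
out.writeObject(big)   // one serializability check; this is the writeFloats hot loop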

On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang wrote:

> I think it is overflow. The training data is quite big. The algorithm's
> scalability highly depends on the vocabSize. Even without overflow, there
> are still other bottlenecks, for example, syn0Global and syn1Global, each
> of which has vocabSize * vectorSize elements.
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Jan 5, 2015, at 7:47 PM, Eric Zhen wrote:
>
> Hi Xiangrui,
>
> Our dataset is about 80 GB (10B lines).
>
> In the driver's log, we found this:
>
> *INFO Word2Vec: trainWordsCount = -1610413239*
>
> It seems that there is an integer overflow?
>
>
> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
>
>> How big is your dataset, and what is the vocabulary size? -Xiangrui
>>
>> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
>> > Hi,
>> >
>> > When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
>> > CPU usage. Here is the jstack output:
>> >
>> > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
>> > [0x4162e000]
>> >java.lang.Thread.State: RUNNABLE
>> > at
>> >
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
>> > at
>> >
>> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
>> > at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
>> > at
>> java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
>> > at
>> >
>> java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
>> > at
>> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
>> > at
>> >
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> >
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> >
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> >
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> >
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
>> > at
>> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
>> > at
>> >
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
>> > at
>> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
>> > at
>> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
>> > at
>> >
>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>> > at
>> >
>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>> > at
>> >
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>> > at
>> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>> > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>> > at
>> org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>> > at
>> >
>> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>> > at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> > at
>> org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
>> > at com.baidu.inf.WordCount$.main(WordCount.scala:31)
>> > at com.baidu.inf.WordCount.main(WordCount.scala)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > at java.lang.reflect.Method.invoke(Method.java:597)
>> > at
>> > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>> > at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> >
>> > --
>> > Best Regards
>>
>
>
>
> --
> Best Regards
>

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is overflow. The training data is quite big. The algorithm's
scalability highly depends on the vocabSize. Even without overflow, there are
still other bottlenecks, for example, syn0Global and syn1Global, each of which
has vocabSize * vectorSize elements.
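To make the scale concrete, here is a back-of-the-envelope sketch; the
vocabulary and vector sizes are illustrative guesses, not numbers from this
thread:

// Driver-side memory for one of the two global weight arrays (Array[Float], 4 bytes/element).
val vocabSize  = 10000000L   // hypothetical: 10M distinct words after filtering
val vectorSize = 100L        // MLlib's default vector size
val bytesPerArray = vocabSize * vectorSize * 4L
println(bytesPerArray / (1024 * 1024))   // ~3814 MB each for syn0Global and syn1Global
// Note also that vocabSize * vectorSize must fit in a 32-bit array index on the JVM.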

Thanks.

Zhan Zhang


On Jan 5, 2015, at 7:47 PM, Eric Zhen wrote:

> Hi Xiangrui,
> 
> Our dataset is about 80 GB (10B lines).
> 
> In the driver's log, we found this:
> 
> INFO Word2Vec: trainWordsCount = -1610413239
> 
> It seems that there is an integer overflow?
> 
> 
> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:
> How big is your dataset, and what is the vocabulary size? -Xiangrui
> 
> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> > Hi,
> >
> > When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
> > CPU usage. Here is the jstack output:
> >
> > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
> > [0x4162e000]
> >java.lang.Thread.State: RUNNABLE
> > at
> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> > at
> > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> > at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> > at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> > at
> > java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> > at
> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> > at
> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> > at
> > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> > at
> > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> > at
> > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> > at
> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> > at
> > org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> > at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> > at com.baidu.inf.WordCount.main(WordCount.scala)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at
> > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > --
> > Best Regards
> 
> 
> 
> -- 
> Best Regards



Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Hi Xiangrui,

Our dataset is about 80 GB (10B lines).

In the driver's log, we found this:

*INFO Word2Vec: trainWordsCount = -1610413239*

It seems that there is an integer overflow?
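That is consistent with a 32-bit counter wrapping. A tiny sketch; 2684554057 is
just the smallest Long whose low 32 bits reproduce the logged value, and the
real count could be that plus any multiple of 2^32:

// A word count accumulated in an Int wraps past Int.MaxValue.
val trueCount = 2684554057L    // hypothetical actual number of training words
println(trueCount.toInt)       // -1610413239, the value in the driver log above
println(Int.MaxValue)          // 2147483647
// Accumulating into a Long avoids the wrap:
println(Seq(2000000000L, 684554057L).sum)   // 2684554057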


On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng wrote:

> How big is your dataset, and what is the vocabulary size? -Xiangrui
>
> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> > Hi,
> >
> > When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
> > CPU usage. Here is the jstack output:
> >
> > "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
> > [0x4162e000]
> >java.lang.Thread.State: RUNNABLE
> > at
> >
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> > at
> >
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> > at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> > at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> > at
> >
> java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> > at
> > java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> > at
> >
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> >
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> >
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> >
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> >
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> > at
> > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> > at
> >
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> > at
> > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> > at
> > java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> > at
> >
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> > at
> >
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> > at
> >
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> > at
> > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> > at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> > at
> >
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> > at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at
> org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> > at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> > at com.baidu.inf.WordCount.main(WordCount.scala)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at
> > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> > at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > --
> > Best Regards
>



-- 
Best Regards


Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui

On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> Hi,
>
> When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100%
> CPU usage. Here is the jstack output:
>
> "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
> [0x4162e000]
>java.lang.Thread.State: RUNNABLE
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> at
> java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> at
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> at com.baidu.inf.WordCount.main(WordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> --
> Best Regards

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org