Re: Why Spark having OutOfMemory Exception?

2016-04-21 Thread Zhan Zhang
The data may not be large, but the driver needs to do a lot of bookkeeping. In 
your case, it is possible that the driver control plane is taking too much memory.

I think you can find a Java developer to look at the coredump. Otherwise, it is 
hard to tell exactly which part is using all the memory.
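
If analyzing a full coredump is not practical, a lighter-weight option (a sketch only, with an assumed dump path) is to have the driver JVM write a heap dump when the OOM happens and open it in a tool such as Eclipse MAT or jhat:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("oom-debug")
        # Ask the driver JVM to dump its heap on OutOfMemoryError.
        # Driver JVM options normally have to be supplied at launch time
        # (e.g. via spark-submit --conf ...), because in client mode the
        # driver JVM is already running by the time this code executes.
        .set("spark.driver.extraJavaOptions",
             "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver.hprof"))
sc = SparkContext(conf=conf)

The resulting .hprof file shows which objects dominate the driver heap, which is usually enough to tell whether the memory goes to task results, accumulated DataFrames, or scheduler bookkeeping.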

Thanks.

Zhan Zhang


Re:Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi 


The input data size is less than 10M. The task result size should be even 
smaller, I think, because I am doing aggregation on the data.







Re: Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread Jeff Zhang
Do you mean the input data size is 10M, or the task result size?

>>> But my way is to set up a forever loop to handle continuously incoming data.
>>> Not sure if it is the right way to use spark
I am not sure what this means. Do you use Spark Streaming, or do you run batch
jobs in the forever loop?




Re:Re: Re: Why Spark having OutOfMemory Exception?

2016-04-20 Thread 李明伟
Hi Jeff


The total size of my data is less than 10M. I already set the driver memory to 
4GB.












Re: Re: Why Spark having OutOfMemory Exception?

2016-04-19 Thread Jeff Zhang
It seems to be an OOM on the driver side when fetching task results.

You can try to increase spark.driver.memory and spark.driver.maxResultSize
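
As a rough illustration (the values below are placeholders, not recommendations), spark.driver.maxResultSize can be set in the application itself:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("report-job")
        # Cap on the total size of serialized task results pulled back to the
        # driver for a single action; the default is 1g.
        .set("spark.driver.maxResultSize", "2g"))
sc = SparkContext(conf=conf)

spark.driver.memory itself has to be given when the driver JVM is launched, e.g. spark-submit --driver-memory 4g or an entry in spark-defaults.conf; setting it in SparkConf after the driver has started has no effect in client mode.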

On Tue, Apr 19, 2016 at 4:06 PM, 李明伟  wrote:

> Hi Zhan Zhang
>
>
> Please see the exception trace below. It is reporting a GC overhead limit
> error.
> I am not a Java or Scala developer, so it is hard for me to understand
> this information.
> Also, reading a coredump is too difficult for me.
>
> I am not sure if the way I am using Spark is correct. I understand that
> Spark can do batch or stream calculation, but my way is to set up a forever
> loop to handle continuously incoming data.
> I am not sure if it is the right way to use Spark.
>
>
> 16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread
> task-result-getter-2
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
> at
> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at
> org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
> at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> at
> org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
> at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC
> overhead limit exceeded
> at
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
> at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
> at
> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at
> org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
> at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> 

Re: Why Spark having OutOfMemory Exception?

2016-04-18 Thread Zhan Zhang
What kind of OOM? Driver or executor side? You can use a coredump to find what 
caused the OOM.

Thanks.

Zhan Zhang

On Apr 18, 2016, at 9:44 PM, 李明伟 wrote:

Hi Samaga

Thanks very much for your reply, and sorry for the delayed reply.

Cassandra or Hive is a good suggestion. However, in my situation I am not sure 
it will make sense.

My requirement is to generate a report from the most recent 24 hours of data, 
every 5 minutes.
So if I use Cassandra or Hive, Spark will have to read 24 hours of data every 
5 minutes, and a big part of that data (23 hours or more) will be read repeatedly.

The window in Spark is for stream computing. I did not use it, but I will 
consider it.
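
For what it is worth, a sliding window over a DStream would cover this kind of rolling aggregation without re-reading the whole 24 hours from storage every 5 minutes. A rough Spark Streaming sketch (the directory, checkpoint path and aggregation are placeholders, not the real job):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="rolling-24h-stream")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 300)              # 5-minute batches
ssc.checkpoint("hdfs:///tmp/checkpoint")     # required for window operations

lines = ssc.textFileStream("hdfs:///incoming")   # picks up new files as they arrive
windowed = lines.window(24 * 3600, 300)          # 24-hour window, sliding every 5 minutes

def report(time, rdd):
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(rdd.map(lambda line: (line.split(",")[0], 1)),
                                        ["key", "one"])
        df.groupBy("key").count().show()

windowed.foreachRDD(report)
ssc.start()
ssc.awaitTermination()

A 24-hour window does mean Spark has to keep roughly 288 batches of data around, so for windows this long the Cassandra or Hive approaches may still be the more practical choice.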


Thanks again

Regards
Mingwei





RE: Why Spark having OutOfMemory Exception?

2016-04-11 Thread Lohith Samaga M
Hi Kramer,
Some options:
1. Store in Cassandra with TTL = 24 hours. When you read the full 
table, you get the latest 24 hours of data.
2. Store in Hive as an ORC file and use a timestamp field to filter out the 
old data (a rough sketch follows below).
3. Try windowing in Spark or Flink (I have not used either).
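
As a sketch of option 2 (the table and column names below are made up for illustration), the report job would just push the 24-hour cut-off into the read:

from datetime import datetime, timedelta
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import col

sc = SparkContext(appName="last-24h-report")
hc = HiveContext(sc)

cutoff = (datetime.now() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
recent = (hc.table("events")                    # hypothetical Hive table stored as ORC
            .filter(col("event_ts") >= cutoff)) # assumed timestamp column
recent.groupBy("key").count().show()

With ORC and a timestamp (or partition) column, most of the old data can be skipped at read time instead of being filtered after loading.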


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


-Original Message-
From: kramer2...@126.com [mailto:kramer2...@126.com] 
Sent: Monday, April 11, 2016 16.18
To: user@spark.apache.org
Subject: Why Spark having OutOfMemory Exception?

I use Spark to do some very simple calculations. The description is like below 
(pseudo code):


While timestamp == 5 minutes

df = read_hdf() # Read hdfs to get a dataframe every 5 minutes

my_dict[timestamp] = df # Put the data frame into a dict

delete_old_dataframe( my_dict ) # Delete old dataframes whose timestamp is more 
than 24 hours old

big_df = merge(my_dict) # Merge the recent 24 hours data frame

To explain:

New files come in every 5 minutes, but I need to generate a report on the most 
recent 24 hours of data.
The concept of 24 hours means I need to delete the oldest data frame every time 
I put a new one into the dict.
So I maintain a dict (my_dict in the code above) that maps timestamp to dataframe.
Every time I put a dataframe into the dict, I go through the dict and delete the 
old data frames whose timestamps are more than 24 hours ago.
After the delete and insert, I merge the data frames in the dict into one big 
data frame and run SQL on it to get my report.
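
For reference, a minimal PySpark sketch of this kind of loop (Spark 1.x-era APIs; the path, schema and query are placeholders, not the actual job):

import time
from functools import reduce
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="rolling-24h-report")
sqlContext = SQLContext(sc)

my_dict = {}                 # timestamp -> DataFrame
WINDOW = 24 * 3600           # keep 24 hours of data

while True:
    now = int(time.time())
    df = sqlContext.read.parquet("hdfs:///incoming/%d" % now)   # placeholder path
    my_dict[now] = df

    # Drop frames older than 24 hours so the dict stays bounded.
    for ts in [t for t in my_dict if t < now - WINDOW]:
        del my_dict[ts]

    # Merge the remaining frames and run the report query.
    big_df = reduce(lambda a, b: a.unionAll(b), my_dict.values())
    big_df.registerTempTable("recent")
    sqlContext.sql("SELECT key, count(*) AS cnt FROM recent GROUP BY key").show()

    time.sleep(300)          # every 5 minutes

One thing to note about this pattern is that big_df is rebuilt from up to 288 small DataFrames every cycle, so the query plans and driver-side bookkeeping grow with the window size even when the data itself is tiny.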

*
I want to know if anything is wrong with this model, because it becomes very slow 
after running for a while and then hits OutOfMemory. I know that my memory is enough, 
and the files are very small for test purposes, so there should not be a memory 
problem.

I am wondering if there is a lineage issue, but I am not sure. 

*



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org