Which version of Spark is this? Is there any chance that a single key - or set of keys - has a large number of values relative to the other keys (a.k.a. skew)?
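One quick way to check, assuming the job builds a pair RDD before the reduce step (I'm calling it "pairs" below purely for illustration), is to count records per key and look at the largest counts - a rough sketch, not your actual code:

    // "pairs" is a stand-in for whatever (key, value) RDD feeds the reduce step
    val keyCounts = pairs
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)                 // one count per key
    keyCounts
      .map { case (k, n) => (n, k) }
      .sortByKey(ascending = false)
      .take(20)                           // the heaviest keys
      .foreach(println)                   // a huge outlier here points to skew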
If so, Spark 1.5 *should* fix this issue with the new Tungsten stuff, although I still had some issues with 1.5.1 in a similar situation. I'm waiting to test with 1.6.0 before I start asking/creating JIRAs.

> On Dec 28, 2015, at 5:23 AM, Eugene Morozov <evgeny.a.moro...@gmail.com>
> wrote:
>
> Kendal,
>
> have you tried to reduce the number of partitions?
>
> --
> Be well!
> Jean Morozov
>
>> On Mon, Dec 28, 2015 at 9:02 AM, kendal <ken...@163.com> wrote:
>> My driver is running OOM with my 4T data set... I don't collect any data
>> to the driver. All the program does is map - reduce - saveAsTextFile. But
>> the number of partitions to be shuffled is quite large - 20K+.
>>
>> The symptom I'm seeing is a timeout when fetching GetMapOutputStatuses
>> from the driver:
>> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Don't have map outputs
>> for shuffle 0, fetching them
>> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Doing the fetch;
>> tracker endpoint =
>> AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.115.58.55:52077/user/MapOutputTracker#-1937024516])
>> 15/12/24 02:06:21 WARN akka.AkkaRpcEndpointRef: Error sending message
>> [message = GetMapOutputStatuses(0)] in 1 attempts
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>>         at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
>>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
>>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
>>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
>>
>> But the root cause is OOM:
>> 15/12/24 02:05:36 ERROR actor.ActorSystemImpl: Uncaught fatal error from
>> thread [sparkDriver-akka.remote.default-remote-dispatcher-24] shutting down
>> ActorSystem [sparkDriver]
>> java.lang.OutOfMemoryError: Java heap space
>>         at java.util.Arrays.copyOf(Arrays.java:2271)
>>         at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
>>         at akka.serialization.JavaSerializer.toBinary(Serializer.scala:131)
>>         at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
>>         at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>>         at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>>         at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>>         at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
>>         at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
>>         at akka.remote.EndpointWriter$$anonfun$4.applyOrElse(Endpoint.scala:718)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>         at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>
>> I've already allocated 16G of memory for my driver - which is the hard
>> MAX limit of my Yarn cluster. And I have also applied Kryo serialization...
>> Any idea how to reduce the memory footprint?
>> And what confuses me is that, even though I have 20K+ partitions to
>> shuffle, why do I need so much memory?!
>>
>> Thank you so much for any help!
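For what it's worth, on the "reduce the number of partitions" suggestion: the serialized map output statuses the driver ships back for GetMapOutputStatuses grow roughly with the number of map tasks times the number of reduce partitions, so 20K+ shuffle partitions can get expensive on the driver even when nothing is collected. A rough sketch of the knobs I would try first - the app name, paths, key extraction, partition count and timeout value below are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // sketch only: everything below is illustrative, not taken from the original job
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.rpc.askTimeout", "300s")   // the timeout named in the trace above
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val pairs = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split("\t")(0), 1L))   // hypothetical key extraction

    // give the reduce an explicit, smaller partition count instead of inheriting 20K+
    pairs.reduceByKey(_ + _, 2000)
      .saveAsTextFile("hdfs:///path/to/output")

Fewer reduce partitions should shrink what the driver has to serialize per map status; the longer ask timeout only papers over the fetch timing out, it won't fix the underlying memory pressure.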