Hi Antony,

Is it easy for you to try Spark 1.3.0 or master? The ALS performance should be improved in 1.3.0.

-Xiangrui
On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:
> Hi Ilya,
>
> thanks for your insight, this was the right clue. I had default parallelism already set but it was quite low (hundreds), and moreover the number of partitions of the input RDD was low as well, so the chunks were really too big. Increased parallelism and repartitioning seem to be helping...
>
> Thanks!
> Antony.
>
>
> On Thursday, 19 February 2015, 16:45, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
> Hi Antony - you are seeing a problem that I ran into. The underlying issue is your default parallelism setting. What's happening is that within ALS certain RDD operations end up changing the number of partitions you have of your data. For example, if you start with an RDD of 300 partitions, unless default parallelism is set while the algorithm executes, you'll eventually get an RDD with something like 20 partitions. Consequently, your giant data set is now stored across a much smaller number of partitions, so each partition is huge. Then, when a shuffle requires serialization, you run out of heap space trying to serialize it. The solution should be as simple as setting the default parallelism setting.
>
> This is referenced in a JIRA I can't find at the moment.
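For illustration only (this sketch is not from the thread): the fix described above - pinning spark.default.parallelism and repartitioning the input before ALS.trainImplicit() - would look roughly like the following in Scala MLlib (Spark 1.2/1.3 API). The parallelism value, input path, record format, lambda and alpha below are placeholders, not the settings actually used in this thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Pin default parallelism so ALS's intermediate RDDs keep many small
// partitions instead of collapsing into a few huge ones.
val conf = new SparkConf()
  .setAppName("als-implicit")
  .set("spark.default.parallelism", "1000")  // placeholder; size to the data and cluster
val sc = new SparkContext(conf)

// Load ratings however they are stored; the path and CSV format here are made up.
val ratings = sc.textFile("hdfs:///path/to/ratings.csv")
  .map { line =>
    val Array(user, item, value) = line.split(",")
    Rating(user.toInt, item.toInt, value.toDouble)
  }
  .repartition(1000)  // keep the input chunks small as well
  .cache()

// rank=100, iterations=15 as in the original post; lambda and alpha are placeholders.
val model = ALS.trainImplicit(ratings, 100, 15, 0.01, 40.0)

There is also a trainImplicit overload that takes an explicit number of blocks, which controls how ALS partitions its own internal RDDs directly.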
> On Thu, Feb 19, 2015 at 5:10 AM Antony Mayi <antonym...@yahoo.com.invalid> wrote:
>
> now with spark.shuffle.io.preferDirectBufs reverted (to true) I am again getting GC overhead limit exceeded:
>
> === spark stdout ===
> 15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0 (TID 5329, 192.168.1.93): java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>
> === yarn log (same) ===
> 15/02/19 12:08:08 ERROR executor.Executor: Exception in task 7.0 in stage 18.0 (TID 5329)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>
> === yarn nodemanager ===
> 2015-02-19 12:08:13,758 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19014 for container-id container_1424204221358_0013_01_000012: 29.8 GB of 32 GB physical memory used; 31.7 GB of 67.2 GB virtual memory used
> 2015-02-19 12:08:13,778 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19013 for container-id container_1424204221358_0013_01_000008: 1.2 MB of 32 GB physical memory used; 103.6 MB of 67.2 GB virtual memory used
> 2015-02-19 12:08:14,455 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1424204221358_0013_01_000008 is : 143
> 2015-02-19 12:08:14,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1424204221358_0013_01_000008 transitioned from RUNNING to EXITED_WITH_FAILURE
> 2015-02-19 12:08:14,455 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1424204221358_0013_01_000008
>
> Antony.
>
>
> On Thursday, 19 February 2015, 11:54, Antony Mayi <antonym...@yahoo.com.INVALID> wrote:
>
> it is from within the ALS.trainImplicit() call. btw. the exception varies between this "GC overhead limit exceeded" and "Java heap space" (which I guess is just a different outcome of the same problem).
>
> just tried another run and here are the logs (filtered) - note I tried this run with spark.shuffle.io.preferDirectBufs=false, so this might be a slightly different issue from my previous case (going to revert now):
>
> === spark stdout ===
> 15/02/19 10:15:05 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(6, 192.168.1.92, 54289) with no recent heart beats: 50221ms exceeds 45000ms
> 15/02/19 10:16:05 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(13, 192.168.1.90, 56768) with no recent heart beats: 54749ms exceeds 45000ms
> 15/02/19 10:16:44 ERROR cluster.YarnClientClusterScheduler: Lost executor 6 on 192.168.1.92: remote Akka client disassociated
> 15/02/19 10:16:44 WARN scheduler.TaskSetManager: Lost task 57.0 in stage 18.0 (TID 5379, 192.168.1.92): ExecutorLostFailure (executor 6 lost)
> 15/02/19 10:16:44 WARN scheduler.TaskSetManager: Lost task 32.0 in stage 18.0 (TID 5354, 192.168.1.92): ExecutorLostFailure (executor 6 lost)
> 15/02/19 10:16:44 WARN scheduler.TaskSetManager: Lost task 82.0 in stage 18.0 (TID 5404, 192.168.1.92): ExecutorLostFailure (executor 6 lost)
> 15/02/19 10:16:44 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0 (TID 5329, 192.168.1.92): ExecutorLostFailure (executor 6 lost)
> 15/02/19 10:16:44 ERROR cluster.YarnClientSchedulerBackend: Asked to remove non-existent executor 6
> 15/02/19 10:16:54 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 18.0 (TID 5328, 192.168.1.90): FetchFailed(BlockManagerId(6, 192.168.1.92, 54289), shuffleId=6, mapId=227, reduceId=6, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.168.1.92:54289
>         at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>         at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>         at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> === yarn log ===
> 15/02/19 10:15:05 WARN executor.Executor: Told to re-register on heartbeat
> 15/02/19 10:16:02 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
> 15/02/19 10:16:02 WARN server.TransportChannelHandler: Exception in connection from /192.168.1.92:45633
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Java heap space
>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:280)
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>
> === yarn nodemanager log ===
> 2015-02-19 10:16:45,146 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 20284 for container-id container_1424204221358_0012_01_000016: 28.5 GB of 32 GB physical memory used; 29.1 GB of 67.2 GB virtual memory used
> 2015-02-19 10:16:45,163 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 20273 for container-id container_1424204221358_0012_01_000020: 28.5 GB of 32 GB physical memory used; 29.2 GB of 67.2 GB virtual memory used
> 2015-02-19 10:16:46,621 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1424204221358_0012_01_000008 is : 143
> 2015-02-19 10:16:46,621 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1424204221358_0012_01_000008 transitioned from RUNNING to EXITED_WITH_FAILURE
> 2015-02-19 10:16:46,621 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1424204221358_0012_01_000008
>
> thanks for any help,
> Antony.
>
> ps. could that be Java 8 related?
>
>
> On Thursday, 19 February 2015, 11:25, Sean Owen <so...@cloudera.com> wrote:
>
> Oh OK, you are saying you are requesting 25 executors and getting them, got it. You can consider making fewer, bigger executors to pool rather than split up your memory, but at some point it becomes counter-productive. 32GB is a fine executor size.
>
> So you have ~8GB available per task, which seems like plenty. Something else is at work here. Is this error from your code's stages or ALS?
>
> On Thu, Feb 19, 2015 at 10:07 AM, Antony Mayi <antonym...@yahoo.com> wrote:
>> based on spark UI I am running 25 executors for sure. why would you expect four? I submit the task with --num-executors 25 and I get 6-7 executors running per host (using more, smaller executors allows me better cluster utilization when running parallel spark sessions - which is not the case for this reported issue, for now I am using the cluster exclusively).
>>
>> thx,
>> Antony.
>>
>>
>> On Thursday, 19 February 2015, 11:02, Sean Owen <so...@cloudera.com> wrote:
>>
>> This should result in 4 executors, not 25. They should be able to execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB of RAM, not 1TB.
>>
>> It still feels like this shouldn't be running out of memory, not by a long shot though. But just pointing out potential differences between what you are expecting and what you are configuring.
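As a reference point for the sizing discussion, here is the layout from the original post (quoted below) written out as a SparkConf, with the back-of-envelope arithmetic from this thread spelled out in comments. This is illustrative only, not code from the thread; the executor count itself is passed at submit time (--num-executors 25).

import org.apache.spark.SparkConf

// Rough arithmetic for this layout:
//   28 GB executor heap + 4096 MB overhead = ~32 GB per YARN container
//   25 containers * 32 GB                  = ~800 GB of the ~1 TB cluster RAM
//   28 GB heap / 4 concurrent tasks        = ~7 GB heap per task (~8 GB counting overhead)
//   6-7 executors per 256 GB node          = ~192-224 GB requested per node
val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "28G")
  .set("spark.yarn.executor.memoryOverhead", "4096")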
>> On Thu, Feb 19, 2015 at 9:56 AM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:
>>> Hi,
>>>
>>> I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark 1.2.0 in yarn-client mode with the following layout:
>>>
>>> spark.executor.cores=4
>>> spark.executor.memory=28G
>>> spark.yarn.executor.memoryOverhead=4096
>>>
>>> I am submitting a bigger ALS trainImplicit task (rank=100, iters=15) on a dataset with ~3 billion ratings using 25 executors. At some point some executor crashes with:
>>>
>>> 15/02/19 05:41:06 WARN util.AkkaUtils: Error sending message in 1 attempts
>>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>>         at scala.concurrent.Await$.result(package.scala:107)
>>>         at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
>>>         at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:398)
>>> 15/02/19 05:41:06 ERROR executor.Executor: Exception in task 131.0 in stage 51.0 (TID 7259)
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>         at java.lang.reflect.Array.newInstance(Array.java:75)
>>>         at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1671)
>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
>>>         at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
>>>
>>> So the GC overhead limit exceeded is pretty clear and would suggest running out of memory. Since I have 1TB of RAM available, this must rather be due to some suboptimal configuration.
>>>
>>> Can anyone please point me in some direction on how to tackle this?
>>>
>>> Thanks,
>>> Antony.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org