Hi List,

We've recently been trying to run Spark on Mesos, but we hit a fatal problem: the mesos-master process continuously consumes memory until it is killed by the OOM killer. This only happens while a Spark job (fine-grained mode) is running.
We finally root-caused the issue: the Spark executor attaches the RDD computation result to the TaskStatus, as shown below:

---------------------->8------------------------------->8-------------
spark.git/core/src/main/scala/org/apache/spark/executor/Executor.scala

val serializedDirectResult = ser.serialize(directResult)
logInfo("Serialized size of result for " + taskId + " is " +
  serializedDirectResult.limit)
val serializedResult = {
  if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
      AkkaUtils.reservedSizeBytes) {
    logInfo("Storing result for " + taskId + " in local BlockManager")
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(
      blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId))
  } else {
    logInfo("Sending result for " + taskId + " directly to driver")
    serializedDirectResult
  }
}
execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
logInfo("Finished task ID " + taskId)
---------------------->8------------------------------->8-------------

The executor log shows how large serializedResult can be:

14/08/22 13:29:18 INFO Executor: Serialized size of result for 248 is 17573033

Since in fine-grained mode every single Spark stage finishes in, say, 10 seconds and may have tens of tasks, this generally drives mesos-master to OOM within tens of minutes. I'm not familiar with Spark internals, so I'm wondering: should we avoid storing serializedResult in the TaskStatus?

--
Thanks,
Chengwei
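For clarity, the size check in the snippet above can be reduced to standalone branch logic. This is only a sketch: the object and case names below (ResultRouting, DirectInTaskStatus, IndirectViaBlockManager) are made up for illustration and are not Spark API; only the comparison itself mirrors the Executor.scala code.

```scala
// Minimal sketch of the routing decision in Executor.scala above.
// akkaFrameSize and reservedSizeBytes mirror the names in the snippet;
// everything else here is an illustrative stand-in, not Spark's types.
object ResultRouting {
  sealed trait Route
  // Result bytes ride inside the statusUpdate() payload itself.
  case object DirectInTaskStatus extends Route
  // Only a small serialized BlockId rides in statusUpdate();
  // the actual bytes stay in the executor's BlockManager.
  case object IndirectViaBlockManager extends Route

  def route(resultSize: Long, akkaFrameSize: Long, reservedSizeBytes: Long): Route =
    if (resultSize >= akkaFrameSize - reservedSizeBytes) IndirectViaBlockManager
    else DirectInTaskStatus
}
```

With this rule, any result below the frame-size threshold is serialized straight into the TaskStatus that Mesos has to carry, which would explain the accumulation in mesos-master when many such tasks finish per second.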