+1 to not threading. What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, that alone could be an issue.
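For instance, something along these lines (a minimal sketch, not your actual job; the text format and S3 paths are assumptions on my part):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("one-df-per-batch").getOrCreate()

    # Instead of one cached DataFrame per file, i.e. N separately cached
    # datasets that the driver has to track individually ...
    #   for path in paths:
    #       spark.read.text(path).cache()
    #
    # ... read the whole batch as a single DataFrame via a glob and cache once:
    df = spark.read.text("s3://my-bucket/incoming/2016-11-18/*.txt").cache()  # hypothetical path
    df.write.parquet("s3://my-bucket/parquet/2016-11-18/")                    # hypothetical path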
If the above two things don't fix your OOM issue, then without knowing anything
else about your job, I would focus on your caching strategy as a potential
culprit. Try running without any caching to isolate the issue; a bad caching
strategy is the source of OOM issues for me most of the time. (A rough sketch
of a cache-free run is at the end of this mail.)

On Nov 18, 2016 6:31 AM, "Keith Bourgoin" <[email protected]> wrote:

> Hi Alexis,
>
> Thanks for the response. I've been working with Irina on trying to sort
> this issue out.
>
> We thread the file processing to amortize the cost of things like getting
> files from S3. It's a pattern we've seen recommended in many places, but I
> don't have any of those links handy. The problem isn't the threading, per
> se, but clearly some sort of memory leak in the driver itself. Each file
> is a self-contained unit of work, so once it's done, all memory related to
> it should be freed. Nothing in the script itself grows over time, so if it
> can do 10 files concurrently, it should be able to run like that forever.
>
> I've hit this same issue working on another Spark app which wasn't
> threaded but produced tens of thousands of jobs. Eventually, the Spark UI
> would get slow, then unresponsive, and then be killed due to OOM.
>
> I'll try to cook up some examples of this today, threaded and not. We were
> hoping that someone had seen this before and it rang a bell. Maybe there's
> a setting to clean up info from old jobs that we can adjust.
>
> Cheers,
>
> Keith.
>
> On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <
> [email protected]> wrote:
>
>> Hi Irina,
>>
>> I would question the use of multiple threads in your application. Since
>> Spark is going to run the processing of each DataFrame on all the cores
>> of your cluster, the processes will be competing for resources. In fact,
>> they would compete not only for CPU cores but also for memory.
>>
>> Spark is designed to run your processes in sequence, and each process
>> will be run in a distributed manner (multiple threads on multiple
>> instances). I would suggest following this principle.
>>
>> Feel free to share the code if you can. It's always helpful so that we
>> can give better advice.
>>
>> Alexis
>>
>> On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <[email protected]> wrote:
>>
>> We have an application that reads text files, converts them to
>> DataFrames, and saves them in Parquet format. The application runs fine
>> when processing a few files, but we have several thousand produced every
>> day. When running the job for all of the files, spark-submit gets killed
>> on OOM:
>>
>> #
>> # java.lang.OutOfMemoryError: Java heap space
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 27226"...
>>
>> The job is written in Python. We're running it on Amazon EMR 5.0 (Spark
>> 2.0.0) with spark-submit. We're using a cluster with one c3.2xlarge
>> master instance (8 cores and 15g of RAM) and 3 c3.4xlarge core instances
>> (16 cores and 30g of RAM each). Spark config settings are as follows:
>>
>> ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
>> ('spark.executors.instances', '3'),
>> ('spark.yarn.executor.memoryOverhead', '9g'),
>> ('spark.executor.cores', '15'),
>> ('spark.executor.memory', '12g'),
>> ('spark.scheduler.mode', 'FIFO'),
>> ('spark.cleaner.ttl', '1800'),
>>
>> The job processes each file in a thread, and we have 10 threads running
>> concurrently. The process will OOM after about 4 hours, at which point
>> Spark has processed over 20,000 jobs.
>>
>> It seems like the driver is running out of memory, but each individual
>> job is quite small. Are there any known memory leaks for long-running
>> Spark applications on YARN?
>>
>> --
>> Alexis Seigneurin
>> Managing Consultant
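The cache-free run I mentioned at the top, as a rough sketch (the S3 paths and the --no-cache flag are hypothetical; the point is only to run the identical pipeline with persistence off and watch whether the driver still grows):

    import argparse
    from pyspark.sql import SparkSession

    # Hypothetical flag so the same job can be launched with and without
    # caching, to check whether the caching strategy drives the driver OOM.
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-cache", action="store_true")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("cache-isolation-test").getOrCreate()

    df = spark.read.text("s3://my-bucket/incoming/*.txt")  # hypothetical path
    if not args.no_cache:
        df = df.cache()  # the only difference between the two runs

    df.write.parquet("s3://my-bucket/parquet/")  # hypothetical path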

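On Keith's question about a setting to clean up info from old jobs: the driver keeps per-job and per-stage history for the UI, and in long-running apps that history is a common source of driver bloat. The settings below bound it; spark.ui.retainedJobs and spark.ui.retainedStages are real Spark settings (default 1000 each, if I remember right), but the values shown are just to illustrate:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Cap how much finished-job/stage state the driver retains for the UI.
    # The values here are illustrative, not recommendations.
    conf = (SparkConf()
            .set("spark.ui.retainedJobs", "200")
            .set("spark.ui.retainedStages", "200")
            .set("spark.sql.ui.retainedExecutions", "200"))
    # While debugging you can also rule the UI out entirely:
    # conf.set("spark.ui.enabled", "false")

    spark = (SparkSession.builder
             .config(conf=conf)
             .appName("trim-ui-history")
             .getOrCreate())

That only caps the UI bookkeeping; if the driver still grows with caching off and history capped, the leak is somewhere else.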