+1 to not threading. What does your load look like? If you are loading many files and caching them in N RDDs rather than 1 RDD, that alone could be an issue.
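For instance, something along these lines (a minimal sketch, not your actual job; the text format and S3 paths are assumptions on my part):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("one-df-per-batch").getOrCreate()

    # Instead of one cached DataFrame per file, i.e. N separately cached
    # datasets that the driver has to track individually ...
    #   for path in paths:
    #       spark.read.text(path).cache()
    #
    # ... read the whole batch as a single DataFrame via a glob and cache once:
    df = spark.read.text("s3://my-bucket/incoming/2016-11-18/*.txt").cache()  # hypothetical path
    df.write.parquet("s3://my-bucket/parquet/2016-11-18/")                    # hypothetical path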
If the above two things don't fix your OOM issue, then without knowing anything
else about your job, I would focus on your caching strategy as a potential
culprit. Try running without any caching to isolate the issue; a bad caching
strategy is the source of OOM issues for me most of the time. (A rough sketch
of a cache-free run is at the end of this mail.)

On Nov 18, 2016 6:31 AM, "Keith Bourgoin" <[email protected]> wrote:

> Hi Alexis,
>
> Thanks for the response. I've been working with Irina on trying to sort
> this issue out.
>
> We thread the file processing to amortize the cost of things like getting
> files from S3. It's a pattern we've seen recommended in many places, but I
> don't have any of those links handy. The problem isn't the threading, per
> se, but clearly some sort of memory leak in the driver itself. Each file
> is a self-contained unit of work, so once it's done, all memory related to
> it should be freed. Nothing in the script itself grows over time, so if it
> can do 10 files concurrently, it should be able to run like that forever.
>
> I've hit this same issue working on another Spark app which wasn't
> threaded but produced tens of thousands of jobs. Eventually, the Spark UI
> would get slow, then unresponsive, and then be killed due to OOM.
>
> I'll try to cook up some examples of this today, threaded and not. We were
> hoping that someone had seen this before and it rang a bell. Maybe there's
> a setting to clean up info from old jobs that we can adjust.
>
> Cheers,
>
> Keith.
>
> On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <
> [email protected]> wrote:
>
>> Hi Irina,
>>
>> I would question the use of multiple threads in your application. Since
>> Spark is going to run the processing of each DataFrame on all the cores
>> of your cluster, the processes will be competing for resources. In fact,
>> they would compete not only for CPU cores but also for memory.
>>
>> Spark is designed to run your processes in sequence, and each process
>> will be run in a distributed manner (multiple threads on multiple
>> instances). I would suggest following this principle.
>>
>> Feel free to share the code if you can. It's always helpful so that we
>> can give better advice.
>>
>> Alexis
>>
>> On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <[email protected]> wrote:
>>
>> We have an application that reads text files, converts them to
>> DataFrames, and saves them in Parquet format. The application runs fine
>> when processing a few files, but we have several thousand produced every
>> day. When running the job for all of the files, spark-submit gets killed
>> on OOM:
>>
>> #
>> # java.lang.OutOfMemoryError: Java heap space
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 27226"...
>>
>> The job is written in Python. We're running it on Amazon EMR 5.0 (Spark
>> 2.0.0) with spark-submit. We're using a cluster with one c3.2xlarge
>> master instance (8 cores and 15g of RAM) and 3 c3.4xlarge core instances
>> (16 cores and 30g of RAM each). Spark config settings are as follows:
>>
>> ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
>> ('spark.executors.instances', '3'),
>> ('spark.yarn.executor.memoryOverhead', '9g'),
>> ('spark.executor.cores', '15'),
>> ('spark.executor.memory', '12g'),
>> ('spark.scheduler.mode', 'FIFO'),
>> ('spark.cleaner.ttl', '1800'),
>>
>> The job processes each file in a thread, and we have 10 threads running
>> concurrently. The process will OOM after about 4 hours, at which point
>> Spark has processed over 20,000 jobs.
>>
>> It seems like the driver is running out of memory, but each individual
>> job is quite small. Are there any known memory leaks for long-running
>> Spark applications on YARN?
>>
>> --
>> Alexis Seigneurin
>> Managing Consultant
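The cache-free run I mentioned at the top, as a rough sketch (the S3 paths and the --no-cache flag are hypothetical; the point is only to run the identical pipeline with persistence off and watch whether the driver still grows):

    import argparse
    from pyspark.sql import SparkSession

    # Hypothetical flag so the same job can be launched with and without
    # caching, to check whether the caching strategy drives the driver OOM.
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-cache", action="store_true")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("cache-isolation-test").getOrCreate()

    df = spark.read.text("s3://my-bucket/incoming/*.txt")  # hypothetical path
    if not args.no_cache:
        df = df.cache()  # the only difference between the two runs

    df.write.parquet("s3://my-bucket/parquet/")  # hypothetical path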

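On Keith's question about a setting to clean up info from old jobs: the driver keeps per-job and per-stage history for the UI, and in long-running apps that history is a common source of driver bloat. The settings below bound it; spark.ui.retainedJobs and spark.ui.retainedStages are real Spark settings (default 1000 each, if I remember right), but the values shown are just to illustrate:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Cap how much finished-job/stage state the driver retains for the UI.
    # The values here are illustrative, not recommendations.
    conf = (SparkConf()
            .set("spark.ui.retainedJobs", "200")
            .set("spark.ui.retainedStages", "200")
            .set("spark.sql.ui.retainedExecutions", "200"))
    # While debugging you can also rule the UI out entirely:
    # conf.set("spark.ui.enabled", "false")

    spark = (SparkSession.builder
             .config(conf=conf)
             .appName("trim-ui-history")
             .getOrCreate())

That only caps the UI bookkeeping; if the driver still grows with caching off and history capped, the leak is somewhere else.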