Re: High virtual memory usage
Hi Stephan, thanks for your support. I was able to track down the problem a few days ago. Unirest was the one to blame: I was using it in some map functions to connect to external services and for some reason it was using insane amounts of virtual memory.

Paulo Cezar

On Mon, Dec 19, 2016 at 11:30 AM Stephan Ewen <se...@apache.org> wrote:

> Hi Paulo!
>
> Hmm, interesting. The high discrepancy between virtual and physical memory
> usually means that the process either maps large files into memory, or that
> it pre-allocates a lot of memory without immediately using it.
> Neither of these things is done by Flink.
>
> Could this be an effect of either the Docker environment (mapping certain
> kernel spaces / libraries / whatever) or a result of one of the libraries
> (gRPC or so)?
>
> Stephan
>
> On Mon, Dec 19, 2016 at 12:32 PM, Paulo Cezar <paulo.ce...@gogeo.io> wrote:
>
> - Are you using RocksDB?
>
> No.
>
> - What is your flink configuration, especially around memory settings?
>
> I'm using the default config with 2GB for the jobmanager and 5GB for
> taskmanagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5
> -jm 2048 -tm 5120 -s 4 -nm 'Flink'"
>
> - What do you use for TaskManager heap size? Any manual value, or do you
> let Flink/Yarn set it automatically based on container size?
>
> No manual values here. YARN config is pretty much default, with a maximum
> allocation of 12GB of physical memory and a ratio of virtual to physical
> memory of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
>
> - Do you use any libraries or connectors in your program?
>
> I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC
> client and some HTTP libraries like Unirest and Apache HttpClient.
>
> - Also, can you tell us what OS you are running on?
>
> My YARN cluster runs on Docker containers (Docker version 1.12) with
> images based on Ubuntu 14.04. Host OS is Ubuntu 14.04.4 LTS (GNU/Linux
> 3.19.0-65-generic x86_64).
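[Editor's note] The fix for leaks like this is usually to tie the HTTP client's lifecycle to the task's lifecycle rather than creating clients inside the map call. Below is a JDK-only sketch of that pattern; all class and method names are illustrative, not Flink's or Unirest's API. In a real Flink job this would be a RichMapFunction, with the client configured once in open() and released in close() (for Unirest 1.x, that release would be Unirest.shutdown()):

```java
// Hypothetical sketch of an operator-scoped client lifecycle.
// EnrichmentClient stands in for an HTTP library such as Unirest;
// connect()/shutdown() stand in for its one-time setup and teardown.
class EnrichmentClient {
    private boolean open;

    void connect() { open = true; }   // e.g. configure the HTTP client once
    void shutdown() { open = false; } // e.g. Unirest.shutdown()

    String fetch(String key) {        // e.g. one REST call per record
        if (!open) throw new IllegalStateException("client not connected");
        return key + "-enriched";
    }
}

class EnrichMapFunction {
    private EnrichmentClient client;

    // In Flink this would be RichMapFunction.open(): called once per
    // parallel task instance, not once per record.
    public void open() {
        client = new EnrichmentClient();
        client.connect();
    }

    public String map(String value) {
        return client.fetch(value);
    }

    // In Flink this would be RichMapFunction.close(): releasing the client
    // here keeps its threads and buffers from outliving the task.
    public void close() {
        client.shutdown();
    }
}
```

The key design point is that map() only uses the client; it never constructs one, so long-running streaming tasks accumulate no per-record resources.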
Re: High virtual memory usage
- Are you using RocksDB?

No.

- What is your flink configuration, especially around memory settings?

I'm using the default config with 2GB for the jobmanager and 5GB for taskmanagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048 -tm 5120 -s 4 -nm 'Flink'"

- What do you use for TaskManager heap size? Any manual value, or do you let Flink/Yarn set it automatically based on container size?

No manual values here. YARN config is pretty much default, with a maximum allocation of 12GB of physical memory and a ratio of virtual to physical memory of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).

- Do you use any libraries or connectors in your program?

I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC client and some HTTP libraries like Unirest and Apache HttpClient.

- Also, can you tell us what OS you are running on?

My YARN cluster runs on Docker containers (Docker version 1.12) with images based on Ubuntu 14.04. Host OS is Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-65-generic x86_64).
High virtual memory usage
Hi Folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few hours after I start a streaming job (built using the Kafka 0.10 connector, Scala 2.11) it gets killed, seemingly for no reason. After inspecting the logs, my best guess is that YARN is killing containers due to high virtual memory usage. Any guesses on why this might be happening, or tips on what I should be looking for?

What I'll do next is enable taskmanager.debug.memory.startLogThread to keep investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz> on YARN, but my job uses Scala 2.11 dependencies so I'll try using flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz> instead.

- Flink logs:

2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container ResourceID{resourceId='container_1481732559439_0002_01_04'} failed. Exit status: 1
2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container ResourceID{resourceId='container_1481732559439_0002_01_04'} in state COMPLETE: exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_1481732559439_0002_01_04
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

- YARN logs:

container_1481732559439_0002_01_04: 2.6 GB of 5 GB physical memory used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 62223 for container-id container_1481732559439_0002_01_01: 656.3 MB of 2 GB physical memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1481732559439_0002_01_04 is : 1
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1481732559439_0002_01_04 and exit code: 1
ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Best regards,
Paulo Cezar
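[Editor's note] The YARN log lines explain the kill: with a 5 GB container and the default yarn.nodemanager.vmem-pmem-ratio of 2.1, the virtual-memory cap is 5 GB x 2.1 = 10.5 GB, and the process was at 38.1 GB (likewise 2 GB x 2.1 = 4.2 GB for the 2 GB jobmanager container). If the virtual-memory footprint turns out to be benign (mapped files, pre-reserved but untouched allocations), a common workaround is raising the ratio or disabling the vmem check in yarn-site.xml. A sketch; the value 4 is an arbitrary example, and disabling the check is a blunt instrument compared to finding the actual source, as this thread eventually did:

```xml
<!-- yarn-site.xml: relax or disable YARN's virtual-memory enforcement -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```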
Failures on DataSet programs
Hi Folks,

I was wondering if it's possible to keep partial outputs from DataSet programs. I have a batch pipeline that writes its output to HDFS using writeAsFormattedText. When the job fails, the output file is deleted, but I would like to keep it so that I can generate new inputs for the pipeline and avoid reprocessing.

[]'s
Paulo Cezar
OutOfMemoryError
Hi folks,

I'm trying to run a DataSet program, but after around 200k records are processed a "java.lang.OutOfMemoryError: unable to create new native thread" stops me.

I'm deploying Flink (via bin/yarn-session.sh) on a YARN cluster with 10 nodes (each with 8 cores) and starting 10 task managers, each with 8 slots and 6GB of RAM. Except for the data sink, which writes to HDFS and runs with a parallelism of 1, my job runs with a parallelism of 80 and has two input datasets; each is an HDFS file of around 6GB and 20 million lines. Most of my map functions use external services via RPC or REST APIs to enrich the raw data with info from other sources.

Might I be doing something wrong, or should I really have more memory available?

Thanks,
Paulo Cezar
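[Editor's note] "unable to create new native thread" usually means the process hit the OS thread limit (or ran out of virtual address space for thread stacks), not that the heap is full. A common cause in jobs like this is an RPC/REST client that spawns threads per record or per client instance inside the map functions. Below is a JDK-only sketch of the usual fix: one bounded, shared executor per task, created once and shut down with the task. All names are illustrative, not Flink's API; in a RichMapFunction the pool would be created in open() and shut down in close():

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical enricher: the shared fixed-size pool keeps the thread count
// bounded no matter how many records pass through enrich(), whereas a new
// thread per call would cost one stack's worth of virtual memory each and
// eventually exhaust the OS thread limit.
class BoundedEnricher implements AutoCloseable {
    private final ExecutorService pool;

    BoundedEnricher(int threads) {
        // Created once per task instance (in Flink: in open()).
        pool = Executors.newFixedThreadPool(threads);
    }

    String enrich(String record) throws Exception {
        // Stands in for an RPC/REST call, executed on the shared pool.
        Future<String> f = pool.submit(() -> record + "-enriched");
        return f.get(5, TimeUnit.SECONDS);
    }

    @Override
    public void close() throws InterruptedException {
        // In Flink: in close(), so the pool dies with the task.
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Also worth checking under the stated assumptions: 8 slots per task manager means up to 8 operator chains share one JVM, so any per-slot client with its own thread pool multiplies quickly across 80 parallel subtasks.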