Re: High virtual memory usage

2017-01-05 Thread Stephan Ewen
Happy to hear that!



On Thu, Jan 5, 2017 at 1:34 PM, Paulo Cezar  wrote:

> Hi Stephan, thanks for your support.
>
> I was able to track down the problem a few days ago. Unirest was the one to
> blame: I was using it in some map functions to connect to external services,
> and for some reason it was using insane amounts of virtual memory.
>
> Paulo Cezar
>
> On Mon, Dec 19, 2016 at 11:30 AM Stephan Ewen  wrote:
>
>> Hi Paulo!
>>
>> Hmm, interesting. A large discrepancy between virtual and physical
>> memory usually means that the process either maps large files into memory,
>> or that it pre-allocates a lot of memory without immediately using it.
>> Neither of these things is done by Flink.
>>
>> Could this be an effect of either the Docker environment (mapping certain
>> kernel spaces / libraries / whatever) or a result of one of the libraries
>> (gRPC or so)?
>>
>> Stephan
>>
>>
>> On Mon, Dec 19, 2016 at 12:32 PM, Paulo Cezar 
>> wrote:
>>
>>   - Are you using RocksDB?
>>
>> No.
>>
>>
>>   - What is your flink configuration, especially around memory settings?
>>
>> I'm using the default config with 2 GB for the JobManager and 5 GB for the
>> TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm
>> 2048 -tm 5120 -s 4 -nm 'Flink'"
>>
>>   - What do you use for TaskManager heap size? Any manual value, or do
>> you let Flink/Yarn set it automatically based on container size?
>>
>> No manual values here. The YARN config is pretty much the default, with a
>> maximum allocation of 12 GB of physical memory and a virtual-to-physical
>> memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
>>
>>
>>   - Do you use any libraries or connectors in your program?
>>
>> I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC
>> client, and some HTTP libraries like Unirest and Apache HttpClient.
>>
>>   - Also, can you tell us what OS you are running on?
>>
>> My YARN cluster runs in Docker containers (Docker 1.12) with images based
>> on Ubuntu 14.04. The host OS is Ubuntu 14.04.4 LTS (GNU/Linux
>> 3.19.0-65-generic x86_64).
>>
>>
>>


Re: High virtual memory usage

2017-01-05 Thread Paulo Cezar
Hi Stephan, thanks for your support.

I was able to track down the problem a few days ago. Unirest was the one to
blame: I was using it in some map functions to connect to external services,
and for some reason it was using insane amounts of virtual memory.
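
In case it helps anyone, below is a rough sketch of the pattern I mean (the
endpoint, class, and field names are illustrative, not my actual job code).
The important detail is that Unirest keeps a JVM-wide static HTTP client
alive, so it should at least be shut down in close() when the function is
disposed:

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.Unirest;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Illustrative map function that enriches records via an external HTTP service.
public class ExternalLookupMapper extends RichMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        // Unirest manages a static HttpClient internally; configure it once here.
        Unirest.setTimeouts(5000, 10000);
    }

    @Override
    public String map(String value) throws Exception {
        // Hypothetical endpoint -- replace with the real external service.
        HttpResponse<String> response = Unirest.get("http://example-service/lookup")
                .queryString("key", value)
                .asString();
        return response.getBody();
    }

    @Override
    public void close() throws Exception {
        // Release Unirest's static client and its threads when the task is disposed.
        // Note: the client is JVM-wide, so this assumes no other subtask in the
        // same TaskManager still needs it.
        Unirest.shutdown();
    }
}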

Paulo Cezar

On Mon, Dec 19, 2016 at 11:30 AM Stephan Ewen  wrote:

> Hi Paulo!
>
> Hmm, interesting. A large discrepancy between virtual and physical memory
> usually means that the process either maps large files into memory, or that
> it pre-allocates a lot of memory without immediately using it.
> Neither of these things is done by Flink.
>
> Could this be an effect of either the Docker environment (mapping certain
> kernel spaces / libraries / whatever) or a result of one of the libraries
> (gRPC or so)?
>
> Stephan
>
>
> On Mon, Dec 19, 2016 at 12:32 PM, Paulo Cezar 
> wrote:
>
>   - Are you using RocksDB?
>
> No.
>
>
>   - What is your flink configuration, especially around memory settings?
>
> I'm using the default config with 2 GB for the JobManager and 5 GB for the
> TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048
> -tm 5120 -s 4 -nm 'Flink'"
>
>   - What do you use for TaskManager heap size? Any manual value, or do you
> let Flink/Yarn set it automatically based on container size?
>
> No manual values here. The YARN config is pretty much the default, with a
> maximum allocation of 12 GB of physical memory and a virtual-to-physical
> memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
>
>
>   - Do you use any libraries or connectors in your program?
>
> I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC
> client, and some HTTP libraries like Unirest and Apache HttpClient.
>
>   - Also, can you tell us what OS you are running on?
>
> My YARN cluster runs in Docker containers (Docker 1.12) with images based
> on Ubuntu 14.04. The host OS is Ubuntu 14.04.4 LTS (GNU/Linux
> 3.19.0-65-generic x86_64).
>
>
>


Re: High virtual memory usage

2016-12-19 Thread Stephan Ewen
Hi Paulo!

Hmm, interesting. A large discrepancy between virtual and physical memory
usually means that the process either maps large files into memory, or that
it pre-allocates a lot of memory without immediately using it.
Neither of these things is done by Flink.
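
One way to see where the virtual memory is going is to inspect the
TaskManager process on the host. For example (only a suggestion; replace
<pid> with the TaskManager's process id):

# total virtual vs. resident memory of the TaskManager JVM
grep -E 'VmSize|VmRSS' /proc/<pid>/status

# per-mapping breakdown: large anonymous mappings hint at pre-allocation,
# large file-backed mappings hint at memory-mapped files
pmap -x <pid>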

Could this be an effect of either the Docker environment (mapping certain
kernel spaces / libraries / whatever) or a result of one of the libraries
(gRPC or so)?

Stephan


On Mon, Dec 19, 2016 at 12:32 PM, Paulo Cezar  wrote:

>   - Are you using RocksDB?
>
> No.
>
>
>   - What is your flink configuration, especially around memory settings?
>
> I'm using the default config with 2 GB for the JobManager and 5 GB for the
> TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048
> -tm 5120 -s 4 -nm 'Flink'"
>
>   - What do you use for TaskManager heap size? Any manual value, or do you
> let Flink/Yarn set it automatically based on container size?
>
> No manual values here. The YARN config is pretty much the default, with a
> maximum allocation of 12 GB of physical memory and a virtual-to-physical
> memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
>
>
>   - Do you use any libraries or connectors in your program?
>
> I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC
> client, and some HTTP libraries like Unirest and Apache HttpClient.
>
>   - Also, can you tell us what OS you are running on?
>
> My YARN cluster runs in Docker containers (Docker 1.12) with images based
> on Ubuntu 14.04. The host OS is Ubuntu 14.04.4 LTS (GNU/Linux
> 3.19.0-65-generic x86_64).
>
>


Re: High virtual memory usage

2016-12-19 Thread Paulo Cezar
  - Are you using RocksDB?

No.


  - What is your flink configuration, especially around memory settings?

I'm using the default config with 2 GB for the JobManager and 5 GB for the
TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048
-tm 5120 -s 4 -nm 'Flink'"

  - What do you use for TaskManager heap size? Any manual value, or do you
let Flink/Yarn set it automatically based on container size?

No manual values here. The YARN config is pretty much the default, with a
maximum allocation of 12 GB of physical memory and a virtual-to-physical
memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
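
For reference, that corresponds roughly to the following yarn-site.xml
entries (the values reflect what I described above):

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>12288</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>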


  - Do you use any libraries or connectors in your program?

I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC client,
and some HTTP libraries like Unirest and Apache HttpClient.

  - Also, can you tell us what OS you are running on?

My YARN cluster runs in Docker containers (Docker 1.12) with images based
on Ubuntu 14.04. The host OS is Ubuntu 14.04.4 LTS (GNU/Linux
3.19.0-65-generic x86_64).


Re: High virtual memory usage

2016-12-16 Thread Stephan Ewen
Also, can you tell us what OS you are running on?

On Fri, Dec 16, 2016 at 6:23 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> To diagnose this a little better, can you help us with the following info:
>
>   - Are you using RocksDB?
>   - What is your flink configuration, especially around memory settings?
>   - What do you use for TaskManager heap size? Any manual value, or do you
> let Flink/Yarn set it automatically based on container size?
>   - Do you use any libraries or connectors in your program?
>
> Greetings,
> Stephan
>
>
> On Fri, Dec 16, 2016 at 5:47 PM, Paulo Cezar <paulo.ce...@gogeo.io> wrote:
>
>> Hi Folks,
>>
>> I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few
>> hours after I start a streaming job (built with flink-connector-kafka-0.10_2.11),
>> it gets killed, seemingly for no reason. After inspecting the logs, my best
>> guess is that YARN is killing containers due to high virtual memory usage.
>>
>> Any guesses on why this might be happening or tips of what I should be
>> looking for?
>>
>> What I'll do next is enable taskmanager.debug.memory.startLogThread to
>> keep investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz
>> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz>
>> on YARN, but my job uses Scala 2.11 dependencies, so I'll try using
>> flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz
>> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz>
>> instead.
>>
>>
>>- Flink logs:
>>
>> 2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor 
>>- Association with remote system 
>> [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for 
>> [5000] ms. Reason is: [Disassociated].
>> 2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager 
>>- Container 
>> ResourceID{resourceId='container_1481732559439_0002_01_04'} failed. Exit 
>> status: 1
>> 2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager 
>>- Diagnostics for container 
>> ResourceID{resourceId='container_1481732559439_0002_01_04'} in state 
>> COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
>> Container id: container_1481732559439_0002_01_04
>> Exit code: 1
>> Stack trace: ExitCodeException exitCode=1:
>>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>>  at org.apache.hadoop.util.Shell.run(Shell.java:456)
>>  at 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>>  at 
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>>  at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>  at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Container exited with a non-zero exit code 1
>>
>>
>>
>>- YARN logs:
>>
>> container_1481732559439_0002_01_04: 2.6 GB of 5 GB physical memory used; 
>> 38.1 GB of 10.5 GB virtual memory used
>> 2016-12-15 17:44:03,119 INFO 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>  Memory usage of ProcessTree 62223 for container-id 
>> container_1481732559439_0002_01_01: 656.3 MB of 2 GB physical memory 
>> used; 3.2 GB of 4.2 GB virtual memory used
>> 2016-12-15 17:44:03,766 WARN 
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit 
>> code from container container_1481732559439_0002_01_04 is : 1
>> 2016-12-15 17:44:03,766 WARN 
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
>> Exception from container-launch with container ID: 
>> container_1481732559439_0002_01_04 and exit code: 1
>> ExitCodeException exitCode=1:
>>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>>  at org.apache.hadoop.util.Shell.run(Shell.java:456)
>>  at 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)

Re: High virtual memory usage

2016-12-16 Thread Stephan Ewen
Hi!

To diagnose this a little better, can you help us with the following info:

  - Are you using RocksDB?
  - What is your flink configuration, especially around memory settings?
  - What do you use for TaskManager heap size? Any manual value, or do you
let Flink/Yarn set it automatically based on container size?
  - Do you use any libraries or connectors in your program?

Greetings,
Stephan


On Fri, Dec 16, 2016 at 5:47 PM, Paulo Cezar <paulo.ce...@gogeo.io> wrote:

> Hi Folks,
>
> I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few
> hours after I start a streaming job (built with flink-connector-kafka-0.10_2.11),
> it gets killed, seemingly for no reason. After inspecting the logs, my best
> guess is that YARN is killing containers due to high virtual memory usage.
>
> Any guesses on why this might be happening or tips of what I should be
> looking for?
>
> What I'll do next is enable taskmanager.debug.memory.startLogThread to
> keep investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz
> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz>
> on YARN, but my job uses Scala 2.11 dependencies, so I'll try using
> flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz
> <https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz>
> instead.
>
>
>- Flink logs:
>
> 2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor  
>   - Association with remote system 
> [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] 
> ms. Reason is: [Disassociated].
> 2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>   - Container 
> ResourceID{resourceId='container_1481732559439_0002_01_04'} failed. Exit 
> status: 1
> 2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  
>   - Diagnostics for container 
> ResourceID{resourceId='container_1481732559439_0002_01_04'} in state 
> COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
> Container id: container_1481732559439_0002_01_04
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>   at org.apache.hadoop.util.Shell.run(Shell.java:456)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> Container exited with a non-zero exit code 1
>
>
>
>- YARN logs:
>
> container_1481732559439_0002_01_04: 2.6 GB of 5 GB physical memory used; 
> 38.1 GB of 10.5 GB virtual memory used
> 2016-12-15 17:44:03,119 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 62223 for container-id 
> container_1481732559439_0002_01_01: 656.3 MB of 2 GB physical memory 
> used; 3.2 GB of 4.2 GB virtual memory used
> 2016-12-15 17:44:03,766 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1481732559439_0002_01_04 is : 1
> 2016-12-15 17:44:03,766 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
> from container-launch with container ID: 
> container_1481732559439_0002_01_04 and exit code: 1
> ExitCodeException exitCode=1:
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>   at org.apache.hadoop.util.Shell.run(Shell.java:456)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> Best regards,
> Paulo Cezar
>


High virtual memory usage

2016-12-16 Thread Paulo Cezar
Hi Folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few
hours after I start a streaming job (built with flink-connector-kafka-0.10_2.11),
it gets killed, seemingly for no reason. After inspecting the logs, my best
guess is that YARN is killing containers due to high virtual memory usage.

Any guesses on why this might be happening or tips of what I should be
looking for?

What I'll do next is enable taskmanager.debug.memory.startLogThread to keep
investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz
<https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2.tgz>
on YARN, but my job uses Scala 2.11 dependencies, so I'll try using
flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz
<https://s3.amazonaws.com/flink-nightly/flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz>
instead.
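
If anyone wants to follow along, enabling the memory logger should just be a
matter of adding something like this to conf/flink-conf.yaml (the interval is
only an example):

taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 5000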


   - Flink logs:

2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor
 - Association with remote system
[akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for
[5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO
org.apache.flink.yarn.YarnFlinkResourceManager-
Container ResourceID{resourceId='container_1481732559439_0002_01_04'}
failed. Exit status: 1
2016-12-15 17:44:05,476 INFO
org.apache.flink.yarn.YarnFlinkResourceManager-
Diagnostics for container
ResourceID{resourceId='container_1481732559439_0002_01_04'} in
state COMPLETE : exitStatus=1 diagnostics=Exception from
container-launch.
Container id: container_1481732559439_0002_01_04
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1



   - YARN logs:

container_1481732559439_0002_01_04: 2.6 GB of 5 GB physical memory
used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 62223 for container-id
container_1481732559439_0002_01_01: 656.3 MB of 2 GB physical
memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
Exit code from container container_1481732559439_0002_01_04 is : 1
2016-12-15 17:44:03,766 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
Exception from container-launch with container ID:
container_1481732559439_0002_01_04 and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Best regards,
Paulo Cezar