Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread 周思华
Hi Lasse,


I ran into this before. I think the non-heap memory trend in the graph you 
attached may be the "expected" result: by default RocksDB keeps a filter 
(bloom filter) in memory for every opened SST file, and the number of SST files 
grows over time, so it can look like a leak. Stefan created an 
issue (https://issues.apache.org/jira/browse/FLINK-7289) to track this, and the 
page https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB from the 
RocksDB wiki gives a better understanding of the memory used by RocksDB. 
Stefan, please correct me if anything above is wrong.
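
If it really is the filters/indexes of the open SST files that keep growing, one possible 
mitigation (just a sketch, assuming Flink's OptionsFactory hook on the RocksDB backend and 
the RocksDB Java options API; checkpointDataUri, the file limit and the cache size are 
placeholders) is to cap how many SST files RocksDB keeps open, which bounds the 
filter/index memory at the cost of re-opening files:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

RocksDBStateBackend backend = new RocksDBStateBackend(checkpointDataUri);
backend.setOptions(new OptionsFactory() {
    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        // limit the number of simultaneously open SST files
        // (and therefore the filters/indexes pinned for them)
        return currentOptions.setMaxOpenFiles(1024);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        // keep the block cache small so more of the container budget
        // is left for filters, indexes and memtables
        return currentOptions.setTableFormatConfig(
                new BlockBasedTableConfig().setBlockCacheSize(64 * 1024 * 1024));
    }
});
env.setStateBackend(backend);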


Best Regards,
Sihua Zhou
On 04/11/2018 09:55, Ted Yu wrote:
Please see the last comment on this issue:


https://github.com/facebook/rocksdb/issues/3216



FYI


On Tue, Apr 10, 2018 at 12:25 AM, Lasse Nedergaard  
wrote:


This graph shows non-heap memory. If the same pattern exists it makes sense that it 
will try to allocate more memory and then exceed the limit. I can see the same trend 
for all the other containers that have been killed. So my question is now: what is 
using non-heap memory?
From 
http://mail-archives.apache.org/mod_mbox/flink-user/201707.mbox/%3ccanc1h_u0dqqvbysdaollbemewaxiimtmfjjcribpfpo0idl...@mail.gmail.com%3E
 it looks like RocksDB could be the culprit.


I have jobs using incremental checkpointing and some without, some optimised for 
FLASH_SSD. All show the same pattern.


Lasse 






2018-04-10 8:52 GMT+02:00 Lasse Nedergaard :

Hi.


I found the exception attached below, for our simple job. It states that our 
TaskManager was killed due to exceeding the memory limit of 2.7 GB. But when I look at 
the Flink metrics just 30 seconds before, it used 1.3 GB heap and 712 MB non-heap, 
around 2 GB in total. 
So something else is also using memory inside the container; any idea how to 
figure out what?
As a side note, we use RocksDBStateBackend with this configuration:


env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval
 * 0.75));
env.enableCheckpointing(config.checkPointInterval, 
CheckpointingMode.AT_LEAST_ONCE);
env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));
Where checkpointDataUri points to S3.


Lasse Nedergaard



2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
- Diagnostics for container container_1522921976871_0001_01_79 
in state COMPLETE : exitStatus=Pmem limit exceeded (-104) diagnostics=Container 
[pid=30118,containerID=container_1522921976871_0001_01_79] is running 
beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB physical memory 
used; 4.9 GB of 13.4 GB virtual memory used. Killing container.

Dump of the process-tree for container_1522921976871_0001_01_79 :

|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

|- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845 
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m 
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
 -Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskManager --configDir . 

|- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c 
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m  
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
 -Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskManager --configDir . 1> 
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.out
 2> 
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.err
 




2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter 
- 
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.Heap.Used=1398739496




2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter 
- 
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.NonHeap.Used=746869520







 


2018-04-09 23:52 GMT+02:00 Ken Krugler :

Hi Chesnay,


Don’t know if this helps, but I’d run into this as well, though I haven’t 
hooked up YourKit to analyze exactly what’s causing the memory problem.


E.g. after about 3.5 hours running locally, it failed with memory issues.



In the TaskManager logs, I start seeing exceptions in my code….


java.lang.OutOfMemoryError: GC overhead limit exceeded


And then eventually...


2018-04-07 21:55:25,686 WARN  
org.apache.flink.runtime.accumulators.AccumulatorRegistry - 

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Ted Yu
Please see the last comment on this issue:

https://github.com/facebook/rocksdb/issues/3216

FYI

On Tue, Apr 10, 2018 at 12:25 AM, Lasse Nedergaard <
lassenederga...@gmail.com> wrote:

>
> This graph shows Non-Heap . If the same pattern exists it make sense that
> it will try to allocate more memory and then exceed the limit. I can see
> the trend for all other containers that has been killed. So my question is
> now, what is using non-heap memory?
> From http://mail-archives.apache.org/mod_mbox/flink-
> user/201707.mbox/%3CCANC1h_u0dQQvbysDAoLLbEmeWaxiimTMFjJC
> ribpfpo0idl...@mail.gmail.com%3E it look like RockDb could be guilty.
>
> I have job using incremental checkpointing and some without, some
> optimised for FLASH_SSD. all have same pattern
>
> Lasse
>
>
>
> 2018-04-10 8:52 GMT+02:00 Lasse Nedergaard :
>
>> Hi.
>>
>> I found the exception attached below, for our simple job. It states that
>> our task-manager was killed du to exceed memory limit on 2.7GB. But when I
>> look at Flink metricts just 30 sec before it use 1.3 GB heap and 712 MB
>> Non-Heap around 2 GB.
>> So something else are also using memory inside the conatianer any idea
>> how to figure out what?
>> As a side note we use RockDBStateBackend with this configuration
>>
>> env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval
>>  * 0.75));
>> env.enableCheckpointing(config.checkPointInterval, 
>> CheckpointingMode.AT_LEAST_ONCE);
>> env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));
>>
>> Where checkpointDataUri point to S3
>>
>> Lasse Nedergaard
>>
>> 2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.YarnFlin
>> kResourceManager- Diagnostics for container
>> container_1522921976871_0001_01_79 in state COMPLETE :
>> exitStatus=Pmem limit exceeded (-104) diagnostics=Container
>> [pid=30118,containerID=container_1522921976871_0001_01_79] is
>> running beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB
>> physical memory used; 4.9 GB of 13.4 GB virtual memory used. Killing
>> container.
>>
>> Dump of the process-tree for container_1522921976871_0001_01_79 :
>>
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>
>> |- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845
>> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
>> -Dlog.file=/var/log/hadoop-yarn/containers/application_15229
>> 21976871_0001/container_1522921976871_0001_01_79/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskManager --configDir .
>>
>> |- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c
>> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
>> -Dlog.file=/var/log/hadoop-yarn/containers/application_15229
>> 21976871_0001/container_1522921976871_0001_01_79/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskManager --configDir . 1>
>> /var/log/hadoop-yarn/containers/application_1522921976871_
>> 0001/container_1522921976871_0001_01_79/taskmanager.out 2>
>> /var/log/hadoop-yarn/containers/application_1522921976871_
>> 0001/container_1522921976871_0001_01_79/taskmanager.err
>>
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>> - gauge.ip-10-1-1-181.taskmanage
>> r.container_1522921976871_0001_01_79.Status.JVM.
>> Memory.Heap.Used=1398739496
>>
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>> - gauge.ip-10-1-1-181.taskmanage
>> r.container_1522921976871_0001_01_79.Status.JVM.
>> Memory.NonHeap.Used=746869520
>>
>>
>>
>>
>>
>> 2018-04-09 23:52 GMT+02:00 Ken Krugler :
>>
>>> Hi Chesnay,
>>>
>>> Don’t know if this helps, but I’d run into this as well, though I
>>> haven’t hooked up YourKit to analyze exactly what’s causing the memory
>>> problem.
>>>
>>> E.g. after about 3.5 hours running locally, it failed with memory issues.
>>>
>>> In the TaskManager logs, I start seeing exceptions in my code….
>>>
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> And then eventually...
>>>
>>> 2018-04-07 21:55:25,686 WARN  
>>> org.apache.flink.runtime.accumulators.AccumulatorRegistry
>>> - Failed to serialize accumulators for task.
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Immediately after this, one of my custom functions gets a close() call,
>>> and I see a log msg about it "switched from RUNNING to FAILED”.
>>>
>>> After this, I see messages that the job is being restarted, but the
>>> TaskManager log output abruptly ends.
>>>
>>> In the Job Manager log, this is what is output following the time of the
>>> last TaskManager logging output:
>>>
>>> 2018-04-07 21:57:33,702 INFO  

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Lasse Nedergaard
This time attached.



2018-04-10 10:41 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

> Can you use third party site for the graph ?
>
> I cannot view it.
>
> Thanks
>
>  Original message 
> From: Lasse Nedergaard <lassenederga...@gmail.com>
> Date: 4/10/18 12:25 AM (GMT-08:00)
> To: Ken Krugler <kkrugler_li...@transpac.com>
> Cc: user <user@flink.apache.org>, Chesnay Schepler <ches...@apache.org>
> Subject: Re: java.lang.Exception: TaskManager was lost/killed
>
>
> This graph shows Non-Heap . If the same pattern exists it make sense that
> it will try to allocate more memory and then exceed the limit. I can see
> the trend for all other containers that has been killed. So my question is
> now, what is using non-heap memory?
> From http://mail-archives.apache.org/mod_mbox/flink-
> user/201707.mbox/%3CCANC1h_u0dQQvbysDAoLLbEmeWaxiimTMFjJC
> ribpfpo0idl...@mail.gmail.com%3E it look like RockDb could be guilty.
>
> I have job using incremental checkpointing and some without, some
> optimised for FLASH_SSD. all have same pattern
>
> Lasse
>
>
>
> 2018-04-10 8:52 GMT+02:00 Lasse Nedergaard <lassenederga...@gmail.com>:
>
>> Hi.
>>
>> I found the exception attached below, for our simple job. It states that
>> our task-manager was killed du to exceed memory limit on 2.7GB. But when I
>> look at Flink metricts just 30 sec before it use 1.3 GB heap and 712 MB
>> Non-Heap around 2 GB.
>> So something else are also using memory inside the conatianer any idea
>> how to figure out what?
>> As a side note we use RockDBStateBackend with this configuration
>>
>> env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval
>>  * 0.75));
>> env.enableCheckpointing(config.checkPointInterval, 
>> CheckpointingMode.AT_LEAST_ONCE);
>> env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));
>>
>> Where checkpointDataUri point to S3
>>
>> Lasse Nedergaard
>>
>> 2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.YarnFlin
>> kResourceManager- Diagnostics for container
>> container_1522921976871_0001_01_79 in state COMPLETE :
>> exitStatus=Pmem limit exceeded (-104) diagnostics=Container
>> [pid=30118,containerID=container_1522921976871_0001_01_79] is
>> running beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB
>> physical memory used; 4.9 GB of 13.4 GB virtual memory used. Killing
>> container.
>>
>> Dump of the process-tree for container_1522921976871_0001_01_79 :
>>
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>
>> |- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845
>> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
>> -Dlog.file=/var/log/hadoop-yarn/containers/application_15229
>> 21976871_0001/container_1522921976871_0001_01_79/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskManager --configDir .
>>
>> |- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c
>> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
>> -Dlog.file=/var/log/hadoop-yarn/containers/application_15229
>> 21976871_0001/container_1522921976871_0001_01_79/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskManager --configDir . 1>
>> /var/log/hadoop-yarn/containers/application_1522921976871_
>> 0001/container_1522921976871_0001_01_79/taskmanager.out 2>
>> /var/log/hadoop-yarn/containers/application_1522921976871_
>> 0001/container_1522921976871_0001_01_79/taskmanager.err
>>
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>> - gauge.ip-10-1-1-181.taskmanage
>> r.container_1522921976871_0001_01_79.Status.JVM.
>> Memory.Heap.Used=1398739496
>>
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>> - gauge.ip-10-1-1-181.taskmanage
>> r.container_1522921976871_0001_01_79.Status.JVM.
>> Memory.NonHeap.Used=746869520
>>
>>
>>
>>
>>
>> 2018-04-09 23:52 GMT+02:00 Ken Krugler <kkrugler_li...@transpac.com>:
>>
>>> Hi Chesnay,
>>>
>>> Don’t know if this helps, but I’d run into this as well, though I
>>> haven’t hooked up YourKit to analyze exactly what’s causing the

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Ted Yu
Can you use a third-party site for the graph?
I cannot view it.

Thanks
 Original message 
From: Lasse Nedergaard <lassenederga...@gmail.com>
Date: 4/10/18 12:25 AM (GMT-08:00)
To: Ken Krugler <kkrugler_li...@transpac.com>
Cc: user <user@flink.apache.org>, Chesnay Schepler <ches...@apache.org>
Subject: Re: java.lang.Exception: TaskManager was lost/killed 

This graph shows Non-Heap . If the same pattern exists it make sense that it 
will try to allocate more memory and then exceed the limit. I can see the trend 
for all other containers that has been killed. So my question is now, what is 
using non-heap memory?From 
http://mail-archives.apache.org/mod_mbox/flink-user/201707.mbox/%3ccanc1h_u0dqqvbysdaollbemewaxiimtmfjjcribpfpo0idl...@mail.gmail.com%3E
 it look like RockDb could be guilty.
I have job using incremental checkpointing and some without, some optimised for 
FLASH_SSD. all have same pattern
Lasse 


2018-04-10 8:52 GMT+02:00 Lasse Nedergaard <lassenederga...@gmail.com>:
Hi.
I found the exception attached below, for our simple job. It states that our 
task-manager was killed du to exceed memory limit on 2.7GB. But when I look at 
Flink metricts just 30 sec before it use 1.3 GB heap and 712 MB Non-Heap around 
2 GB. So something else are also using memory inside the conatianer any idea 
how to figure out what?As a side note we use RockDBStateBackend with this 
configuration
env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval
 * 0.75));
env.enableCheckpointing(config.checkPointInterval, 
CheckpointingMode.AT_LEAST_ONCE);
env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));Where 
checkpointDataUri point to S3
Lasse Nedergaard











2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.YarnFlinkResourceManager    
            - Diagnostics for container container_1522921976871_0001_01_79 
in state COMPLETE : exitStatus=Pmem limit exceeded (-104) diagnostics=Container 
[pid=30118,containerID=container_1522921976871_0001_01_79] is running 
beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB physical memory 
used; 4.9 GB of 13.4 GB virtual memory used. Killing container.
Dump of the process-tree for container_1522921976871_0001_01_79 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845 
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m 
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
 -Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskManager --configDir . 
|- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c 
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m  
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
 -Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskManager --configDir . 1> 
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.out
 2> 
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.err
 


2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter                     
            - 
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.Heap.Used=1398739496


2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter                     
            - 
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.NonHeap.Used=746869520






 
2018-04-09 23:52 GMT+02:00 Ken Krugler <kkrugler_li...@transpac.com>:
Hi Chesnay,
Don’t know if this helps, but I’d run into this as well, though I haven’t 
hooked up YourKit to analyze exactly what’s causing the memory problem.
E.g. after about 3.5 hours running locally, it failed with memory issues.

In the TaskManager logs, I start seeing exceptions in my code….
java.lang.OutOfMemoryError: GC overhead limit exceeded
And then eventually...
2018-04-07 21:55:25,686 WARN  
org.apache.flink.runtime.accumulators.AccumulatorRegistry     - Failed to 
serialize accumulators for task.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Immediately after this, one of my custom functions gets a close() call, and I 
see a log msg about it "switched from RUNNING to FAILED”.
After this, I see messages that the job is being restarted, but the TaskManager 
log output abruptly ends.
In the Job Manager log, this is what is output following the time of the last 
TaskManager logging output:
2018-04-07 21:57:33,702 INFO  
org.apache

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Lasse Nedergaard
This graph shows non-heap memory. If the same pattern exists it makes sense
that it will try to allocate more memory and then exceed the limit. I can see
the same trend for all the other containers that have been killed. So my question is
now: what is using non-heap memory?
From
http://mail-archives.apache.org/mod_mbox/flink-user/201707.mbox/%3ccanc1h_u0dqqvbysdaollbemewaxiimtmfjjcribpfpo0idl...@mail.gmail.com%3E
it looks like RocksDB could be the culprit.

I have jobs using incremental checkpointing and some without, some optimised
for FLASH_SSD. All show the same pattern.
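
For reference, a minimal sketch of how those two variants are typically configured
(assuming the Flink 1.4 RocksDBStateBackend API, where the second constructor argument
enables incremental checkpointing and the FLASH_SSD tuning comes from the predefined
options):

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// true = enable incremental checkpoints for this backend
RocksDBStateBackend backend = new RocksDBStateBackend(config.checkpointDataUri, true);
// tuning profile for flash/SSD local disks
backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
env.setStateBackend(backend);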

Lasse



2018-04-10 8:52 GMT+02:00 Lasse Nedergaard :

> Hi.
>
> I found the exception attached below, for our simple job. It states that
> our task-manager was killed du to exceed memory limit on 2.7GB. But when I
> look at Flink metricts just 30 sec before it use 1.3 GB heap and 712 MB
> Non-Heap around 2 GB.
> So something else are also using memory inside the conatianer any idea how
> to figure out what?
> As a side note we use RockDBStateBackend with this configuration
>
> env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval
>  * 0.75));
> env.enableCheckpointing(config.checkPointInterval, 
> CheckpointingMode.AT_LEAST_ONCE);
> env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));
>
> Where checkpointDataUri point to S3
>
> Lasse Nedergaard
>
> 2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.
> YarnFlinkResourceManager- Diagnostics for container
> container_1522921976871_0001_01_79 in state COMPLETE :
> exitStatus=Pmem limit exceeded (-104) diagnostics=Container
> [pid=30118,containerID=container_1522921976871_0001_01_79] is running
> beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB physical
> memory used; 4.9 GB of 13.4 GB virtual memory used. Killing container.
>
> Dump of the process-tree for container_1522921976871_0001_01_79 :
>
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>
> |- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845
> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
> -Dlog.file=/var/log/hadoop-yarn/containers/application_
> 1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
> -Dlogback.configurationFile=file:./logback.xml
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskManager
> --configDir .
>
> |- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c
> /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
> -Dlog.file=/var/log/hadoop-yarn/containers/application_
> 1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
> -Dlogback.configurationFile=file:./logback.xml
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskManager
> --configDir . 1> /var/log/hadoop-yarn/containers/application_
> 1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.out
> 2> /var/log/hadoop-yarn/containers/application_
> 1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.err
>
>
> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>   - gauge.ip-10-1-1-181.taskmanager.container_
> 1522921976871_0001_01_79.Status.JVM.Memory.Heap.Used=1398739496
>
>
> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
>   - gauge.ip-10-1-1-181.taskmanager.container_
> 1522921976871_0001_01_79.Status.JVM.Memory.NonHeap.Used=746869520
>
>
>
>
>
> 2018-04-09 23:52 GMT+02:00 Ken Krugler :
>
>> Hi Chesnay,
>>
>> Don’t know if this helps, but I’d run into this as well, though I haven’t
>> hooked up YourKit to analyze exactly what’s causing the memory problem.
>>
>> E.g. after about 3.5 hours running locally, it failed with memory issues.
>>
>> In the TaskManager logs, I start seeing exceptions in my code….
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> And then eventually...
>>
>> 2018-04-07 21:55:25,686 WARN  
>> org.apache.flink.runtime.accumulators.AccumulatorRegistry
>> - Failed to serialize accumulators for task.
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> Immediately after this, one of my custom functions gets a close() call,
>> and I see a log msg about it "switched from RUNNING to FAILED”.
>>
>> After this, I see messages that the job is being restarted, but the
>> TaskManager log output abruptly ends.
>>
>> In the Job Manager log, this is what is output following the time of the
>> last TaskManager logging output:
>>
>> 2018-04-07 21:57:33,702 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> - Triggering checkpoint 129 @ 1523163453702
>> 2018-04-07 21:58:43,916 WARN  akka.remote.ReliableDeliverySupervisor
>>- Association with remote system [
>> akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address
>> is now 

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Lasse Nedergaard
Hi.

I found the exception attached below, for our simple job. It states that
our TaskManager was killed due to exceeding the memory limit of 2.7 GB. But when I
look at the Flink metrics just 30 seconds before, it used 1.3 GB heap and 712 MB
non-heap, around 2 GB in total.
So something else is also using memory inside the container; any idea how
to figure out what?
As a side note, we use RocksDBStateBackend with this configuration:

env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval * 0.75));
env.enableCheckpointing(config.checkPointInterval, CheckpointingMode.AT_LEAST_ONCE);
env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));

Where checkpointDataUri points to S3.
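
For what it is worth, my current understanding (which may well be wrong) is that the gap
between -Xmx2063m and the 2.7 GB container limit is Flink's containerized heap cutoff,
which is reserved for off-heap usage such as RocksDB, network buffers and metaspace. If
RocksDB grows beyond that reserve, YARN kills the container. If that is the case, the
reserve can be made larger in flink-conf.yaml (the values below are only placeholders):

# give off-heap users (RocksDB etc.) a larger share of the container
containerized.heap-cutoff-ratio: 0.4
containerized.heap-cutoff-min: 1024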

Lasse Nedergaard

2018-04-09 16:52:01,239 INFO  org.apache.flink.yarn.YarnFlinkResourceManager
  - Diagnostics for container
container_1522921976871_0001_01_79 in state COMPLETE : exitStatus=Pmem
limit exceeded (-104) diagnostics=Container
[pid=30118,containerID=container_1522921976871_0001_01_79] is running
beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB physical
memory used; 4.9 GB of 13.4 GB virtual memory used. Killing container.

Dump of the process-tree for container_1522921976871_0001_01_79 :

|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

|- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
-Dlogback.configurationFile=file:./logback.xml
-Dlog4j.configuration=file:./log4j.properties
org.apache.flink.yarn.YarnTaskManager --configDir .

|- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c
/usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m
-Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.log
-Dlogback.configurationFile=file:./logback.xml
-Dlog4j.configuration=file:./log4j.properties
org.apache.flink.yarn.YarnTaskManager --configDir . 1>
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.out
2>
/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_79/taskmanager.err



2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
-
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.Heap.Used=1398739496


2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter
-
gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_79.Status.JVM.Memory.NonHeap.Used=746869520





2018-04-09 23:52 GMT+02:00 Ken Krugler :

> Hi Chesnay,
>
> Don’t know if this helps, but I’d run into this as well, though I haven’t
> hooked up YourKit to analyze exactly what’s causing the memory problem.
>
> E.g. after about 3.5 hours running locally, it failed with memory issues.
>
> In the TaskManager logs, I start seeing exceptions in my code….
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> And then eventually...
>
> 2018-04-07 21:55:25,686 WARN  
> org.apache.flink.runtime.accumulators.AccumulatorRegistry
> - Failed to serialize accumulators for task.
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Immediately after this, one of my custom functions gets a close() call,
> and I see a log msg about it "switched from RUNNING to FAILED”.
>
> After this, I see messages that the job is being restarted, but the
> TaskManager log output abruptly ends.
>
> In the Job Manager log, this is what is output following the time of the
> last TaskManager logging output:
>
> 2018-04-07 21:57:33,702 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> - Triggering checkpoint 129 @ 1523163453702
> 2018-04-07 21:58:43,916 WARN  akka.remote.ReliableDeliverySupervisor
>- Association with remote system [
> akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address
> is now gated for [5000] ms. Reason: [Disassociated]
> 2018-04-07 21:58:51,084 WARN  akka.remote.transport.netty.NettyTransport
>- Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: kens-
> mbp.hsd1.ca.comcast.net/192.168.3.177:63780
> 2018-04-07 21:58:51,086 WARN  akka.remote.ReliableDeliverySupervisor
>- Association with remote system [akka.tcp://
> fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now
> gated for [5000] ms. Reason: [Association failed with [akka.tcp://
> fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection
> refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
> 2018-04-07 21:59:01,047 WARN  akka.remote.transport.netty.NettyTransport
>- Remote connection 

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Ken Krugler
Hi Chesnay,

Don’t know if this helps, but I’d run into this as well, though I haven’t 
hooked up YourKit to analyze exactly what’s causing the memory problem.

E.g. after about 3.5 hours running locally, it failed with memory issues.

In the TaskManager logs, I start seeing exceptions in my code….

java.lang.OutOfMemoryError: GC overhead limit exceeded

And then eventually...

2018-04-07 21:55:25,686 WARN  
org.apache.flink.runtime.accumulators.AccumulatorRegistry - Failed to 
serialize accumulators for task.
java.lang.OutOfMemoryError: GC overhead limit exceeded

Immediately after this, one of my custom functions gets a close() call, and I 
see a log msg about it "switched from RUNNING to FAILED”.

After this, I see messages that the job is being restarted, but the TaskManager 
log output abruptly ends.

In the Job Manager log, this is what is output following the time of the last 
TaskManager logging output:

2018-04-07 21:57:33,702 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 129 @ 1523163453702
2018-04-07 21:58:43,916 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is 
now gated for [5000] ms. Reason: [Disassociated] 
2018-04-07 21:58:51,084 WARN  akka.remote.transport.netty.NettyTransport
- Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: 
kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:58:51,086 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is 
now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection 
refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:01,047 WARN  akka.remote.transport.netty.NettyTransport
- Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: 
kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:01,050 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is 
now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection 
refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,057 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is 
now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection 
refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:11,058 WARN  akka.remote.transport.netty.NettyTransport
- Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: 
kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.transport.netty.NettyTransport
- Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: 
kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
2018-04-07 21:59:21,049 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is 
now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection 
refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
2018-04-07 21:59:21,056 WARN  akka.remote.RemoteWatcher 
- Detected unreachable: 
[akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]
2018-04-07 21:59:21,063 INFO  org.apache.flink.runtime.jobmanager.JobManager
- Task manager 
akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780/user/taskmanager terminated.
2018-04-07 21:59:21,064 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- 
FetchUrlsFunction for sitemap -> ParseSiteMapFunction -> 
OutlinkToStateUrlFunction (1/1) (3e9374d1bf5fdb359e3a624a4d5d659b) switched 
from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed: 
c51d3879b6244828eb9fc78c943007ad @ kens-mbp.hsd1.ca.comcast.net (dataPort=63782)

— Ken



> On Apr 9, 2018, at 12:48 PM, Chesnay Schepler  wrote:
> 
> We will need more information to offer any solution. The exception simply 
> means that a TaskManager shut down, for which there are a myriad of possible 
> explanations.
> 
> Please have a look at the TaskManager logs, they may contain a hint as to why 
> it shut down.
> 
> On 09.04.2018 

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Hao Sun
Same story here, 1.3.2 on K8s. Very hard to find the reason why a TM is
killed. Not likely caused by a memory leak. If there is a logger I should turn
on, please let me know.

On Mon, Apr 9, 2018, 13:41 Lasse Nedergaard 
wrote:

> We see the same running 1.4.2 on Yarn hosted on Aws EMR cluster. The only
> thing I can find in the logs from are SIGTERM with the code 15 or -100.
> Today our simple job reading from Kinesis and writing to Cassandra was
> killed. The other day in another job I identified a map state.remove
> command to cause a task manager lost without and exception
> I find it frustrating that it is so hard to find the root cause.
> If I look on historical metrics on cpu, heap and non heap I can’t see
> anything that should cause a problem.
> So any ideas about how to debug this kind of exception is much
> appreciated.
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard
>
>
> Den 9. apr. 2018 kl. 21.48 skrev Chesnay Schepler :
>
> We will need more information to offer any solution. The exception simply
> means that a TaskManager shut down, for which there are a myriad of
> possible explanations.
>
> Please have a look at the TaskManager logs, they may contain a hint as to
> why it shut down.
>
> On 09.04.2018 16:01, Javier Lopez wrote:
>
> Hi,
>
> "are you moving the job  jar to  the ~/flink-1.4.2/lib path ?  " -> Yes,
> to every node in the cluster.
>
> On 9 April 2018 at 15:37, miki haiat  wrote:
>
>> Javier
>> "adding the jar file to the /lib path of every task manager"
>> are you moving the job  jar to  the* ~/flink-1.4.2/lib path* ?
>>
>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez 
>> wrote:
>>
>>> Hi,
>>>
>>> We had the same metaspace problem, it was solved by adding the jar file
>>> to the /lib path of every task manager, as explained here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading
>>> .
>>>  As well we
>>> added these java options: "-XX:CompressedClassSpaceSize=100M
>>> -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M "
>>>
>>> From time to time we have the same problem with TaskManagers
>>> disconnecting, but the logs are not useful. We are using 1.3.2.
>>>
>>> On 9 April 2018 at 10:41, Alexander Smirnov <
>>> alexander.smirn...@gmail.com> wrote:
>>>
 I've seen similar problem, but it was not a heap size, but Metaspace.
 It was caused by a job restarting in a loop. Looks like for each
 restart, Flink loads new instance of classes and very soon in runs out of
 metaspace.

 I've created a JIRA issue for this problem, but got no response from
 the development team on it:
 https://issues.apache.org/jira/browse/FLINK-9132
 


 On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:

> thanks a lot,i will try it
>
> On 2018-04-09 00:06:02, "TechnoMage" wrote:
>
> I have seen this when my task manager ran out of RAM.  Increase the
> heap size.
>
> flink-conf.yaml:
> taskmanager.heap.mb
> jobmanager.heap.mb
>
> Michael
>
> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>
> 
> hi all, recently, i found a problem,it runs well when start. But
> after long run,the exception display as above,how can resolve it?
>
>
>
>
>
>
>
>
>

>>>
>>
>
>


Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Lasse Nedergaard
We see the same running 1.4.2 on YARN hosted on an AWS EMR cluster. The only thing 
I can find in the logs are SIGTERMs with code 15 or -100. 
Today our simple job reading from Kinesis and writing to Cassandra was killed. 
The other day, in another job, I identified a map state remove call that caused a 
task manager to be lost without any exception.
I find it frustrating that it is so hard to find the root cause. 
If I look at historical metrics on CPU, heap and non-heap I can't see anything 
that should cause a problem. 
So any ideas about how to debug this kind of exception are much appreciated. 

Med venlig hilsen / Best regards
Lasse Nedergaard


> Den 9. apr. 2018 kl. 21.48 skrev Chesnay Schepler :
> 
> We will need more information to offer any solution. The exception simply 
> means that a TaskManager shut down, for which there are a myriad of possible 
> explanations.
> 
> Please have a look at the TaskManager logs, they may contain a hint as to why 
> it shut down.
> 
>> On 09.04.2018 16:01, Javier Lopez wrote:
>> Hi,
>> 
>> "are you moving the job  jar to  the ~/flink-1.4.2/lib path ?  " -> Yes, to 
>> every node in the cluster.
>> 
>>> On 9 April 2018 at 15:37, miki haiat  wrote:
>>> Javier 
>>> "adding the jar file to the /lib path of every task manager"
>>> are you moving the job  jar to  the ~/flink-1.4.2/lib path ?  
>>> 
 On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez  
 wrote:
 Hi,
 
 We had the same metaspace problem, it was solved by adding the jar file to 
 the /lib path of every task manager, as explained here 
 https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading.
  As well we added these java options: "-XX:CompressedClassSpaceSize=100M 
 -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M "
 
 From time to time we have the same problem with TaskManagers 
 disconnecting, but the logs are not useful. We are using 1.3.2.
 
> On 9 April 2018 at 10:41, Alexander Smirnov 
>  wrote:
> I've seen similar problem, but it was not a heap size, but Metaspace.
> It was caused by a job restarting in a loop. Looks like for each restart, 
> Flink loads new instance of classes and very soon in runs out of 
> metaspace.
> 
> I've created a JIRA issue for this problem, but got no response from the 
> development team on it: https://issues.apache.org/jira/browse/FLINK-9132
> 
> 
>> On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:
>> thanks a lot,i will try it
>> 
>> On 2018-04-09 00:06:02, "TechnoMage" wrote:
>> I have seen this when my task manager ran out of RAM.  Increase the heap 
>> size.
>> 
>> flink-conf.yaml:
>> taskmanager.heap.mb 
>> jobmanager.heap.mb
>> 
>> Michael
>> 
>>> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>>> 
>>> 
>>> hi all, recently, i found a problem,it runs well when start. But after 
>>> long run,the exception display as above,how can resolve it?
>>> 
>>> 
>>>  
>> 
>> 
>> 
>>  
>> 
 
>>> 
>> 
> 


Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Chesnay Schepler
We will need more information to offer any solution. The exception 
simply means that a TaskManager shut down, for which there are a myriad 
of possible explanations.


Please have a look at the TaskManager logs, they may contain a hint as 
to why it shut down.


On 09.04.2018 16:01, Javier Lopez wrote:

Hi,

"are you moving the job  jar to  the ~/flink-1.4.2/lib path ?  " -> 
Yes, to every node in the cluster.


On 9 April 2018 at 15:37, miki haiat > wrote:


Javier
"adding the jar file to the /lib path of every task manager"
are you moving the job  jar to  the* ~/flink-1.4.2/lib path* ?

On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez
> wrote:

Hi,

We had the same metaspace problem, it was solved by adding the
jar file to the /lib path of every task manager, as explained
here

https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading

.
As well we added these java options:
"-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M
-XX:MetaspaceSize=200M "

From time to time we have the same problem with TaskManagers
disconnecting, but the logs are not useful. We are using 1.3.2.

On 9 April 2018 at 10:41, Alexander Smirnov
> wrote:

I've seen similar problem, but it was not a heap size, but
Metaspace.
It was caused by a job restarting in a loop. Looks like
for each restart, Flink loads new instance of classes and
very soon in runs out of metaspace.

I've created a JIRA issue for this problem, but got no
response from the development team on it:
https://issues.apache.org/jira/browse/FLINK-9132



On Mon, Apr 9, 2018 at 11:36 AM 王凯 > wrote:

thanks a lot,i will try it

On 2018-04-09 00:06:02, "TechnoMage" wrote:

I have seen this when my task manager ran out of
RAM. Increase the heap size.

flink-conf.yaml:
taskmanager.heap.mb
jobmanager.heap.mb

Michael


On Apr 8, 2018, at 2:36 AM, 王凯 > wrote:


hi all, recently, i found a problem,it runs well
when start. But after long run,the exception
display as above,how can resolve it?














Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Javier Lopez
Hi,

"are you moving the job  jar to  the ~/flink-1.4.2/lib path ?  " -> Yes, to
every node in the cluster.

On 9 April 2018 at 15:37, miki haiat  wrote:

> Javier
> "adding the jar file to the /lib path of every task manager"
> are you moving the job  jar to  the* ~/flink-1.4.2/lib path* ?
>
> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez 
> wrote:
>
>> Hi,
>>
>> We had the same metaspace problem, it was solved by adding the jar file
>> to the /lib path of every task manager, as explained here
>> https://ci.apache.org/projects/flink/flink-docs-release
>> -1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading.
>> As well we added these java options: "-XX:CompressedClassSpaceSize=100M
>> -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M "
>>
>> From time to time we have the same problem with TaskManagers
>> disconnecting, but the logs are not useful. We are using 1.3.2.
>>
>> On 9 April 2018 at 10:41, Alexander Smirnov > > wrote:
>>
>>> I've seen similar problem, but it was not a heap size, but Metaspace.
>>> It was caused by a job restarting in a loop. Looks like for each
>>> restart, Flink loads new instance of classes and very soon in runs out of
>>> metaspace.
>>>
>>> I've created a JIRA issue for this problem, but got no response from the
>>> development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>>
>>>
>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:
>>>
 thanks a lot,i will try it

On 2018-04-09 00:06:02, "TechnoMage" wrote:

 I have seen this when my task manager ran out of RAM.  Increase the
 heap size.

 flink-conf.yaml:
 taskmanager.heap.mb
 jobmanager.heap.mb

 Michael

 On Apr 8, 2018, at 2:36 AM, 王凯  wrote:

 
 hi all, recently, i found a problem,it runs well when start. But after
 long run,the exception display as above,how can resolve it?









>>>
>>
>


Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread miki haiat
Javier
"adding the jar file to the /lib path of every task manager"
are you moving the job jar to the ~/flink-1.4.2/lib path?

On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez 
wrote:

> Hi,
>
> We had the same metaspace problem, it was solved by adding the jar file to
> the /lib path of every task manager, as explained here
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/
> debugging_classloading.html#avoiding-dynamic-classloading. As well we
> added these java options: "-XX:CompressedClassSpaceSize=100M
> -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M "
>
> From time to time we have the same problem with TaskManagers
> disconnecting, but the logs are not useful. We are using 1.3.2.
>
> On 9 April 2018 at 10:41, Alexander Smirnov 
> wrote:
>
>> I've seen similar problem, but it was not a heap size, but Metaspace.
>> It was caused by a job restarting in a loop. Looks like for each restart,
>> Flink loads new instance of classes and very soon in runs out of metaspace.
>>
>> I've created a JIRA issue for this problem, but got no response from the
>> development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>
>>
>> On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:
>>
>>> thanks a lot,i will try it
>>>
>>> On 2018-04-09 00:06:02, "TechnoMage" wrote:
>>>
>>> I have seen this when my task manager ran out of RAM.  Increase the heap
>>> size.
>>>
>>> flink-conf.yaml:
>>> taskmanager.heap.mb
>>> jobmanager.heap.mb
>>>
>>> Michael
>>>
>>> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>>>
>>> 
>>> hi all, recently, i found a problem,it runs well when start. But after
>>> long run,the exception display as above,how can resolve it?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Javier Lopez
Hi,

We had the same metaspace problem; it was solved by adding the jar file to
the /lib path of every task manager, as explained here:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading.
We also added these Java options: "-XX:CompressedClassSpaceSize=100M
-XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"

From time to time we have the same problem with TaskManagers disconnecting,
but the logs are not useful. We are using 1.3.2.
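
In case it is useful, one place to put those options is flink-conf.yaml (a sketch; the
per-process key below assumes a Flink version that supports it, otherwise plain
env.java.opts applies the options to all processes):

env.java.opts.taskmanager: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"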

On 9 April 2018 at 10:41, Alexander Smirnov 
wrote:

> I've seen similar problem, but it was not a heap size, but Metaspace.
> It was caused by a job restarting in a loop. Looks like for each restart,
> Flink loads new instance of classes and very soon in runs out of metaspace.
>
> I've created a JIRA issue for this problem, but got no response from the
> development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>
>
> On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:
>
>> thanks a lot,i will try it
>>
>> 在 2018-04-09 00:06:02,"TechnoMage"  写道:
>>
>> I have seen this when my task manager ran out of RAM.  Increase the heap
>> size.
>>
>> flink-conf.yaml:
>> taskmanager.heap.mb
>> jobmanager.heap.mb
>>
>> Michael
>>
>> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>>
>> 
>> hi all, recently, i found a problem,it runs well when start. But after
>> long run,the exception display as above,how can resolve it?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Alexander Smirnov
I've seen a similar problem, but it was not heap size, it was Metaspace.
It was caused by a job restarting in a loop. It looks like for each restart,
Flink loads new instances of the classes, and very soon it runs out of metaspace.

I've created a JIRA issue for this problem, but got no response from the
development team on it: https://issues.apache.org/jira/browse/FLINK-9132
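
One way to contain such a restart loop (a sketch only; it bounds the restarts rather
than fixing the classloading leak itself) is a finite restart strategy, so a permanently
failing job eventually stops instead of reloading classes forever:

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;

// allow at most 3 restart attempts, 30 seconds apart, then fail the job for good
env.setRestartStrategy(
    RestartStrategies.fixedDelayRestart(3, Time.of(30, TimeUnit.SECONDS)));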


On Mon, Apr 9, 2018 at 11:36 AM 王凯  wrote:

> thanks a lot,i will try it
>
> On 2018-04-09 00:06:02, "TechnoMage" wrote:
>
> I have seen this when my task manager ran out of RAM.  Increase the heap
> size.
>
> flink-conf.yaml:
> taskmanager.heap.mb
> jobmanager.heap.mb
>
> Michael
>
> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
>
> 
> hi all, recently, i found a problem,it runs well when start. But after
> long run,the exception display as above,how can resolve it?
>
>
>
>
>
>
>
>
>


Re: java.lang.Exception: TaskManager was lost/killed

2018-04-08 Thread TechnoMage
I have seen this when my task manager ran out of RAM.  Increase the heap size.

flink-conf.yaml:
taskmanager.heap.mb 
jobmanager.heap.mb
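
For example (placeholder values, size them to your machines):

taskmanager.heap.mb: 4096
jobmanager.heap.mb: 2048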

Michael

> On Apr 8, 2018, at 2:36 AM, 王凯  wrote:
> 
> 
> hi all, recently I found a problem: it runs well at start, but after a long 
> run the exception above appears. How can I resolve it?
> 
> 
>