Re: runtime.resourcemanager

2018-12-11 Thread Piotr Nowojski
Hey,

Is that the whole Task Manager log? Have you checked for memory issues on both 
the Task Managers and the Job Manager, such as out-of-memory errors or long GC 
pauses, as I suggested in the first email?
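
As a rough illustration of the memory check, here is a minimal sketch that 
reports heap usage and cumulative GC time through the standard JMX beans. It 
only sees the JVM it runs in, so it would have to be called from code running 
inside the Task Manager or Job Manager; the class name `HeapAndGcReport` is 
made up for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapAndGcReport {

    /** Builds a one-line summary of heap usage and cumulative GC activity for this JVM. */
    static String report() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcMillis += Math.max(0, gc.getCollectionTime());
        }
        // getMax() is -1 when the maximum heap size is undefined.
        return String.format("heap used=%d MiB, max=%d MiB, gc count=%d, gc time=%d ms",
                heap.getUsed() >> 20, heap.getMax() >> 20, gcCount, gcMillis);
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the GC time grows by seconds between two calls while the job is stuck, long 
GC pauses are a plausible suspect.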

After you rule out memory issues, you could capture a couple of thread dumps 
(`kill -3 JVM_PID` or `jstack JVM_PID`) and check whether any thread is stuck 
in your code.
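
For reference, the same kind of check can be sketched from inside a JVM with 
the standard `ThreadMXBean`. This is not a replacement for `jstack`, only an 
illustration of what to look for: threads whose stack contains frames from 
your own package. The class name and the `com.example.` prefix are 
hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

public class StuckThreadFinder {

    /** Returns live threads whose stack contains a frame from the given package prefix. */
    static List<String> threadsIn(String packagePrefix) {
        List<String> hits = new ArrayList<>();
        // false, false: skip monitor and synchronizer details, stack traces are enough here.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(packagePrefix)) {
                    // Record the first (topmost) matching frame of this thread.
                    hits.add(info.getThreadName() + " [" + info.getThreadState() + "] at " + frame);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String prefix = args.length > 0 ? args[0] : "com.example."; // hypothetical user package
        threadsIn(prefix).forEach(System.out::println);
    }
}
```

A thread that shows the same user-code frame in several dumps taken a few 
seconds apart is a good candidate for being stuck.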

Another potential issue: are you sure that you have a healthy network between 
nodes? No packet loss, low ping, etc.?
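
A quick way to get a first impression is to time a few TCP connections to the 
Job Manager RPC port (6123 in the log quoted below). The following is only a 
sketch: connect time is a rough proxy for latency and does not measure packet 
loss, and the class name is made up for this example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectLatency {

    /** Returns the time in milliseconds needed to open a TCP connection, or -1 on failure. */
    static long connectMillis(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            // Refused, unreachable, or timed out.
            return -1;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        for (int i = 0; i < 5; i++) {
            System.out.println(host + ":" + port + " -> " + connectMillis(host, port, 2000) + " ms");
        }
    }
}
```

Failures (-1) or widely varying times between nodes would be a reason to look 
at the network more closely with proper tools.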

Piotrek

> On 10 Dec 2018, at 17:44, Alieh wrote:
> 
> Hello,
> 
> this is the task manager log, but it does not change after I run the program.  
> I think the Flink planner has a problem with my program. It cannot even start 
> the job.
> 
> Best,
> 
> Alieh
> 
> 
> 
> 2018-12-10 12:20:20,386 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> 
> 2018-12-10 12:20:20,387 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Starting 
> TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
> 2018-12-10 12:20:20,387 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  OS current 
> user: alieh
> 2018-12-10 12:20:20,609 WARN  org.apache.hadoop.util.NativeCodeLoader 
>   - Unable to load native-hadoop library for your platform... 
> using builtin-java classes where applicable
> 2018-12-10 12:20:20,768 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Current 
> Hadoop/Kerberos user: alieh
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM: Java 
> HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Maximum heap 
> size: 922 MiBytes
> 2018-12-10 12:20:20,769 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JAVA_HOME: 
> /usr/lib/jvm/java-8-oracle
> 2018-12-10 12:20:20,774 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Hadoop 
> version: 2.4.1
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM Options:
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> -XX:+UseG1GC
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -Xms922M
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -Xmx922M
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> -XX:MaxDirectMemorySize=8388607T
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> -Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> -Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties 
> 
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> -Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml 
> 
> 2018-12-10 12:20:20,775 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Program 
> Arguments:
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> --configDir
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> /home/alieh/flink-1.6.0/conf
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Classpath: 
> /home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
> 2018-12-10 12:20:20,776 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
> 
> 2018-12-10 12:20:20,777 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Registered 
> UNIX signal handlers for [TERM, HUP, INT]
> 2018-12-10 12:20:20,785 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Maximum 
> number of open file descriptors is 1048576.
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.rpc.address, localhost
> 2018-12-10 12:20:20,803 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.rpc.port, 6123
> 2018-12-10 12:20:20,803 INFO  
> 

Re: runtime.resourcemanager

2018-12-10 Thread Alieh

Hello,

this is the task manager log, but it does not change after I run the 
program.  I think the Flink planner has a problem with my program. It 
cannot even start the job.


Best,

Alieh


2018-12-10 12:20:20,386 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 

2018-12-10 12:20:20,387 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Starting 
TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-12-10 12:20:20,387 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  OS current 
user: alieh
2018-12-10 12:20:20,609 WARN  org.apache.hadoop.util.NativeCodeLoader   
- Unable to load native-hadoop library for your platform... using 
builtin-java classes where applicable
2018-12-10 12:20:20,768 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Current 
Hadoop/Kerberos user: alieh
2018-12-10 12:20:20,769 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM: Java 
HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
2018-12-10 12:20:20,769 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Maximum heap 
size: 922 MiBytes
2018-12-10 12:20:20,769 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JAVA_HOME: 
/usr/lib/jvm/java-8-oracle
2018-12-10 12:20:20,774 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Hadoop 
version: 2.4.1
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  JVM Options:
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -XX:+UseG1GC
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -Xms922M
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - -Xmx922M
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-XX:MaxDirectMemorySize=8388607T
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
-Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml
2018-12-10 12:20:20,775 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Program 
Arguments:
2018-12-10 12:20:20,776 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - --configDir
2018-12-10 12:20:20,776 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 
/home/alieh/flink-1.6.0/conf
2018-12-10 12:20:20,776 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   -  Classpath: 
/home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
2018-12-10 12:20:20,776 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - 

2018-12-10 12:20:20,777 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Registered UNIX 
signal handlers for [TERM, HUP, INT]
2018-12-10 12:20:20,785 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Maximum number 
of open file descriptors is 1048576.
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.rpc.address, localhost
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.rpc.port, 6123
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: taskmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: taskmanager.numberOfTaskSlots, 1
2018-12-10 12:20:20,803 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: parallelism.default, 1
2018-12-10 12:20:20,804 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: rest.port, 8081

Re: runtime.resourcemanager

2018-12-10 Thread Piotr Nowojski
Hi,

Have you checked the task managers' logs?

Piotrek

> On 8 Dec 2018, at 12:23, Alieh  wrote:
> 
> Hello Piotrek,
> 
> thank you for your answer. I installed Flink on a local cluster and used the 
> GUI to monitor the task managers. It seems the program does not start at all. 
> The whole time, only the job manager is struggling... Even for very small toy 
> examples, the job starts only after a long time (during which I see the job 
> manager logs I mentioned before) and then executes in 2 seconds.
> 
> Best,
> 
> Alieh
> 
> 
> On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
>> Hi,
>> 
>> Please investigate the logs/standard output/error from the task manager that 
>> has failed (the logs that you showed are from the job manager). There is 
>> probably some obvious error/exception explaining why it has failed. The most 
>> common reasons:
>> - out of memory
>> - long GC pause
>> - seg fault or other error from some native library
>> - task manager killed via, for example, SIGKILL
>> 
>> Piotrek
>> 
>>> On 6 Dec 2018, at 17:34, Alieh wrote:
>>> 
>>> Hello all,
>>> 
>>> I have an algorithm x() which contains several joins and uses Gelly 
>>> ConnectedComponents three times. The problem is that if I call x() inside a 
>>> script more than three times, I receive the messages listed below in the 
>>> log and the program somehow stops. It happens even if I run it on a toy 
>>> graph with fewer than 10 vertices. Do you have any clue what the problem 
>>> is?
>>> 
>>> Cheers,
>>> 
>>> Alieh
>>> 
>>> 
>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>> Trigger heartbeat request.
>>> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>> Trigger heartbeat request.
>>> 129150 [flink-akka.actor.default-dispatcher-20] DEBUG 
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat 
>>> request from e80ec35f3d0a04a68000ecbdc555f98b.
>>> 129150 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>> Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
>>> Received new slot report from TaskManager 
>>> 78cdd7a4-0c00-4912-992f-a2990a5d46db.
>>> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received 
>>> slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release 
>>> TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle 
>>> timeout.
>>> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 
>>> 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.
>>> 
>> 
> 



Re: runtime.resourcemanager

2018-12-08 Thread Alieh

Hello Piotrek,

thank you for your answer. I installed Flink on a local cluster and 
used the GUI to monitor the task managers. It seems the program 
*does not start at all*. The whole time, only the job manager is 
struggling... Even for very small toy examples, the job starts only 
after a long time (during which I see the job manager logs I mentioned 
before) and then executes in 2 seconds.


Best,

Alieh


On 12/07/2018 10:43 AM, Piotr Nowojski wrote:

Hi,

Please investigate the logs/standard output/error from the task manager that 
has failed (the logs that you showed are from the job manager). There is 
probably some obvious error/exception explaining why it has failed. The most 
common reasons:
- out of memory
- long GC pause
- seg fault or other error from some native library
- task manager killed via, for example, SIGKILL

Piotrek


On 6 Dec 2018, at 17:34, Alieh  wrote:

Hello all,

I have an algorithm x() which contains several joins and uses Gelly 
ConnectedComponents three times. The problem is that if I call x() inside a 
script more than three times, I receive the messages listed below in the log 
and the program somehow stops. It happens even if I run it on a toy graph with 
fewer than 10 vertices. Do you have any clue what the problem is?

Cheers,

Alieh


129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger 
heartbeat request.
129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger 
heartbeat request.
129150 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat 
request from e80ec35f3d0a04a68000ecbdc555f98b.
129150 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received 
heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received 
new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received 
slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release 
TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle 
timeout.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 
78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.


Re: runtime.resourcemanager

2018-12-07 Thread Piotr Nowojski
Hi,

Please investigate the logs/standard output/error from the task manager that 
has failed (the logs that you showed are from the job manager). There is 
probably some obvious error/exception explaining why it has failed. The most 
common reasons:
- out of memory
- long GC pause
- seg fault or other error from some native library
- task manager killed via, for example, SIGKILL
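
To make the first pass over a log less tedious, here is a small sketch that 
counts lines containing a few failure markers. The marker strings are 
assumptions about what such failures typically leave behind, not an official 
list, and the class name is made up. A process killed with SIGKILL cannot log 
anything itself, so that case has to be checked in the operating system logs 
instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FailureMarkerScan {

    // Assumed markers; adjust them to what your logs actually contain.
    static final List<String> MARKERS = Arrays.asList("OutOfMemoryError", "SIGSEGV", "Exception");

    /** Counts, per marker, how many log lines contain it. */
    static Map<String, Integer> scan(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            counts.put(marker, 0);
        }
        for (String line : lines) {
            for (String marker : MARKERS) {
                if (line.contains(marker)) {
                    counts.merge(marker, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FailureMarkerScan <log file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        scan(lines).forEach((marker, n) -> System.out.println(marker + ": " + n));
    }
}
```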

Piotrek

> On 6 Dec 2018, at 17:34, Alieh  wrote:
> 
> Hello all,
> 
> I have an algorithm x() which contains several joins and uses Gelly 
> ConnectedComponents three times. The problem is that if I call x() inside a 
> script more than three times, I receive the messages listed below in the log 
> and the program somehow stops. It happens even if I run it on a toy graph 
> with fewer than 10 vertices. Do you have any clue what the problem is?
> 
> Cheers,
> 
> Alieh
> 
> 
> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger 
> heartbeat request.
> 129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger 
> heartbeat request.
> 129150 [flink-akka.actor.default-dispatcher-20] DEBUG 
> org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat 
> request from e80ec35f3d0a04a68000ecbdc555f98b.
> 129150 [flink-akka.actor.default-dispatcher-22] DEBUG 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received 
> heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received 
> new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
> 129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received 
> slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release 
> TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle 
> timeout.
> 129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 
> 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.
> 



runtime.resourcemanager

2018-12-06 Thread Alieh

Hello all,

I have an algorithm x() which contains several joins and uses Gelly 
ConnectedComponents three times. The problem is that if I call x() 
inside a script more than three times, I receive the messages listed 
below in the log and the program somehow stops. It happens even if 
I run it on a toy graph with fewer than 10 vertices. Do you have any 
clue what the problem is?


Cheers,

Alieh


129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
Trigger heartbeat request.
129149 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
Trigger heartbeat request.
129150 [flink-akka.actor.default-dispatcher-20] DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat 
request from e80ec35f3d0a04a68000ecbdc555f98b.
129150 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
Received new slot report from TaskManager 
78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG 
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - 
Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - 
Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it 
exceeded the idle timeout.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - 
Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.