[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-10-14 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204695#comment-16204695
 ] 

Stephan Ewen commented on FLINK-5685:
-

Is there any update on this issue, or did increasing the number of file handles 
solve the issue?

We are also updating to a newer akka version for the next release, which may 
have an impact.

> Connection leak in Taskmanager
> --
>
> Key: FLINK-5685
> URL: https://issues.apache.org/jira/browse/FLINK-5685
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Reporter: Andrey
> Priority: Critical
>
> Steps to reproduce:
> * Set up a cluster with the following configuration: 1 job manager, 2 task 
> managers.
> * The job manager starts rejecting connection attempts from the task manager.
> {code}
> 2017-01-30 03:24:42,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager  - Trying to register at JobManager akka.tcp://flink@ip:6123/user/jobmanager (attempt 4326, timeout: 30 seconds)
> 2017-01-30 03:24:42,913 WARN  Remoting  - Tried to associate with unreachable remote address [akka.tcp://flink@ip:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
> {code}
> * The task manager retries many times (it looks like it doesn't close the 
> connection after a failure).
> * The job manager becomes unable to process any messages. In its logs:
> {code}
> 2017-01-30 03:25:12,932 WARN  org.jboss.netty.channel.socket.nio.AbstractNioSelector  - Failed to accept a connection.
> java.io.IOException: Too many open files
>     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>     at org.jboss.netty.channel.socket.nio.NioServerBoss.process(NioServerBoss.java:100)
>     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
>     at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
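
The reporter suspects that each failed registration attempt leaves a connection open. As a rough illustration of the pattern that avoids this, here is a minimal Python sketch of a retry loop that closes its socket on every failure. It is not Flink or Akka code; the function name {{try_register}} and its parameters are hypothetical and chosen only for this example.

```python
import socket
import time


def try_register(host, port, attempts=3, timeout=1.0, delay=0.0):
    """Try to connect up to `attempts` times; return the open socket or None.

    Hypothetical helper for illustration only. Every failed attempt closes
    its socket before retrying, so failures cannot accumulate open file
    descriptors.
    """
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock  # success: the caller owns the socket and must close it
        except OSError:
            sock.close()  # failure: release the descriptor before the next attempt
            time.sleep(delay)
    return None
```

If the {{sock.close()}} call in the failure branch were missing, each retry could hold on to one descriptor until the socket object is garbage-collected, which is the kind of growth the description above hints at.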



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-05-19 Thread Gustavo Anatoly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017385#comment-16017385
 ] 

Gustavo Anatoly commented on FLINK-5685:


bq. I am wondering why akka would not close these connections.

Maybe akka doesn't close these connections because having too many open files 
overloads the process, leaving it unable to close any file, including TCP 
connections. For this reason I suggested configuring {{fs.file-max}}.






[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-05-19 Thread Gustavo Anatoly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017381#comment-16017381
 ] 

Gustavo Anatoly commented on FLINK-5685:


Thanks for the reply. I'll take a look at the issue you shared. But even if 
that issue (FLINK-3347) is a possible cause of these errors, it would still be 
worth checking {{fs.file-max}}. I suggest adjusting the {{fs.file-max}} 
parameter for your specific scenario. You can set it using:
{{# sysctl -w fs.file-max=}}
or by editing {{/etc/sysctl.conf}} and adding {{fs.file-max=}} manually.
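
Before changing anything, it may help to read both limits side by side. On Linux, {{fs.file-max}} is the system-wide ceiling, while {{ulimit -n}} reports the per-process limit ({{RLIMIT_NOFILE}}). The plain "Too many open files" message normally corresponds to the per-process limit being hit. The sketch below reads both values in Python; the helper names are made up for this example, and the {{/proc}} path assumes a Linux host.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) per-process open-file limits.

    The soft value is what `ulimit -n` prints in a shell.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the system-wide fs.file-max value, or None if it cannot be read.

    The default path is Linux-specific; on other systems this returns None.
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("per-process soft/hard limit:", soft, hard)
    print("system-wide fs.file-max:", system_file_max())
```

If the per-process soft limit is much lower than {{fs.file-max}}, raising only {{fs.file-max}} is unlikely to change when a single JVM runs out of descriptors.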







[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-05-19 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017368#comment-16017368
 ] 

Stephan Ewen commented on FLINK-5685:
-

Restarting the actor systems was necessary to prevent TaskManagers from being 
blocked off.

Is an additional TCP connection opened only when an actor system is restarted, 
or also in other cases? To get to 210936 connections, you would need a lot of 
restarts...
I am wondering why akka would not close these connections.
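
One way to see which connections pile up is to group them by TCP state for the JobManager port (6123 in the logs). The following Python sketch parses text in the Linux {{/proc/net/tcp}} format and counts entries per state for a given port. The function name {{count_states}} is hypothetical, and the approach assumes a Linux host with IPv4 connections.

```python
from collections import Counter

# Linux kernel TCP state codes as they appear in the "st" column.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, port):
    """Count connections per TCP state where either endpoint uses `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Entry rows start with an index such as "0:"; skip the header and blanks.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses look like HEXIP:HEXPORT.
        local_port = int(local.rsplit(":", 1)[1], 16)
        remote_port = int(remote.rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            counts[TCP_STATES.get(state.upper(), state)] += 1
    return counts


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(count_states(f.read(), 6123))
```

A large and growing {{CLOSE_WAIT}} count would suggest that the remote side has closed the connection and the local process has not yet closed its end.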






[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-05-19 Thread Andrey (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017318#comment-16017318
 ] 

Andrey commented on FLINK-5685:
---

{code}
#: ulimit -n
1024
{code}

This issue is caused by https://issues.apache.org/jira/browse/FLINK-3347
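
To put the limit of 1024 in context, one can compare it with how many descriptors a process currently holds. This is a minimal Python sketch for Linux that counts the entries under {{/proc/<pid>/fd}}. The helper names are invented for this example, and reading another process's directory usually requires the same user or root.

```python
import os


def open_fd_count(pid="self"):
    """Return the number of open file descriptors of `pid`, or None.

    Counts entries in /proc/<pid>/fd, which exists only on Linux; returns
    None when the directory is missing or not readable.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return None


def fd_usage(open_count, soft_limit):
    """Return the fraction of the soft limit in use (for example 0.5 for 512 of 1024)."""
    return open_count / soft_limit


if __name__ == "__main__":
    count = open_fd_count()
    if count is not None:
        print(f"{count} descriptors open, {fd_usage(count, 1024):.1%} of a 1024 limit")
```

Sampling this value for the JobManager and TaskManager processes over time would show whether the descriptor count grows steadily, as a leak would, or stays flat.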






[jira] [Commented] (FLINK-5685) Connection leak in Taskmanager

2017-05-18 Thread Gustavo Anatoly (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016260#comment-16016260
 ] 

Gustavo Anatoly commented on FLINK-5685:


Hi [~dernasherbrezon],

Are you working on this issue? 
Could you please provide your max open files setting? (You can use 
{{ulimit -n}} to check the value.)





