[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Piotr Nowojski updated FLINK-22643:
---
    Attachment: Screenshot 2022-07-11 at 16.55.02.png

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Zhilong Hong
>            Assignee: fanrui
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: Screenshot 2022-07-11 at 16.49.42.png,
>                      Screenshot 2022-07-11 at 16.55.02.png
>
>
> For large-scale jobs, there will be too many TCP connections among
> TaskManagers. Let's take an example.
> For a streaming job with 20 JobVertices, each JobVertex has a parallelism
> of 500. We divide the vertices into 5 slot sharing groups. Each TaskManager
> has 5 slots. Thus there will be 400 TaskManagers in this job. Let's assume
> that the job runs on a cluster with 20 machines.
> If all the job edges are all-to-all edges, there will be
> 19 * 20 * 399 * 2 = 303,240 TCP connections for each machine. If we run
> several jobs on this cluster, the number of TCP connections may exceed the
> maximum limit of Linux, which is 1,048,576. This will stop the TaskManagers
> from creating new TCP connections and cause task failovers.
> As we run our production jobs on a K8S cluster, the jobs always fail over
> due to network-related exceptions, such as
> {{Sending the partition request to 'null' failed}}, etc.
> We think that we can decrease the number of connections by letting tasks
> reuse the same connection. We implemented a POC that makes all tasks on the
> same TaskManager reuse one TCP connection. For the example job mentioned
> above, the number of connections will decrease from 303,240 to 15,960.
> With the POC, the frequency of network-related exceptions in our
> production jobs drops significantly.
> The POC is illustrated in:
> https://github.com/wsry/flink/commit/bf1c09e80450f40d018a1d1d4fe3dfd2de777fdc

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
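The connection counts quoted in the description can be checked with a short calculation. The sketch below is not taken from the issue or the POC. It assumes one reading of the factors in 19 * 20 * 399 * 2: 19 job edges between the 20 JobVertices, 20 TaskManagers per machine (400 TaskManagers on 20 machines), 399 peer TaskManagers, and 2 directions. The function and parameter names are hypothetical.

```python
# Sanity check of the connection counts quoted in the issue description.
# The meaning assigned to each factor is an assumption; the issue only
# gives the product 19 * 20 * 399 * 2.

def connections_per_machine(job_edges, tms_per_machine, total_tms,
                            reuse_per_tm_pair=False):
    """Estimate TCP connections per machine for all-to-all job edges."""
    peers = total_tms - 1   # every other TaskManager in the job
    directions = 2          # assumed: one connection per direction
    # With reuse, a pair of TaskManagers shares one connection per
    # direction regardless of how many job edges run between them.
    edge_factor = 1 if reuse_per_tm_pair else job_edges
    return edge_factor * tms_per_machine * peers * directions


job_vertices = 20
total_tms = 400
machines = 20

job_edges = job_vertices - 1              # 19, assuming a linear chain
tms_per_machine = total_tms // machines   # 20

before = connections_per_machine(job_edges, tms_per_machine, total_tms)
after = connections_per_machine(job_edges, tms_per_machine, total_tms,
                                reuse_per_tm_pair=True)

print(f"without reuse: {before:,}")   # 303,240
print(f"with reuse:    {after:,}")    # 15,960
```

Under these assumptions the reduction is exactly the number of job edges (a factor of 19), which matches the figures in the description.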
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Piotr Nowojski updated FLINK-22643:
---
    Attachment: Screenshot 2022-07-11 at 16.49.42.png

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Zhilong Hong
>            Assignee: fanrui
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: Screenshot 2022-07-11 at 16.49.42.png
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yingjie Cao updated FLINK-22643:
    Fix Version/s: 1.15.0

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Zhilong Hong
>            Assignee: fanrui
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available
>             Fix For: 1.15.0
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-22643:
---
    Labels: auto-deprioritized-major pull-request-available  (was: auto-deprioritized-major)

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Zhilong Hong
>            Assignee: fanrui
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhilong Hong updated FLINK-22643:
---
    Affects Version/s: 1.14.0

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.13.0, 1.14.0
>            Reporter: Zhilong Hong
>            Priority: Minor
>              Labels: auto-deprioritized-major
>             Fix For: 1.14.0
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhilong Hong updated FLINK-22643:
---
    Affects Version/s:     (was: 1.13.0)
                       1.13.2

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.13.2
>            Reporter: Zhilong Hong
>            Priority: Minor
>              Labels: auto-deprioritized-major
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhilong Hong updated FLINK-22643:
---
    Fix Version/s:     (was: 1.14.0)

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.13.0, 1.14.0
>            Reporter: Zhilong Hong
>            Priority: Minor
>              Labels: auto-deprioritized-major
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-22643:
---
      Labels: auto-deprioritized-major  (was: stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 days ago and has not received any updates so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.13.0
>            Reporter: Zhilong Hong
>            Priority: Minor
>              Labels: auto-deprioritized-major
>             Fix For: 1.14.0
[jira] [Updated] (FLINK-22643) Too many TCP connections among TaskManagers for large scale jobs
     [ https://issues.apache.org/jira/browse/FLINK-22643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-22643:
---
    Labels: stale-major  (was: )

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Major but is unassigned and neither itself nor its Sub-Tasks have been updated for 30 days. I have gone ahead and added a "stale-major" label to the issue. If this ticket is a Major, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Too many TCP connections among TaskManagers for large scale jobs
>
>                 Key: FLINK-22643
>                 URL: https://issues.apache.org/jira/browse/FLINK-22643
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.13.0
>            Reporter: Zhilong Hong
>            Priority: Major
>              Labels: stale-major
>             Fix For: 1.14.0