[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar
[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408518#comment-17408518 ]

huntercc edited comment on FLINK-23905 at 9/2/21, 3:58 AM:
---

Hi [~trohrmann]. Recently, we tried modifying org.apache.flink.runtime.blob.AbstractBlobCache to change the degree to which BLOBs are shared. More specifically, we allow TaskManagers of the same job on the same machine to share BLOB files, so that the user jar is downloaded only once per machine.

Comparing the current approach, in which multiple tasks within a TM share BLOBs, with ours, we believe the two provide theoretically equivalent resource isolation, especially since there are no constraints on which tasks can be deployed in a TM today. I can find few differences between sharing BLOBs among 10 tasks in the same TM and sharing the files among 10 TMs that each contain a single task. Nevertheless, we would like your help in assessing, from a more expert perspective, whether there are risks we have not considered.
> Reduce the load on JobManager when submitting large-scale job with a big user jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time tasks spend moving from
> the deploying state to the running state accounts for a large share of the
> total startup time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through BlobService. JobManager has to spend a lot of computing
> power distributing the files, which leads to a heavy load in the start-up
> stage. More generally, under high load JobManager fails to respond in time
> to the RPC requests sent by the TaskManager side, causing timeout
> exceptions, such as Akka timeout exceptions, which lead to job restarts and
> further prolong the start-up time of the job.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
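The per-machine sharing described in the comment above can be illustrated with a small, language-neutral sketch. This is not Flink code and not the patch discussed in the issue: `fetch_shared_blob`, the shared directory, and the `download` callback are hypothetical names introduced only for illustration. It assumes a POSIX system (it uses `fcntl.flock`) and shows one way several processes on the same machine could agree that only the first one downloads the jar while the others reuse the file.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import Callable


def fetch_shared_blob(shared_dir: Path, blob_key: str,
                      download: Callable[[Path], None]) -> Path:
    """Return the local path of a blob, downloading it at most once per machine.

    `download` is a hypothetical callback that writes the blob to the given path
    (in a real system this would be the transfer from the blob server).
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / blob_key
    lock_path = shared_dir / (blob_key + ".lock")

    with open(lock_path, "w") as lock_file:
        # Exclusive advisory lock: only one caller at a time gets past this line.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                # Download to a temporary file in the same directory, then
                # rename atomically so readers never see a partial file.
                fd, tmp_name = tempfile.mkstemp(dir=shared_dir,
                                                prefix=blob_key + ".")
                os.close(fd)
                try:
                    download(Path(tmp_name))
                    os.replace(tmp_name, target)
                finally:
                    # Remove the temporary file if the download or rename failed.
                    if os.path.exists(tmp_name):
                        os.unlink(tmp_name)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

The sketch deliberately leaves out the parts where the risks raised in the comment would show up, such as deciding when the shared file may be deleted once the last process using it has exited.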
[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar
[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408518#comment-17408518 ]

huntercc edited comment on FLINK-23905 at 9/2/21, 3:57 AM:
---

Hi [~trohrmann]. Recently, we tried modifying org.apache.flink.runtime.blob.AbstractBlobCache to change the degree to which BLOBs are shared. More specifically, we allow TaskManagers of the same job on the same machine to share BLOB files, so that the user jar is downloaded only once per machine.

Comparing the current approach, in which multiple tasks within a TM share BLOBs, with ours, we believe the two provide theoretically equivalent resource isolation, especially since there are no constraints on which tasks can be deployed in a TM today. I can find few differences between sharing BLOBs among 10 tasks in the same TM and sharing the files among 10 TMs that each contain a single task. Nevertheless, we would like your help in assessing, from a more expert perspective, whether there are risks we have not considered.
[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar
[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402729#comment-17402729 ]

huntercc edited comment on FLINK-23905 at 8/23/21, 9:53 AM:

Thanks for your reply and practical advice, [~trohrmann]. In fact, we have already adopted a similar approach by configuring the yarn.ship-files parameter, which greatly shortens the time spent in this step. However, I'm worried that this approach can introduce dependency conflicts, especially when we use the YARN session mode. I think it would be better if this part of the work could be made transparent to users.