[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

2021-09-01 Thread huntercc (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408518#comment-17408518
 ] 

huntercc edited comment on FLINK-23905 at 9/2/21, 3:58 AM:
---

Hi [~trohrmann]. Recently, we tried modifying
org.apache.flink.runtime.blob.AbstractBlobCache to change the degree to which
BLOBs are shared. More specifically, we allow TaskManagers of the same job on
a machine to share blob files, so that the user jar is downloaded only once
per machine. Comparing the current approach, in which multiple tasks within a
TM share BLOBs, with ours, we believe the two provide theoretically equivalent
resource isolation, especially since there are no constraints on which tasks
can be deployed in a given TM. I see little difference between sharing BLOBs
among 10 tasks in the same TM and sharing the files among 10 TMs that each
contain a single task. Nevertheless, we would appreciate your more expert
assessment of whether there are risks that we have not considered.
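
For illustration, the per-machine sharing described above could look roughly
like the following. This is a minimal Python sketch of the idea only, not the
actual change to AbstractBlobCache (which is Java). The function name
fetch_shared_blob, the directory layout (cache_dir/job_id/blob_key), and the
lock-file protocol are all assumptions made up for this example.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def fetch_shared_blob(
    cache_dir: Path,
    job_id: str,
    blob_key: str,
    download: Callable[[Path], None],
    poll_interval: float = 0.05,
    timeout: float = 30.0,
) -> Path:
    """Return the local path of a blob, downloading it at most once per machine.

    `download` is a hypothetical callback that writes the blob to the given
    path (for example, by pulling it from the JobManager).
    """
    job_dir = cache_dir / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    target = job_dir / blob_key
    lock = job_dir / (blob_key + ".lock")

    deadline = time.monotonic() + timeout
    while not target.exists():
        try:
            # O_EXCL makes creation atomic, so exactly one process wins.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Another process on this machine is downloading; wait for it.
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {target}")
            time.sleep(poll_interval)
            continue
        try:
            os.close(fd)
            if not target.exists():
                # Write to a temporary file, then rename atomically so that
                # other processes never see a partially written jar.
                tmp_fd, tmp_name = tempfile.mkstemp(dir=job_dir, suffix=".part")
                os.close(tmp_fd)
                try:
                    download(Path(tmp_name))
                    os.replace(tmp_name, target)
                except BaseException:
                    # Do not leave a partial download behind on failure.
                    if os.path.exists(tmp_name):
                        os.unlink(tmp_name)
                    raise
        finally:
            os.unlink(lock)
    return target
```

One risk this sketch makes visible: if the process holding the lock file
crashes, the lock is never removed and the other processes wait until they
time out, so a real implementation would need some form of stale-lock
handling and cleanup once the last TaskManager of the job exits.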


> Reduce the load on JobManager when submitting large-scale job with a big user 
> jar
> -
>
> Key: FLINK-23905
> URL: https://issues.apache.org/jira/browse/FLINK-23905
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time spent moving tasks from
> deploying to running accounts for a high proportion of the total startup
> time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through the BlobService. The JobManager has to spend a lot of
> computing power distributing the files, which leads to a heavy load during
> startup. Moreover, under this load the JobManager can fail to respond in
> time to RPC requests sent by the TaskManagers, causing timeout exceptions
> such as Akka timeouts, which lead to job restarts and further prolong the
> startup time of the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

2021-09-01 Thread huntercc (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408518#comment-17408518
 ] 

huntercc edited comment on FLINK-23905 at 9/2/21, 3:57 AM:
---

Hi [~trohrmann]. Recently, we tried modifying
org.apache.flink.runtime.blob.AbstractBlobCache to change the degree to which
BLOBs are shared. More specifically, we allow TaskManagers of the same job on
a machine to share blob files, so that the user jar is downloaded only once
per machine. Comparing the current approach, in which multiple tasks within a
TM share BLOBs, with ours, we believe the two provide theoretically equivalent
resource isolation, especially since there are no constraints on which tasks
can be deployed in a given TM. I see little difference between sharing BLOBs
among 10 tasks in the same TM and sharing the files among 10 TMs that each
contain a single task. Nevertheless, we would appreciate your more expert
assessment of whether there are risks that we have not considered.


was (Author: huntercc):
Recently, we tried to transform org.apache.flink.runtime.blob.AbstractBlobCache 
by changing the sharing degree of BLOBs. More specifically, we allow 
TaskManagers from the same job on a machine to share blob files so that only 
one user jar is downloaded for a machine. Comparing the current method of 
sharing multiple tasks within a TM with ours, we believe that the two kinds of 
resource isolation are equivalent theoretically, especially there is no 
constraints on which tasks can be deployed in the current TM. I can find few 
differences between sharing BLOBs among 10 tasks in the same TM and sharing the 
files among 10 TM containing single task. Nevertheless, we would like you to 
help assess whether there are risks that we have not considered from a more 
professional perspective.



[jira] [Comment Edited] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

2021-08-23 Thread huntercc (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402729#comment-17402729
 ] 

huntercc edited comment on FLINK-23905 at 8/23/21, 9:53 AM:


Thanks for your reply and practical advice, [~trohrmann]. In fact, we have
adopted a similar approach by configuring the yarn.ship-files parameter, which
greatly shortens the time spent in this step. However, I'm worried that this
approach can introduce dependency conflicts, especially in YARN session mode.
I think it would be better if this part of the work were transparent to
users.

