[jira] [Updated] (FLINK-29344) Make Adaptive Scheduler supports Fine-Grained Resource Management

2023-07-25 Thread Konstantin Knauf (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Knauf updated FLINK-29344:
-
Fix Version/s: (was: 1.18.0)

> Make Adaptive Scheduler supports Fine-Grained Resource Management
> -
>
> Key: FLINK-29344
> URL: https://issues.apache.org/jira/browse/FLINK-29344
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Xintong Song
>Assignee: Chesnay Schepler
>Priority: Major
>
> This ticket is a reflection of the following Slack discussion:
> {quote}
> Donatien Schmitz
> Adaptive Scheduler thread:
> Hey all, it seems like the Adaptive Scheduler does not support fine-grained 
> resource management. I have fixed it and would like to know if you would be 
> interested in a PR or if it was purposely designed not to support fine-grained 
> resource management.
> rmetzger
> @Donatien Schmitz: I’m concerned that we don’t have a lot of review capacity 
> right now, and I’m not aware of any users asking for it.
> rmetzger
> I couldn’t find a ticket for adding this feature, did you find one?
> If not, can you add one? This will at least allow this feature to 
> show up on Google, and people might comment on it if they need it.
> rmetzger
> If the change is fairly self-contained and unlikely to cause instabilities, 
> then we can also consider merging it.
> rmetzger
> @Xintong Song what do you think?
> Xintong Song
> @rmetzger, thanks for involving me.
> @Donatien Schmitz, thanks for bringing this up, and for volunteering to 
> fix this. Could you explain a bit more about how you plan to fix this?
> Fine-grained resource management is not yet supported by the adaptive scheduler, 
> because there’s an issue that we haven’t found a good solution for. Namely, if 
> only part of the resource requirements can be fulfilled, how do we decide 
> which requirements should be fulfilled? E.g., say the job declares it needs 
> 10 slots with resource 1 for map tasks, and another 10 slots with resource 2 
> for reduce tasks. If there aren’t enough resources (say only 10 slots can be 
> allocated, for simplicity), how many slots for map / reduce tasks should be 
> allocated? Obviously, <10 map, 0 reduce> & <0 map, 10 reduce> would not work. 
> For this example, a proportional scale-down (<5 map, 5 reduce>) seems 
> reasonable. However, a proportional scale-down is not always easy (e.g., 
> requirements are <100 map, 1 reduce>), and the issue grows more complicated if 
> you take lots of stages and the differences in slot sizes into consideration.
> I’d like to see the adaptive scheduler also support fine-grained resource 
> management. If there’s a good solution to the above issue, I’d love to help 
> review the effort.
> Donatien Schmitz
> Dear Robert and Xintong, thanks for reading and reacting to my message! I'll 
> reply tomorrow (GMT+1 time) if that's quite alright with you. Best, Donatien 
> Schmitz
> Donatien Schmitz
> @Xintong Song
> * We are working on fine-grained scheduling for resource optimisation of 
> long-running or periodic jobs. One of the features we are experimenting with is a 
> "rescheduling plan", a mapping of operators to Resource Profiles that can be 
> dynamically applied to a running job. This rescheduling would be triggered by 
> policies based on some metrics (with a focus on RocksDB in our case).
> * While developing this new feature, we decided to implement it on the 
> Adaptive Scheduler instead of the Base Scheduler because the state machine 
> it already has made this more natural: transitions from 
> states Executing -> Cancelling -> Rescheduling -> Waiting for Resources -> 
> Creating -> Executing
> * In our case we are working on a POC and thus focusing on a really simple job 
> with a parallelism of 1. The issue you brought up is indeed something we have 
> faced while raising the parallelism of the job.
> * If you create a Jira ticket we can discuss it over there if you'd like!
> Donatien Schmitz
> @rmetzger The changes do not break the default resource management but do 
> not fix the issue raised by Xintong.
> {quote}
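
The proportional scale-down that Xintong describes in the discussion above can be made concrete with a small sketch. This is illustration only, not Flink code and not a proposed solution: the function name `scale_down` and the rule "reserve one slot per task group, then share the rest by largest remainder" are assumptions made here. It also ignores differing slot sizes and multi-stage dependencies, which are exactly the parts the discussion calls hard.

```python
def scale_down(requirements: dict[str, int], available: int) -> dict[str, int]:
    """Split `available` slots across task groups in proportion to their needs.

    Hypothetical helper, not part of Flink. Every group with a non-zero
    requirement keeps at least one slot so that no stage is starved.
    """
    groups = {name: n for name, n in requirements.items() if n > 0}
    total = sum(groups.values())
    if available >= total:
        # Everything fits, nothing to scale down.
        return dict(groups)
    if available < len(groups):
        raise ValueError("fewer slots than task groups; the job cannot run")

    # Reserve one slot per group, then share the spare slots proportionally
    # to what each group still needs beyond its reserved slot.
    spare = available - len(groups)
    extra_total = total - len(groups)  # > 0 here, because available < total

    allocation = {}
    remainders = {}
    for name, n in groups.items():
        quotient, remainder = divmod(spare * (n - 1), extra_total)
        allocation[name] = 1 + quotient
        remainders[name] = remainder

    # Integer division may leave a few slots unassigned; hand them to the
    # groups with the largest remainders (largest-remainder method).
    leftover = available - sum(allocation.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[name] += 1
    return allocation


if __name__ == "__main__":
    print(scale_down({"map": 10, "reduce": 10}, 10))
    print(scale_down({"map": 100, "reduce": 1}, 10))
```

With the numbers from the discussion, 10 available slots for <10 map, 10 reduce> yield <5 map, 5 reduce>, and for <100 map, 1 reduce> they yield <9 map, 1 reduce>. The second case shows why a purely proportional split is not enough on its own: without the reserved slot, the reduce group would round down to zero.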



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29344) Make Adaptive Scheduler supports Fine-Grained Resource Management

2023-07-04 Thread Chesnay Schepler (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler updated FLINK-29344:
-
Fix Version/s: 1.18.0

> Make Adaptive Scheduler supports Fine-Grained Resource Management
> -
>
> Key: FLINK-29344
> URL: https://issues.apache.org/jira/browse/FLINK-29344
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: Xintong Song
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.18.0