[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086747#comment-16086747 ]

Zhankun Tang commented on YARN-6720:

[~Naganarasimha], sorry for the late reply. Yes. So far, I can only think of a few new attributes for the GPU/FPGA resource handlers to use. Maybe it is fine to define some constants for GPU/FPGA first and improve on that if the hard-coding proves inflexible?

> Support updating FPGA related constraint node label after FPGA device re-configuration
> ---
>
> Key: YARN-6720
> URL: https://issues.apache.org/jira/browse/YARN-6720
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Zhankun Tang
> Attachments: Storing-and-Updating-extra-FPGA-resource-attributes-in-hdfs_v1.pdf
>
> In order to provide globally optimal scheduling for mutable FPGA resources, it
> seems an easy and direct way to utilize constraint node labels (YARN-3409)
> instead of extending the global scheduler (YARN-3926) to match both resource
> count and attributes.
> The rough idea is that the AM sets the constraint node label expression to
> request containers on the nodes whose FPGA devices have the matching IP, and
> then the NM resource handler updates the node constraint label if there is an
> FPGA device re-configuration.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
---
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077585#comment-16077585 ]

Naganarasimha G R commented on YARN-6720:

Thanks [~tangzhankun].
bq. a constraint label update after container finish to indicate docker image has been localized is helpful to improve the scheduling.
This was one of the improvements I had in mind: automatically adding labels to nodes for localized container images. We will develop it once YARN-3409 is in. This is similar to the Docker Swarm functionality.
bq. For instance, GPU handler for all different vendor might need to set a constraint "GPU_DOCKER_IMAGE_LOCALIZED:True/False" to a node? FPGA handler for all vendor might need set "FPGA_IP_NAME:ipname"? If so, is it a burden for end users to search and use these scheduling preference?
IIUC, you are setting labels for "GPU_DOCKER_IMAGE_LOCALIZED:True/False" and/or "FPGA_IP_NAME:ipname", so there are not many constraints (newly named attributes), right? Can you elaborate so that I can better understand the use case?
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077524#comment-16077524 ]

Zhankun Tang commented on YARN-6720:

[~wangda], I think this depends on YARN-3409's constraint label APIs. I agree that for GPU, a constraint label update after a container finishes, indicating that the Docker image has been localized, would help improve scheduling. Our idea of updating the FPGA IP constraint label is the same.
One thing I am unsure about is how to make these constraint labels easy to use. Do we need to define plenty of constant key strings? For instance, might the GPU handlers for all the different vendors need to set a constraint "GPU_DOCKER_IMAGE_LOCALIZED:True/False" on a node? Might the FPGA handlers for all vendors need to set "FPGA_IP_NAME:ipname"? If so, is it a burden for end users to find and use these scheduling preferences?
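The constant-key question in the comment above can be made concrete with a minimal sketch in Java. It assumes a constraint is a plain "KEY:value" string, as in the two examples quoted in the comment. The class `ConstraintLabelSketch` and its methods are hypothetical and are not YARN-3409 or NodeManager APIs; only the two key names come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types or methods are YARN APIs.
final class ConstraintLabelSketch {
    // Constant keys a resource handler might publish (names from the discussion).
    static final String GPU_DOCKER_IMAGE_LOCALIZED = "GPU_DOCKER_IMAGE_LOCALIZED";
    static final String FPGA_IP_NAME = "FPGA_IP_NAME";

    // Constraint labels currently attached to one node, keyed by constant name.
    private final Map<String, String> labels = new HashMap<>();

    // Would be called by a resource handler, e.g. after FPGA re-configuration.
    void updateLabel(String key, String value) {
        labels.put(key, value);
    }

    // True if the node satisfies a "KEY:value" expression as an AM might write it.
    boolean matches(String expression) {
        int sep = expression.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("expected KEY:value, got " + expression);
        }
        String key = expression.substring(0, sep);
        String value = expression.substring(sep + 1);
        // A key that was never set yields null, so the node does not match.
        return value.equals(labels.get(key));
    }
}
```

With this sketch, once a handler calls `updateLabel(FPGA_IP_NAME, "fft")` (the IP name "fft" is made up), `matches("FPGA_IP_NAME:fft")` returns true and an expression naming any other IP returns false. Whether a small fixed set of such constants is enough is exactly the open question in this thread.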
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075443#comment-16075443 ]

Wangda Tan commented on YARN-6720:

[~tangzhankun]/[~zyluo],
bq. YARN-3409 Wouldn't be a blocker since this JIRA is a improvement of YARN-6507.
I'm not sure how to support device meta info in the global (RM) scheduler without YARN-3409, and I couldn't find the answer in the attached design doc. Could you explain the solution you have in mind?
In any case, I'm in favor of a general approach that other features can use, rather than customizing the RM scheduler to support FPGA requirements. GPU support is more sensitive to GPU type than to firmware, but I can see that Docker support could be improved a lot if we could schedule containers to a node that already has the Docker image localized.
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069618#comment-16069618 ]

Zhankun Tang commented on YARN-6720:

[~wangda], maybe that was my fault. Although the procedure for reconfiguring an FPGA device is fast, the download may take a non-trivial amount of time, which should be avoided. That is the key problem this JIRA wants to solve.
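To illustrate the point in the comment above about avoiding the download, here is a rough sketch in Java of a placement choice that prefers a node already configured with the requested IP. Everything in it is an assumption for illustration: `IpReusePlacementSketch`, `pickNode`, and the node-to-IP map are hypothetical, and a real scheduler would obtain this information through the constraint labels of YARN-3409 rather than a plain map.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of IP reuse; not part of any YARN scheduler.
final class IpReusePlacementSketch {
    // Returns a node that already has the requested IP loaded, so that no
    // download or re-configuration is needed. If no node has it, falls back
    // to the first node in the map's iteration order. Empty map: no result.
    static Optional<String> pickNode(Map<String, String> loadedIpByNode, String requestedIp) {
        for (Map.Entry<String, String> entry : loadedIpByNode.entrySet()) {
            if (requestedIp.equals(entry.getValue())) {
                return Optional.of(entry.getKey());
            }
        }
        return loadedIpByNode.keySet().stream().findFirst();
    }
}
```

The fallback branch is the case this JIRA wants to make rarer: the chosen node has to download and load the IP before the container can use the device.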
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066397#comment-16066397 ]

Zhongyue Nah commented on YARN-6720:

[~leftnoteasy] YARN-3409 wouldn't be a blocker, since this JIRA is an improvement on YARN-6507. We intend to move device metadata to the global space so that the scheduler can make more efficient decisions in terms of IP reuse. I assume GPGPUs have similar issues, and we wish to find a common solution across all resource types before we get this through.
[jira] [Commented] (YARN-6720) Support updating FPGA related constraint node label after FPGA device re-configuration
[ https://issues.apache.org/jira/browse/YARN-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065170#comment-16065170 ]

Wangda Tan commented on YARN-6720:

Thanks [~tangzhankun] and [~YuQiang Ye] for this proposal. In general, this approach looks good to me.
This proposal depends on YARN-3409, which needs more time to get done, so I'm not sure whether this is a blocker/critical item for the feature. IIRC, you mentioned offline that downloading FPGA firmware and reconfiguring FPGA devices takes only a few seconds, which means we are generally fine without this improvement.