[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607445#comment-16607445 ]

Jonathan Hung commented on YARN-8200:
-------------------------------------

Build https://builds.apache.org/view/H-L/view/Hadoop/job/PreCommit-YARN-Build/21779 timed out:
{noformat}
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-common.txt 2>&1
Elapsed: 2m 40s
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt 2>&1
Elapsed: 15m 20s
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt 2>&1
Elapsed: 4m 49s
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt 2>&1
Elapsed: 79m 41s
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt 2>&1
Elapsed: 3m 59s
cd /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client
/opt/maven/bin/mvn --batch-mode -Dmaven.repo.local=/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/yetus-m2/hadoop-branch-2-patch-0 -Ptest-patch -Pparallel-tests -Pshelltest -Pnative -Drequire.fuse -Drequire.openssl -Drequire.snappy -Drequire.valgrind -Drequire.test.libhadoop -Pyarn-ui clean test -fae > /testptch/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt 2>&1
Build timed out (after 500 minutes). Marking the build as aborted.
Build was aborted
Performing Post build task...
Match found for :. : True
Logical operation result is TRUE
Running script : #!/bin/bash
{noformat}
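To put the timeout in perspective, the per-module "Elapsed" times in the build log above can be added up. The following is only a rough sketch: the module labels and the `elapsed_seconds` helper are illustrative names, and the excerpt may not cover every module the build ran. It shows that the modules with an Elapsed line account for well under the 500-minute limit.

```python
import re

# Elapsed times copied from the build log excerpt above, in order.
# The module labels are added here for readability only.
log_excerpt = """
hadoop-yarn-server-common Elapsed: 2m 40s
hadoop-yarn-server-nodemanager Elapsed: 15m 20s
hadoop-yarn-server-applicationhistoryservice Elapsed: 4m 49s
hadoop-yarn-server-resourcemanager Elapsed: 79m 41s
hadoop-yarn-server-tests Elapsed: 3m 59s
"""


def elapsed_seconds(text):
    """Return the duration in seconds of every 'Elapsed: Xm Ys' entry in text."""
    return [int(m) * 60 + int(s)
            for m, s in re.findall(r"Elapsed:\s*(\d+)m\s*(\d+)s", text)]


durations = elapsed_seconds(log_excerpt)
total_seconds = sum(durations)
timeout_seconds = 500 * 60  # "Build timed out (after 500 minutes)"

print(f"Completed modules: {total_seconds // 60}m {total_seconds % 60}s "
      f"of a {timeout_seconds // 60}m limit")
```

The hadoop-yarn-client run has no Elapsed line in the excerpt, which is consistent with (but does not by itself prove) that module being the one still running when the build was aborted.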
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606091#comment-16606091 ]

Jonathan Hung commented on YARN-8200:
-------------------------------------

Rebased YARN-8200 on branch-2. Attached the full diff between branch-2 and YARN-8200 (001).

> Backport resource types/GPU features to branch-2
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Priority: Major
> Attachments: YARN-8200-branch-2.001.patch,
> counter.scheduler.operation.allocate.csv.defaultResources,
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
> Currently we have a need for GPU scheduling on our YARN clusters to support
> deep learning workloads. However, our main production clusters are running
> older versions of branch-2 (2.7 in our case). To prevent supporting too many
> very different hadoop versions across multiple clusters, we would like to
> backport the resource types/resource profiles feature to branch-2, as well as
> the GPU specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth.
> We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature
> branch and then merge to branch-2 when ready.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603634#comment-16603634 ]

Konstantin Shvachko commented on YARN-8200:
-------------------------------------------

I was trying to build the YARN-8200 branch with this build:
https://builds.apache.org/view/H-L/view/Hadoop/job/hadoop-qbt-branch2-java7-linux-x86-jhung/8/console
It is failing in a way similar to HADOOP-15644. I think the YARN-8200 branch needs to be rebased onto the latest branch-2.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566119#comment-16566119 ]

Wangda Tan commented on YARN-8200:
----------------------------------

[~jhung], thanks for sharing the results. Overall the numbers look good.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554949#comment-16554949 ]

Jonathan Hung commented on YARN-8200:
-------------------------------------

Uploaded scheduler allocation counters for default resources (mem/cpu) and gpu resources. Also uploaded the synth_sls.json configuration used for generating the synth trace (4k nodes, 20k jobs).

The SLS simulation using default resources took 2hr 10min; with gpu resources it took 2hr 25min. In the gpu SLS simulation we hardcoded each mapper and reducer to request 1 gpu.
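For readers comparing the two SLS runs, the difference in wall-clock time can be expressed as a relative overhead. This is a small sketch using only the two durations quoted in the comment above; the helper name `to_minutes` is made up for illustration, and a single pair of runs says nothing about run-to-run variance.

```python
def to_minutes(hours, minutes):
    """Convert an 'Xhr Ymin' duration to whole minutes."""
    return hours * 60 + minutes


default_run = to_minutes(2, 10)  # SLS run with default resources (mem/cpu)
gpu_run = to_minutes(2, 25)      # SLS run with each mapper/reducer requesting 1 gpu

extra_minutes = gpu_run - default_run
overhead_pct = 100.0 * extra_minutes / default_run

print(f"gpu run took {extra_minutes} min longer ({overhead_pct:.1f}% over the default run)")
```

With the figures above this works out to 15 extra minutes on a 130-minute baseline, roughly 11.5%.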
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537738#comment-16537738 ] Jonathan Hung commented on YARN-8200: - Perf unit test: 2 resources: {noformat} 2 14652.015 4 20876.826 6 29455.08 8 45045.047 10 37735.848 12 40816.33 14 46403.71 16 47169.812 18 49261.082 20 48543.688 22 49140.05 24 47393.363 26 48899.754 28 48899.754 30 49751.242 32 50125.312 34 46296.297 36 48780.49 38 47961.63 40 47732.695 42 47732.695 44 48076.92 46 49019.61 48 46728.973 50 42643.92 52 46296.297 54 48426.15 56 49504.95 58 47846.89 60 48543.688 62 47393.363 64 48899.754 66 48661.8 68 49140.05 70 49019.61 72 48780.49 74 48899.754 76 49382.715 78 47393.363 80 48076.92 82 48192.77 84 47732.695 86 50125.312 88 48899.754 90 49019.61 92 48076.92 94 48192.77 96 48076.92 98 42553.19 100 47846.89 102 47846.89 104 48780.49 106 47961.63 108 49140.05 110 47169.812 112 47846.89 114 47619.047 116 47619.047 118 49875.312 120 47619.047 122 47393.363 124 47505.938 126 48899.754 128 48780.49 130 46189.375 132 47505.938 134 45871.56 136 47619.047 138 48543.688 140 47619.047 142 48076.92 144 48076.92 146 47732.695 148 47281.324 150 48543.688 152 48661.8 154 47393.363 156 48543.688 158 47961.63 160 46296.297 162 47846.89 164 47846.89 166 48543.688 168 47505.938 170 47281.324 172 48309.18 174 48309.18 176 5.0 178 47505.938 180 48192.77 182 48192.77 184 48309.18 186 48543.688 188 48661.8 190 48192.77 192 47846.89 194 42105.26 196 48899.754 198 47961.63 #ResourceTypes = 2. 
Avg of fastest 20: 49382.715 2018-06-26 17:12:59,756 ERROR [Thread[Thread-11,5,main]] delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(696)) - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted{noformat} 3 resources: {noformat} 2 10964.912 4 15760.441 6 26990.553 8 24752.475 10 32733.225 12 30487.805 14 27397.26 16 35778.176 18 33112.582 20 34843.207 22 31347.963 24 37383.176 26 34482.758 28 39062.5 30 38095.24 32 35842.293 34 32154.342 36 39447.73 38 37878.79 40 38240.918 42 36101.082 44 38167.938 46 38834.953 48 38022.812 50 38610.04 52 37105.75 54 38610.04 56 39215.688 58 38022.812 60 39215.688 62 37950.664 64 39138.94 66 37735.848 68 38684.72 70 38986.355 72 37735.848 74 37243.95 76 38535.645 78 37807.184 80 38314.176 82 36900.367 84 38610.04 86 39370.08 88 38314.176 90 39525.69 92 38461.54 94 39761.43 96 39370.08 98 38910.504 100 38022.812 102 39138.94 104 38314.176 106 39292.73 108 39292.73 110 39370.08 112 39292.73 114 38314.176 116 39840.637 118 39062.5 120 39370.08 122 37950.664 124 39062.5 126 37664.785 128 38684.72 130 38986.355 132 39525.69 134 40322.582 136 39292.73 138 37664.785 140 39525.69 142 39138.94 144 39370.08 146 39840.637 148 37037.035 150 38387.715 152 39525.69 154 37523.453 156 39603.96 158 36764.707 160 32362.459 162 29542.098 164 31250.0 166 29112.082 168 32000.0 170 27662.518 172 27100.271 174 26845.637 176 33388.98 178 35714.285 180 31152.648 182 36832.414 184 35650.625 186 38461.54 188 34662.047 190 31104.2 192 32573.29 194 36900.367 196 26702.27 198 30211.48 #ResourceTypes = 3. 
Avg of fastest 20: 39525.69 2018-06-26 17:16:14,530 ERROR [Thread[Thread-11,5,main]] delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(696)) - ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted {noformat} 4 resources: {noformat} 2 13166.557 4 21299.254 6 33222.59 8 37174.723 10 32786.887 12 33955.855 14 38095.24 16 37243.95 18 38167.938 20 37807.184 22 36900.367 24 39370.08 26 36563.07 28 38240.918 30 38759.69 32 39370.08 34 35523.98 36 39370.08 38 38610.04 40 38759.69 42 39603.96 44 37878.79 46 38910.504 48 38684.72 50 39682.54 52 38461.54 54 38535.645 56 37105.75 58 38910.504 60 38095.24 62 38684.72 64 38910.504 66 39138.94 68 39292.73 70 38095.24 72 39215.688 74 39447.73 76 39447.73 78 4.0 80 38759.69 82 38910.504 84 39603.96 86 38834.953 88
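The "Avg of fastest 20" summary lines reported above for 2 and 3 resource types can be compared directly. The sketch below assumes, since the thread does not state the unit, that these are throughput-like figures where higher is faster. The 4-resource figure is not included because the output above is cut off before its summary line.

```python
# "Avg of fastest 20" values copied from the perf unit test output above,
# keyed by the number of resource types. The unit is not stated in the thread;
# higher-is-faster is an assumption.
avg_fastest_20 = {2: 49382.715, 3: 39525.69}

baseline = avg_fastest_20[2]

# Fractional drop relative to the 2-resource-type run.
relative_drop = {n: 1.0 - value / baseline for n, value in avg_fastest_20.items()}

for n in sorted(relative_drop):
    print(f"{n} resource types: {avg_fastest_20[n]:.3f} "
          f"({relative_drop[n] * 100:.1f}% below the 2-resource baseline)")
```

Under that assumption, going from 2 to 3 resource types corresponds to a drop of about 20% in this test.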
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469835#comment-16469835 ]

Sunil G commented on YARN-8200:
-------------------------------

Thanks [~jhung]. Could you also please share the SLS test comparison between branch-2 and the feature branch, as well as the perf UT case comparison?
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469193#comment-16469193 ]

Jonathan Hung commented on YARN-8200:
-------------------------------------

FYI we just updated YARN-8200 with the latest set of backports (thanks [~zhz] for the help). [~sunilg] / [~leftnoteasy] / [~templedf], please let us know if we are missing anything major. Thanks!
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461622#comment-16461622 ]

Konstantin Shvachko commented on YARN-8200:
-------------------------------------------

Hey guys, I rebased branch YARN-8200 onto branch-2 and pushed [~jhung]'s commits into it. Please take a look. Testing is in progress, as I hear.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459363#comment-16459363 ]

Konstantin Shvachko commented on YARN-8200:
-------------------------------------------

[~sunilg], thanks for the hints on the benchmarks. Also I agree we should branch off of branch-2 rather than 2.9. Will re-branch.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453407#comment-16453407 ]

Sunil G commented on YARN-8200:
-------------------------------

Hi [~shv] and [~jhung],

We used SLS to benchmark trunk first and then used the same test bench to verify the branch.
# We used 4k-node, 8k-node, and 20k-node slsnodes.json files to verify the above-mentioned case.
# {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf}} is the perf test UT, which covers multiple resource types.

One suggestion from me: wouldn't it be better to cut the branch from branch-2 rather than 2.9?
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453121#comment-16453121 ]

Konstantin Shvachko commented on YARN-8200:
-------------------------------------------

Cut a branch for this jira out of branch-2.9. [~jhung], could you please merge your backports there?

[~sunilg], [~templedf], could you please advise on the tools for measuring the performance impact on the Capacity Scheduler and Resource Manager?
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451385#comment-16451385 ]

Wangda Tan commented on YARN-8200:
----------------------------------

+1 to having a branch for this, so we can more easily track which patches got backported.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451384#comment-16451384 ]

Konstantin Shvachko commented on YARN-8200:
-------------------------------------------

What do people think about creating a branch so that Jonathan can apply his backporting work? That way we can make this discussion more concrete. Also, you guys will be able to try it and see if it fits your requirements.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450136#comment-16450136 ]

Daniel Templeton commented on YARN-8200:
----------------------------------------

The challenge in backporting resource types into 3.0 was mostly just in splitting resource types from resource profiles. Otherwise it wasn't bad. But that was a pull from 3.x into 3.0. Going back into 2.x will be much trickier. The code that resource types touches is code that you want to handle very, very carefully because it's at the core of what the resource manager does.

I don't think it's a good idea to pull resource types back into 2.x. Resource types represent a major change to the way the resource manager functions. It's not something that's appropriate for a minor release. In fact, I would argue that resource types is one of the scarier changes in 3.0, so if you're willing to take on that risk in 2.x, you're probably better served just moving to 3.0.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449312#comment-16449312 ]

Wangda Tan commented on YARN-8200:
----------------------------------

[~chris.douglas], I think [~sunilg] has already pointed out that the multiple resource type backport could be very tricky. IIRC, [~templedf] spent a lot of time last year backporting from trunk to branch-3.0, and several issues were caused by the backport. And now it has diverged more: we have about 5+ months of additional changes in trunk, including many scheduler-related changes.

[~shv], I understand you want a bridge release. I'm still +1 to having a 2.x bridge release and backporting GPU-related changes to branch-2. But it might be worthwhile to look at the 3.x release and fix migration issues so all users who want to migrate to 3.x can benefit from such efforts.

Just my $0.02.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449284#comment-16449284 ]
Sunil G commented on YARN-8200:
Hi [~jhung], YARN-3926 has two parts: support for multiple resource types, and resource profiles. Because resource profiles are complex, Hadoop 3.0 still contains only resource types, which is one part of YARN-3926. Moreover, there is a chance that CS performance could be impacted. For 3.0 and 3.1, there were many changes to the core scheduler compared to branch-2. That work was easier because all resource types development happened on trunk-based code, so we could cross-check the impacted areas well. In branch-2, the chance of a performance impact may be higher, since the code difference between 3.0 and branch-2.10 is quite large.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449023#comment-16449023 ]
Chris Douglas commented on YARN-8200:
bq. I would suggest to try use 3.x instead back porting this to 2.x so everybody is on the same codebase and improvement it. To me, the effort of backporting YARN-3926 + YARN-6223 will be comparable to upgrading a 3.x release and fixing (incompatible) issues
From [~jhung]'s analysis, the backports were relatively straightforward (mostly new code). Keeping it in sync with fixes/improvements in 3.x will require ongoing maintenance, which is unfortunate. Are there specific areas where you suspect the backport could become difficult to maintain?
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449000#comment-16449000 ]
Konstantin Shvachko commented on YARN-8200:
Hey [~leftnoteasy], we discussed it in [this thread|https://lists.apache.org/thread.html/6e200891756aefbfd8b36cd1d9f22f99626284b656671ab719ee1496@%3Chdfs-dev.hadoop.apache.org%3E] some time ago. Clearly we want everybody on the same code base, but it's a challenge to get there. So the thread proposed building a bridge release to help cross over to 3.
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448943#comment-16448943 ]
Wangda Tan commented on YARN-8200:
[~jhung], I would suggest trying to use 3.x instead of backporting this to 2.x, so that everybody is on the same codebase and can improve it. To me, the effort of backporting YARN-3926 + YARN-6223 will be comparable to upgrading to a 3.x release and fixing (incompatible) issues. Both of the features are more than 0.5 MB and change many files. I'm fine with backporting this to branch-2, but the backporting itself could be very tricky.