[jira] [Comment Edited] (YARN-9854) RM jetty hang due to WebAppProxyServlet lacks of timeout while doing proxyLink
[ https://issues.apache.org/jira/browse/YARN-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526243#comment-17526243 ] Song Jiacheng edited comment on YARN-9854 at 4/22/22 7:20 AM:
--
Thanks for your patch, [~suxingfate]. We encountered this problem today, and the situation was exactly the same as yours. I think this patch will be helpful. Since no one seems to be working on this issue, I will apply the patch to our version. Thanks for the patch again!

> RM jetty hang due to WebAppProxyServlet lacks of timeout while doing proxyLink
> --
>
> Key: YARN-9854
> URL: https://issues.apache.org/jira/browse/YARN-9854
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: amrmproxy, resourcemanager, webapp
> Reporter: Wang, Xinglong
> Assignee: Wang, Xinglong
> Priority: Major
> Attachments: YARN-9854.001.patch
>
> The RM proxies requests for [http://rm:port/proxy/application_x] to the AM or to the related history server. Recently we hit https://issues.apache.org/jira/browse/SPARK-26961, which can cause a Spark AM to hang forever. We also have a monitoring tool that accesses [http://rm:port/proxy/application_x] periodically, so every proxied connection to the hung Spark AM also hung forever, because WebAppProxyServlet sets no socket timeout when it initializes the httpclient it uses to reach the AM.
>
> The jetty server hosting the RM servlets has a limited thread pool. Each such request ties up one thread waiting for the Spark AM to respond; eventually every jetty thread serving HTTP traffic hangs, and no RM web links respond.
>
> Giving the httpclient a timeout configuration avoids this issue.
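As an illustration of the kind of fix described above (this is a sketch using the Apache HttpClient 4.x API, not the actual YARN-9854.001.patch, and the 60-second values are placeholder assumptions):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyClientSketch {
    // Build a client whose requests fail fast instead of hanging forever
    // when the proxied AM accepts the connection but never responds.
    static CloseableHttpClient buildClient() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(60_000)  // TCP connect timeout (ms)
                .setSocketTimeout(60_000)   // read/SO_TIMEOUT between packets (ms)
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}
```

With both timeouts set, a hung AM causes the proxied request to fail with a timeout exception, releasing the jetty worker thread instead of pinning it indefinitely.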
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868:
--
Description:
In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. That method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the parameter "maxRunnableApps":

{code:java}
updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size());
{code}

updateAppsRunnability is:

{code:java}
private void updateAppsRunnability(List<List<FSAppAttempt>> appsNowMaybeRunnable,
    int maxRunnableApps) {
  // Scan through and check whether this means that any apps are now runnable
  Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
      appsNowMaybeRunnable);
  FSAppAttempt prev = null;
  List<FSAppAttempt> noLongerPendingApps = new ArrayList<>();
  while (iter.hasNext()) {
    FSAppAttempt next = iter.next();
    if (next == prev) {
      continue;
    }
    if (canAppBeRunnable(next.getQueue(), next)) {
      trackRunnableApp(next);
      FSAppAttempt appSched = next;
      next.getQueue().addApp(appSched, true);
      noLongerPendingApps.add(appSched);
      if (noLongerPendingApps.size() >= maxRunnableApps) {
        break;
      }
    }
    prev = next;
  }
  ...
{code}

maxRunnableApps is meant to be the number of apps that can become runnable because of the removal of the previous attempt, and the method uses it to break out of the loop. However, appsNowMaybeRunnable is actually a list of lists, so its size() is the number of queues, not the number of apps, which makes the break condition wrong. I think it should be changed to 1, since only one attempt finished or was moved.

> FairScheduler: updateAppsRunnability never break
> --
>
> Key: YARN-10868
> URL: https://issues.apache.org/jira/browse/YARN-10868
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1
> Reporter: Song Jiacheng
> Priority: Major
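To see why the break condition misfires, here is a minimal self-contained sketch (a hypothetical demo class, with String stand-ins for FSAppAttempt) showing that size() on the outer list counts queues, not app attempts:

```java
import java.util.Arrays;
import java.util.List;

public class MaxRunnableAppsDemo {
    // Returns the value the current code passes as maxRunnableApps.
    static int buggyBound(List<List<String>> appsNowMaybeRunnable) {
        return appsNowMaybeRunnable.size(); // number of inner lists (queues)
    }

    public static void main(String[] args) {
        // One inner list per queue, as MultiListStartTimeIterator expects.
        List<List<String>> appsNowMaybeRunnable = Arrays.asList(
                Arrays.asList("attempt_1", "attempt_2", "attempt_3"), // queue A
                Arrays.asList("attempt_4"));                          // queue B
        int totalApps = appsNowMaybeRunnable.stream()
                .mapToInt(List::size).sum();
        System.out.println("queues = " + buggyBound(appsNowMaybeRunnable)
                + ", app attempts = " + totalApps);
        // prints: queues = 2, app attempts = 4
    }
}
```

Because the loop counts individual newly runnable attempts against this queue count, the bound does not correspond to the number of attempts freed by the removal, which is the mismatch the report describes.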
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, this method use this parameter to break from the loop. However, nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. I think this need to be changed to 1, cause we only get one attempt finished or moved. was: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. 
This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, this method use this parameter to break from the loop. However, nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. I think this need to be changed to 1. > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing a app attempt will call > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some > non-runnable apps and make them not pending. 
This method will call > updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is below: > {code:java} > private void updateAppsRunnability(List> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List noLongerPendingApps = new ArrayList(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; >
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, this method use this parameter to break from the loop. However, nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. I think this need to be changed to 1. was: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. 
This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, this method use this parameter to break from the loop. However, nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing a app attempt will call > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some > non-runnable apps and make them not pending. 
This method will call > updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is below: > {code:java} > private void updateAppsRunnability(List> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List noLongerPendingApps = new ArrayList(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, this method use this parameter to break from the loop. However, nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. was: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. 
This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, but nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing a app attempt will call > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some > non-runnable apps and make them not pending. 
This method will call > updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is below: > {code:java} > private void updateAppsRunnability(List> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List noLongerPendingApps = new ArrayList(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if (noLongerPendingApps.size() >= maxRunnableApps) { > break; > } > } > prev =
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, but nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. was: In FairScheduler, removing a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. 
This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps",as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, but nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing a app attempt will call > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some > non-runnable apps and make them not pending. 
This method will call > updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is below: > {code:java} > private void updateAppsRunnability(List> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List noLongerPendingApps = new ArrayList(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if (noLongerPendingApps.size() >= maxRunnableApps) { > break; > } > } > prev = next; > } > ... > {code} > maxRunnableApps is the number of
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, remove a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps",as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is the number of apps which can be runnable because of the removal of previous attempts, but nowMaybeRunnable actually is a list of lists, and the size of nowMaybeRunnable is actually a size of queues, so this is a bug. was: In FairScheduler, remove a app attempt will call MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable apps and make them not pending. 
This method will call updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps",as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is below: {code:java} private void updateAppsRunnability(List> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List noLongerPendingApps = new ArrayList(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} appsNowMaybeRunnable actually is a list of lists, the size of this > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, remove a app attempt will call > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some > non-runnable apps and make them not pending. 
This method will call > updateAppsRunnability at the end, and set appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps",as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is below: > {code:java} > private void updateAppsRunnability(List> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List noLongerPendingApps = new ArrayList(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if (noLongerPendingApps.size() >= maxRunnableApps) { > break; > } > } > prev = next; > } > ... > {code} > maxRunnableApps is the number of apps which can be runnable because of the > removal of previous attempts, but nowMaybeRunnable actually is a list of > lists, and the size of nowMaybeRunnable is actually a size
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is shown below: {code:java} private void updateAppsRunnability(List<List<FSAppAttempt>> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List<FSAppAttempt> noLongerPendingApps = new ArrayList<>(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is meant to be the number of apps that can become runnable because of the removal of previous attempts, but appsNowMaybeRunnable is actually a list of lists, so its size is the number of queues rather than the number of apps; the break condition may therefore never trigger, so this is a bug. was: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is shown below: {code:java} private void updateAppsRunnability(List<List<FSAppAttempt>> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List<FSAppAttempt> noLongerPendingApps = new ArrayList<>(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} maxRunnableApps is meant to be the number of apps that can become runnable because of the removal of previous attempts, but appsNowMaybeRunnable is actually a list of lists, so its size is the number of queues rather than the number of apps; this is a bug. > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing an app attempt calls > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find > non-runnable apps and make them no longer pending. > This method calls > updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is shown below: > {code:java} > private void updateAppsRunnability(List<List<FSAppAttempt>> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List<FSAppAttempt> noLongerPendingApps = new ArrayList<>(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if (noLongerPendingApps.size() >= maxRunnableApps) { > break; > } > } > prev = next; > } > ... > {code} > maxRunnableApps is meant to be the number of apps that can become runnable > because of the removal of previous attempts, but appsNowMaybeRunnable is > actually a list of lists, so its size is the number of queues rather than > the number of apps; this is a bug.
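The mismatch between the two sizes can be illustrated with a small self-contained sketch; `totalSize` is a hypothetical helper for illustration, not part of the YARN code or the patch:

```java
import java.util.Arrays;
import java.util.List;

// A small self-contained illustration of the bug: the cap passed to
// updateAppsRunnability is the size of the OUTER list (number of queues),
// while the intended cap is the total number of app attempts across all
// per-queue lists. totalSize is a hypothetical helper, not YARN code.
public class MaxRunnableAppsFix {

    // Sums the sizes of the inner lists: the number of apps, not queues.
    static <T> int totalSize(List<List<T>> listOfLists) {
        int total = 0;
        for (List<T> inner : listOfLists) {
            total += inner.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Three queues holding 2, 3 and 1 maybe-runnable attempts each.
        List<List<String>> appsNowMaybeRunnable = Arrays.asList(
                Arrays.asList("attempt1", "attempt2"),
                Arrays.asList("attempt3", "attempt4", "attempt5"),
                Arrays.asList("attempt6"));

        // Buggy cap: number of queues.
        System.out.println(appsNowMaybeRunnable.size());      // prints 3
        // Intended cap: number of attempts that may become runnable.
        System.out.println(totalSize(appsNowMaybeRunnable));  // prints 6
    }
}
```

With the buggy cap of 3, the loop can stop after making only three attempts runnable even though six could be; with a cap smaller than the number of apps in any iteration order, the break fires too early rather than never, which is the behavior the issue title describes from the opposite direction.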
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} updateAppsRunnability is shown below: {code:java} private void updateAppsRunnability(List<List<FSAppAttempt>> appsNowMaybeRunnable, int maxRunnableApps) { // Scan through and check whether this means that any apps are now runnable Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator( appsNowMaybeRunnable); FSAppAttempt prev = null; List<FSAppAttempt> noLongerPendingApps = new ArrayList<>(); while (iter.hasNext()) { FSAppAttempt next = iter.next(); if (next == prev) { continue; } if (canAppBeRunnable(next.getQueue(), next)) { trackRunnableApp(next); FSAppAttempt appSched = next; next.getQueue().addApp(appSched, true); noLongerPendingApps.add(appSched); if (noLongerPendingApps.size() >= maxRunnableApps) { break; } } prev = next; } ... {code} appsNowMaybeRunnable is actually a list of lists; the size of this outer list is the number of queues, not the number of apps. was: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing an app attempt calls > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find > non-runnable apps and make them no longer pending. This method calls > updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} > updateAppsRunnability is shown below: > {code:java} > private void updateAppsRunnability(List<List<FSAppAttempt>> > appsNowMaybeRunnable, int maxRunnableApps) { > // Scan through and check whether this means that any apps are now > runnable > Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator( > appsNowMaybeRunnable); > FSAppAttempt prev = null; > List<FSAppAttempt> noLongerPendingApps = new ArrayList<>(); > while (iter.hasNext()) { > FSAppAttempt next = iter.next(); > if (next == prev) { > continue; > } > if (canAppBeRunnable(next.getQueue(), next)) { > trackRunnableApp(next); > FSAppAttempt appSched = next; > next.getQueue().addApp(appSched, true); > noLongerPendingApps.add(appSched); > if (noLongerPendingApps.size() >= maxRunnableApps) { > break; > } > } > prev = next; > } > ... > {code} > appsNowMaybeRunnable is actually a list of lists; the size of this outer > list is the number of queues, not the number of apps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break
[ https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10868: - Description: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps", as below: {code:java} updateAppsRunnability(appsNowMaybeRunnable, appsNowMaybeRunnable.size()); {code} was: In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps" > FairScheduler: updateAppsRunnability never break > > > Key: YARN-10868 > URL: https://issues.apache.org/jira/browse/YARN-10868 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > > In FairScheduler, removing an app attempt calls > MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find > non-runnable apps and make them no longer pending. This method calls > updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the > method parameter "maxRunnableApps", as below: > {code:java} > updateAppsRunnability(appsNowMaybeRunnable, > appsNowMaybeRunnable.size()); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10868) FairScheduler: updateAppsRunnability never break
Song Jiacheng created YARN-10868: Summary: FairScheduler: updateAppsRunnability never break Key: YARN-10868 URL: https://issues.apache.org/jira/browse/YARN-10868 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.2.1 Reporter: Song Jiacheng In FairScheduler, removing an app attempt calls MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable apps and make them no longer pending. This method calls updateAppsRunnability at the end, passing appsNowMaybeRunnable.size() as the method parameter "maxRunnableApps" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9643) Federation: Add subClusterID in nodes page of Router web
[ https://issues.apache.org/jira/browse/YARN-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368729#comment-17368729 ] Song Jiacheng commented on YARN-9643: - [~hunhun], thanks for the reply. I have already done this myself, but thank you all the same. > Federation: Add subClusterID in nodes page of Router web > > > Key: YARN-9643 > URL: https://issues.apache.org/jira/browse/YARN-9643 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Attachments: nodes.png > > > In the nodes page of the Router web UI, there is only node info; there is no > cluster ID corresponding to each node. > [http://127.0.0.1:8089/cluster/nodes|http://192.168.169.72:8089/cluster/nodes] > !nodes.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9643) Federation: Add subClusterID in nodes page of Router web
[ https://issues.apache.org/jira/browse/YARN-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367928#comment-17367928 ] Song Jiacheng commented on YARN-9643: - Hi [~hunhun], thanks for reporting this. It would make managing the federation more convenient. Is there any progress so far? > Federation: Add subClusterID in nodes page of Router web > > > Key: YARN-9643 > URL: https://issues.apache.org/jira/browse/YARN-9643 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > Attachments: nodes.png > > > In the nodes page of the Router web UI, there is only node info; there is no > cluster ID corresponding to each node. > [http://127.0.0.1:8089/cluster/nodes|http://192.168.169.72:8089/cluster/nodes] > !nodes.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367924#comment-17367924 ] Song Jiacheng commented on YARN-10794: -- This has been fixed by https://issues.apache.org/jira/browse/YARN-9693 Sorry for not seeing that earlier; closing this. > Submitting jobs to a single subcluster will fail while AMRMProxy is enabled > --- > > Key: YARN-10794 > URL: https://issues.apache.org/jira/browse/YARN-10794 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch > > > Sorry for not knowing how to quote an issue... > https://issues.apache.org/jira/browse/YARN-9693 > That issue already raised this problem, but it seems that I can't submit > jobs via the federation client while using its patch. > The original reason for this problem is that the NM sets a local AMRMToken > for the AM if AMRMProxy is enabled, so the AM will fail if it contacts the > RM directly. > This problem makes it impossible to rolling-upgrade to federation, because > we can't upgrade all the NMs and clients at the same moment. > So I developed another patch; using it I can submit jobs via both paths. > My solution is to hold two tokens at the same time and choose the right one > while building the RPC client. > I tested this patch in situations like AM recovery and NM recovery, and > found no errors. > Still, I can't ensure this patch is good, so I wonder if there is a better > solution. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367924#comment-17367924 ] Song Jiacheng edited comment on YARN-10794 at 6/23/21, 7:33 AM: This has been fixed by https://issues.apache.org/jira/browse/YARN-10229 Sorry for not seeing that earlier; closing this. was (Author: song jiacheng): This has been fixed by https://issues.apache.org/jira/browse/YARN-9693 Sorry for not seeing that earlier; closing this. > Submitting jobs to a single subcluster will fail while AMRMProxy is enabled > --- > > Key: YARN-10794 > URL: https://issues.apache.org/jira/browse/YARN-10794 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch > > > Sorry for not knowing how to quote an issue... > https://issues.apache.org/jira/browse/YARN-9693 > That issue already raised this problem, but it seems that I can't submit > jobs via the federation client while using its patch. > The original reason for this problem is that the NM sets a local AMRMToken > for the AM if AMRMProxy is enabled, so the AM will fail if it contacts the > RM directly. > This problem makes it impossible to rolling-upgrade to federation, because > we can't upgrade all the NMs and clients at the same moment. > So I developed another patch; using it I can submit jobs via both paths. > My solution is to hold two tokens at the same time and choose the right one > while building the RPC client. > I tested this patch in situations like AM recovery and NM recovery, and > found no errors. > Still, I can't ensure this patch is good, so I wonder if there is a better > solution. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10794: - Comment: was deleted (was: https://issues.apache.org/jira/browse/YARN-9693) > Submitting jobs to a single subcluster will fail while AMRMProxy is enabled > --- > > Key: YARN-10794 > URL: https://issues.apache.org/jira/browse/YARN-10794 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch > > > Sorry for not knowing how to quote an issue... > https://issues.apache.org/jira/browse/YARN-9693 > That issue already raised this problem, but it seems that I can't submit > jobs via the federation client while using its patch. > The original reason for this problem is that the NM sets a local AMRMToken > for the AM if AMRMProxy is enabled, so the AM will fail if it contacts the > RM directly. > This problem makes it impossible to rolling-upgrade to federation, because > we can't upgrade all the NMs and clients at the same moment. > So I developed another patch; using it I can submit jobs via both paths. > My solution is to hold two tokens at the same time and choose the right one > while building the RPC client. > I tested this patch in situations like AM recovery and NM recovery, and > found no errors. > Still, I can't ensure this patch is good, so I wonder if there is a better > solution. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367921#comment-17367921 ] Song Jiacheng commented on YARN-10794: -- https://issues.apache.org/jira/browse/YARN-9693 > Submitting jobs to a single subcluster will fail while AMRMProxy is enabled > --- > > Key: YARN-10794 > URL: https://issues.apache.org/jira/browse/YARN-10794 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch > > > Sorry for not knowing how to quote an issue... > https://issues.apache.org/jira/browse/YARN-9693 > That issue already raised this problem, but it seems that I can't submit > jobs via the federation client while using its patch. > This problem makes it impossible to rolling-upgrade to federation, because > we can't upgrade all the NMs and clients at the same moment. > So I developed another patch; using it I can submit jobs via both paths. > I tested it in situations like AM recovery and NM recovery, and found no > errors. > Still, I can't ensure this patch is good, so I wonder if there is a better > solution. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9049) Add application submit data to state store
[ https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360505#comment-17360505 ] Song Jiacheng commented on YARN-9049: - Hi [~botong], [~bibinchundatt], I have developed a patch that persists the ApplicationSubmissionContext in ZooKeeper, but I wonder whether this will put too much pressure on ZK, because, as we know, every NM holds a connection to ZK. Moreover, a query for the ApplicationSubmissionContext goes to ZK directly, without consulting the cache. > Add application submit data to state store > -- > > Key: YARN-9049 > URL: https://issues.apache.org/jira/browse/YARN-9049 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Bibin Chundatt >Priority: Major > Attachments: YARN-9049.001.path > > > As per the discussion in YARN-8898 we need to persist trimmed > ApplicationSubmissionContext details to the federation state store. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during rolling upgrade from 2.6 to 3.2
[ https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10791: - Summary: Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2 (was: Graceful decommission causes NPE during Rolling upgrade from 2.6 to 3.2 ) > Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2 > -- > > Key: YARN-10791 > URL: https://issues.apache.org/jira/browse/YARN-10791 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Minor > Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, > image-2021-05-31-10-37-31-795.png > > > We are rolling-upgrading YARN from 2.6.0 to 3.2.1, and we hit this exception > while upgrading the NMs. > When we exclude a node and call refreshNodes gracefully, all the MR AMs fail. > 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN > CONTACTING RM. > java.lang.NullPointerException > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282) > at java.lang.Thread.run(Thread.java:745) > The reason is that we gracefully decommission nodes while AMs are still > running 2.6 MR code. > handleUpdatedNodes in 2.6 MR cannot recognize the node state > DECOMMISSIONING. > So I added a config to decide whether we should send DECOMMISSIONING node > updates to AMs. > I don't know if it needs to be fixed upstream; I'm just raising a solution > for this situation. > !image-2021-05-31-10-32-17-541.png! > There are 2 nodes in the cluster, and the AM is deployed on node 44. I > excluded node 46, the other node in the cluster, then ran refreshNodes, and > the error above occurred. > As I said, I think the root cause is the compatibility of > NodeStateProto > !image-2021-05-31-10-37-31-795.png! > 2.6 MR cannot recognize DECOMMISSIONING and SHUTDOWN -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
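The config-gated filtering described above can be sketched as follows. This is a hypothetical illustration of the approach, not the attached patch: the config key name, the enum, and the helper are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the gating described above: when the flag is off,
// drop node states that 2.6-era MR AMs cannot parse (DECOMMISSIONING,
// SHUTDOWN) before sending the updated-nodes list in the allocate response.
// The config key and helper names are illustrative, not the actual patch.
public class NodeUpdateFilter {

    enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, DECOMMISSIONING, SHUTDOWN }

    // Illustrative key name; not necessarily what the attached patch uses.
    static final String SEND_NEW_STATES_TO_AM =
            "yarn.resourcemanager.send-decommissioning-nodes-to-am";

    static List<NodeState> filterForLegacyAm(List<NodeState> updatedNodes,
                                             boolean sendNewStates) {
        if (sendNewStates) {
            return updatedNodes; // upgraded AMs understand all states
        }
        List<NodeState> compatible = new ArrayList<>();
        for (NodeState state : updatedNodes) {
            // 2.6 MR AMs NPE on states missing from their NodeStateProto.
            if (state != NodeState.DECOMMISSIONING && state != NodeState.SHUTDOWN) {
                compatible.add(state);
            }
        }
        return compatible;
    }
}
```

The trade-off of this design is that old AMs simply never learn about decommissioning nodes, which matches pre-2.7 behavior where the DECOMMISSIONING state did not exist.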
[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10794: - Description: Sorry for not knowing how to quote an issue... https://issues.apache.org/jira/browse/YARN-9693 That issue already raised this problem, but it seems that I can't submit jobs via the federation client while using its patch. The original reason for this problem is that the NM sets a local AMRMToken for the AM if AMRMProxy is enabled, so the AM will fail if it contacts the RM directly. This problem makes it impossible to rolling-upgrade to federation, because we can't upgrade all the NMs and clients at the same moment. So I developed another patch; using it I can submit jobs via both paths. My solution is to hold two tokens at the same time and choose the right one while building the RPC client. I tested this patch in situations like AM recovery and NM recovery, and found no errors. Still, I can't ensure this patch is good, so I wonder if there is a better solution. was: Sorry for not knowing how to quote an issue... https://issues.apache.org/jira/browse/YARN-9693 That issue already raised this problem, but it seems that I can't submit jobs via the federation client while using its patch. This problem makes it impossible to rolling-upgrade to federation, because we can't upgrade all the NMs and clients at the same moment. So I developed another patch; using it I can submit jobs via both paths. I tested it in situations like AM recovery and NM recovery, and found no errors. Still, I can't ensure this patch is good, so I wonder if there is a better solution. > Submitting jobs to a single subcluster will fail while AMRMProxy is enabled > --- > > Key: YARN-10794 > URL: https://issues.apache.org/jira/browse/YARN-10794 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch > > > Sorry for not knowing how to quote an issue... > https://issues.apache.org/jira/browse/YARN-9693 > That issue already raised this problem, but it seems that I can't submit > jobs via the federation client while using its patch. > The original reason for this problem is that the NM sets a local AMRMToken > for the AM if AMRMProxy is enabled, so the AM will fail if it contacts the > RM directly. > This problem makes it impossible to rolling-upgrade to federation, because > we can't upgrade all the NMs and clients at the same moment. > So I developed another patch; using it I can submit jobs via both paths. > My solution is to hold two tokens at the same time and choose the right one > while building the RPC client. > I tested this patch in situations like AM recovery and NM recovery, and > found no errors. > Still, I can't ensure this patch is good, so I wonder if there is a better > solution. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
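The dual-token idea in the description can be sketched as below. This is a hypothetical illustration only: the selection rule (pick the token whose service field matches the address the RPC client is being built for) is an assumption about the approach, and the `Token` class is a stand-in, not YARN's `org.apache.hadoop.security.token.Token`.

```java
import java.util.List;

// Hypothetical sketch of holding two AMRMTokens at once and picking the
// one whose service field matches the endpoint the RPC client targets
// (the local AMRMProxy vs. the real RM). Token is a stand-in class, not
// YARN's actual token type; the matching rule is an assumption.
public class TokenSelector {

    static class Token {
        final String service; // e.g. "127.0.0.1:8049" (AMRMProxy) or "rm1:8030" (RM)

        Token(String service) {
            this.service = service;
        }
    }

    // Returns the token bound to the target address, or null if none match
    // (callers would then fall back to the default single-token behavior).
    static Token select(List<Token> tokens, String targetAddress) {
        for (Token token : tokens) {
            if (token.service.equals(targetAddress)) {
                return token;
            }
        }
        return null;
    }
}
```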
[jira] [Comment Edited] (YARN-10786) Federation:We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354305#comment-17354305 ] Song Jiacheng edited comment on YARN-10786 at 5/31/21, 8:45 AM: [~zhengchenyu] Thanks for the comment. I set yarn.web-proxy.address to all the subcluster webapp addresses, so that all the subclusters can access the AM pages. I have thought about other solutions, but all of them change a lot and may break some other rules. was (Author: song jiacheng): [~zhengchenyu] Thanks for the comment. {panel:title=My title} In the other way, if we have more than one subcluster, this way may not be good. {panel} I set yarn.web-proxy.address to all the subcluster webapp addresses, so that all the subclusters can access the AM pages. I have thought about other solutions, but all of them change a lot and may break some other rules. > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason is that the AM gets the proxy URI from the config > yarn.web-proxy.address, and if it does not exist, it gets the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I made this fix: > 1. Add this config in the yarn-site.xml on the client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way the config is read from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > With this I can access the AM page now. > This config needs to be added on the client side, so it affects > applications only. > Before the fix, clicking the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
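The described change (reading yarn.web-proxy.address as a comma-separated list rather than a single value) can be sketched in a self-contained way. The splitting helper below mimics the semantics of Hadoop's Configuration#getStrings; it is not the Hadoop implementation itself.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the described change: treat yarn.web-proxy.address as a
// comma-separated list (Configuration#getStrings semantics) instead of one
// value (Configuration#get), so every subcluster RM can serve as a proxy
// host in the AM filter. This helper mimics getStrings; it is not the
// Hadoop implementation.
public class ProxyHostsSketch {

    // Split on commas and trim whitespace, like Configuration#getStrings.
    static List<String> getStrings(String value) {
        List<String> parts = new ArrayList<>();
        for (String part : value.split(",")) {
            parts.add(part.trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        // With Configuration#get, this whole string was treated as a single
        // (invalid) proxy address; as a list, each RM becomes a valid host.
        for (String hostAndPort : getStrings("rm1:9088, rm2:9088")) {
            System.out.println(hostAndPort);
        }
    }
}
```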
[jira] [Commented] (YARN-10786) Federation:We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354305#comment-17354305 ] Song Jiacheng commented on YARN-10786: -- [~zhengchenyu] Thanks for the comment. {panel:title=My title} In the other way, if we have more than one subcluster, this way may not be good. {panel} I set yarn.web-proxy.address to all the subcluster webapp addresses, so that all the subclusters can access the AM pages. I have thought about other solutions, but all of them change a lot and may break some other rules. > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason is that the AM gets the proxy URI from the config > yarn.web-proxy.address, and if it does not exist, it gets the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I made this fix: > 1. Add this config in the yarn-site.xml on the client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way the config is read from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > With this I can access the AM page now. > This config needs to be added on the client side, so it affects > applications only. > Before the fix, clicking the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal... 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10794:
-
Description:
Sorry for not knowing how to quote an issue... https://issues.apache.org/jira/browse/YARN-9693
That issue has already raised this problem, but it seems that I can't submit a job via the federation client while using its patch.
This makes it impossible to do a rolling upgrade to federation, because we can't upgrade all the NMs and clients at the same time.
So I developed another patch; with it I can submit jobs both ways.
I tested the patch in situations such as AM recovery and NM recovery, and found no errors.
Still, I can't be sure this patch is good, so I wonder if there is a better solution.

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.2.1
> Reporter: Song Jiacheng
> Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
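The rolling-upgrade problem described above is that, once AMRMProxy is enabled, AMs holding a plain RM token (legacy clients) and AMs registered through the federation path must both keep working. Below is a minimal, self-contained sketch of that dual-path idea — it is NOT the actual YARN-10794 patch, and every name in it (`TokenKind`, `AmRequestRouter`, the endpoint strings) is made up for illustration:

```java
import java.util.Objects;

// Hypothetical token classification: during a rolling upgrade, some AMs
// register with a proxy-issued (federation) token, others with a plain RM
// token issued by the home RM. Both kinds must be accepted.
enum TokenKind { FEDERATION_PROXY, PLAIN_RM }

final class AmRequestRouter {
    /**
     * Picks the endpoint an AM request is forwarded to. Legacy AMs (plain RM
     * token) bypass the federation interceptor chain and talk to the home RM
     * directly, so jobs submitted to a single subcluster still succeed while
     * AMRMProxy is enabled; federation-aware AMs go through the chain.
     */
    static String route(TokenKind kind, String homeRm, String interceptorChain) {
        Objects.requireNonNull(kind, "token kind");
        return kind == TokenKind.PLAIN_RM ? homeRm : interceptorChain;
    }
}
```

Routing on the credential the AM actually presents, rather than on a cluster-wide flag, is what lets the two submission paths coexist until every NM and client has been upgraded.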
[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
[ https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354280#comment-17354280 ] Song Jiacheng commented on YARN-10794:
-
I committed a patch based on the trunk.
[jira] [Created] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
Song Jiacheng created YARN-10794: Summary: Submitting jobs to a single subcluster will fail while AMRMProxy is enabled Key: YARN-10794 URL: https://issues.apache.org/jira/browse/YARN-10794 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.2.1 Reporter: Song Jiacheng
[jira] [Updated] (YARN-10791) Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2
[ https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10791:
-
Description:
We are rolling-upgrading YARN from 2.6.0 to 3.2.1, and we hit this exception while upgrading the NMs.
When we exclude a node and call refreshNodes gracefully, all the MR AMs fail:

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
at java.lang.Thread.run(Thread.java:745)

The reason is that we gracefully decommission nodes while still running 2.6 MR: handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING".
So I added a config to decide whether we should send DECOMMISSIONING to the AMs. I don't know if this needs to be fixed upstream; I'm just raising a solution for this situation.

!image-2021-05-31-10-32-17-541.png!

There are 2 nodes in the cluster, and the AM is deployed on node 44. I excluded node 46, the other node in the cluster, then ran refreshNodes, and the error above occurred.
As I said, I think the root cause is the compatibility of NodeStateProto:

!image-2021-05-31-10-37-31-795.png!

2.6 MR cannot recognize DECOMMISSIONING and SHUTDOWN.

> Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 3.2.1
> Reporter: Song Jiacheng
> Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, image-2021-05-31-10-37-31-795.png
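The fix described above — a config that decides whether DECOMMISSIONING is sent to the AMs — can be sketched as a filter the RM applies to the updated-nodes list before putting it in an allocate response. This is a self-contained illustration, not the attached patch: the `NodeState` enum mirrors YARN's but is redeclared here, and the config name in the comment is invented:

```java
import java.util.List;
import java.util.stream.Collectors;

// Local stand-in for org.apache.hadoop.yarn.api.records.NodeState.
// DECOMMISSIONING and SHUTDOWN were added after 2.6; a 2.6 MR AM maps them
// to null when deserializing NodeStateProto, which triggers the NPE in
// handleUpdatedNodes shown in the stack trace above.
enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED,
                 DECOMMISSIONING, SHUTDOWN }

final class UpdatedNodesFilter {
    /** States a 2.6-era AM cannot recognize. */
    private static final List<NodeState> POST_2_6_STATES =
            List.of(NodeState.DECOMMISSIONING, NodeState.SHUTDOWN);

    /**
     * Drops post-2.6 states from the updated-nodes report when the
     * (hypothetical) flag, e.g. "send new node states to AMs", is off —
     * old AMs then never see a state they would map to null.
     */
    static List<NodeState> filter(List<NodeState> updated, boolean sendNewStates) {
        if (sendNewStates) {
            return updated;
        }
        return updated.stream()
                .filter(s -> !POST_2_6_STATES.contains(s))
                .collect(Collectors.toList());
    }
}
```

The trade-off of such a flag is that old AMs simply never learn a node is draining, so the operator turns it off only for the duration of the rolling upgrade.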
[jira] [Commented] (YARN-10791) Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2
[ https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354182#comment-17354182 ] Song Jiacheng commented on YARN-10791:
-
[~epayne], thanks for the comment. I know what you mean, but this error was raised in all the MR AMs, not just on the node I excluded. Maybe my description is not detailed enough; I'll add some details.
[jira] [Updated] (YARN-10791) Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2
[ https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10791: - Priority: Minor (was: Major) > Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2 > -- > > Key: YARN-10791 > URL: https://issues.apache.org/jira/browse/YARN-10791 > Project: Hadoop YARN > Issue Type: Bug > Components: RM >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Minor > > We are rolling-upgrading YARN from 2.6.0 to 3.2.1, and we hit this exception while upgrading the NMs. > 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. > java.lang.NullPointerException > at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883) > at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821) > at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316) > at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282) > at java.lang.Thread.run(Thread.java:745) > The reason is that we gracefully decommission nodes while the AMs still run 2.6 MR. > handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING", > so I added a config to decide whether we should send DECOMMISSIONING to AMs. > I don't know if it is a bug; I am just raising a solution for this situation
[jira] [Created] (YARN-10791) Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2
Song Jiacheng created YARN-10791: Summary: Graceful decommission causes NPE during rolling upgrade from 2.6 to 3.2 Key: YARN-10791 URL: https://issues.apache.org/jira/browse/YARN-10791 Project: Hadoop YARN Issue Type: Bug Components: RM Affects Versions: 3.2.1 Reporter: Song Jiacheng We are rolling-upgrading YARN from 2.6.0 to 3.2.1, and we hit this exception while upgrading the NMs. 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.lang.NullPointerException at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282) at java.lang.Thread.run(Thread.java:745) The reason is that we gracefully decommission nodes while the AMs still run 2.6 MR. handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING", so I added a config to decide whether we should send DECOMMISSIONING to AMs. I don't know if it is a bug; I am just raising a solution for this situation
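The NPE above is consistent with the 2.6-era MR AM not knowing the DECOMMISSIONING state that a 3.2 RM reports: an enum value added after 2.6 can come back as null on the old client, and handleUpdatedNodes then dereferences it. The sketch below is illustrative only (the class and method names are hypothetical, not Hadoop source); it shows how an unknown state name surfaces as null and how a defensive guard avoids the NPE.

```java
// Illustrative sketch only; not Hadoop source code.
// OldNodeState mimics a 2.6-era MR client's view of node states,
// which predates DECOMMISSIONING.
public class NodeStateSketch {
    enum OldNodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

    // Lenient lookup, the way protobuf-style decoding behaves:
    // an unknown name maps to null instead of throwing.
    static OldNodeState lookup(String name) {
        for (OldNodeState s : OldNodeState.values()) {
            if (s.name().equals(name)) {
                return s;
            }
        }
        return null; // "DECOMMISSIONING" sent by a 3.2 RM lands here
    }

    // Defensive handling: skip an unrecognized state instead of
    // dereferencing null, which is what triggers the NPE.
    static String handleUpdatedNode(String stateFromRm) {
        OldNodeState state = lookup(stateFromRm);
        if (state == null) {
            return "ignored-unknown-state";
        }
        return state == OldNodeState.UNHEALTHY ? "mark-unusable" : "ok";
    }

    public static void main(String[] args) {
        System.out.println(handleUpdatedNode("UNHEALTHY"));
        System.out.println(handleUpdatedNode("DECOMMISSIONING"));
    }
}
```

Either guarding the AM side like this or gating the RM side behind a config (the approach the patch takes) avoids the crash during the upgrade window.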
[jira] [Commented] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351007#comment-17351007 ] Song Jiacheng commented on YARN-10786: -- [~zhuqi], Thanks for the review~ :D > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config needs to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config needs to be added in the client side, so it will affect application only. Before fixing, click the AM link in RM or Router: !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! And after the fix, we can access the AM page as normal... was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. Before fixing, click the AM link in RM or Router: !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! And after the fix, we can access the AM page as normal... 
> Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config needs to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Attachment: n_v25156273211c049f8b396dcf15fcd9a84.png > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. Before fixing, click the AM link in RM or Router: !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! And after the fix, we can access the AM page as normal... was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. Before fixing, click the AM link in RM or Router: !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! And after the fix, we can access the AM page as normal... 
> Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > n_v25156273211c049f8b396dcf15fcd9a84.png, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Commented] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350910#comment-17350910 ] Song Jiacheng commented on YARN-10786: -- Hi,[~zhuqi], I have added the error page before the fix. After the fix, we can access the AM page as normal. > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. Before fixing, click the AM link in RM or Router: !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! And after the fix, we can access the AM page as normal... was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. 
> Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only. > Before fixing, click the AM link in RM or Router: > !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png! > And after the fix, we can access the AM page as normal...
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Attachment: v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch, > v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only.
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: 1. Add this config in the yarn-site.xml on client. yarn.web-proxy.address rm1:9088,rm2:9088 2. Change the way to get the config from Configuration#get to Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then gets the config with Configuration#getStrings. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > Fix For: 3.2.1 > > Attachments: YARN-10786.v1.patch > > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > 1. Add this config in the yarn-site.xml on client. 
> > yarn.web-proxy.address > rm1:9088,rm2:9088 > > 2. Change the way to get the config from Configuration#get to > Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only.
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then gets the config with Configuration#getStrings. So that I can access the AM page now. This config need to be added in the client side, so it will affect application only. was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then gets the config with Configuration#getStrings. So that I can access the AM page now > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > And then gets the config with Configuration#getStrings. > So that I can access the AM page now. > This config need to be added in the client side, so it will affect > application only. 
[jira] [Updated] (YARN-10786) Federation: We can't access the AM page while using federation
[ https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jiacheng updated YARN-10786: - Description: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then gets the config with Configuration#getStrings. So that I can access the AM page now was: The reason of this is that AM gets the proxy URI from config yarn.web-proxy.address, and it does not exist, it will gets the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then gets the config with Configuration#getStrings. So that I can access the AM page now > Federation:We can't access the AM page while using federation > - > > Key: YARN-10786 > URL: https://issues.apache.org/jira/browse/YARN-10786 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.1 >Reporter: Song Jiacheng >Priority: Major > Labels: federation > > The reason of this is that AM gets the proxy URI from config > yarn.web-proxy.address, and if it does not exist, it will get the URI from > yarn.resourcemanager.webapp.address. > But in federation, we don't know which RM will be the home cluster of an > application, so I do this fix: > > yarn.web-proxy.address > rm1:9088,rm2:9088 > > And then gets the config with Configuration#getStrings. > So that I can access the AM page now
[jira] [Created] (YARN-10786) Federation: We can't access the AM page while using federation
Song Jiacheng created YARN-10786: Summary: Federation: We can't access the AM page while using federation Key: YARN-10786 URL: https://issues.apache.org/jira/browse/YARN-10786 Project: Hadoop YARN Issue Type: Bug Components: federation Affects Versions: 3.2.1 Reporter: Song Jiacheng The reason for this is that the AM gets the proxy URI from the config yarn.web-proxy.address, and if it does not exist, it will get the URI from yarn.resourcemanager.webapp.address. But in federation, we don't know which RM will be the home cluster of an application, so I do this fix: yarn.web-proxy.address rm1:9088,rm2:9088 And then get the config with Configuration#getStrings. So that I can access the AM page now
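The yarn-site.xml snippet quoted in the descriptions above was flattened during extraction; it presumably corresponds to a property block with name yarn.web-proxy.address and value rm1:9088,rm2:9088. The sketch below is a plain-Java approximation (the class and method names are hypothetical, not Hadoop code) of the proposed change in WebAppUtils#getProxyHostsAndPortsForAmFilter: reading the value as a comma-separated list, as Configuration#getStrings does, rather than as one opaque string, as Configuration#get does.

```java
// Plain-Java approximation; ProxyAddressSketch is a hypothetical name,
// not part of Hadoop. The reconstructed yarn-site.xml entry, assuming a
// standard property block:
//   <property>
//     <name>yarn.web-proxy.address</name>
//     <value>rm1:9088,rm2:9088</value>
//   </property>
import java.util.ArrayList;
import java.util.List;

public class ProxyAddressSketch {
    // Configuration#get semantics: the whole raw value, commas included,
    // so "rm1:9088,rm2:9088" is treated as one (invalid) proxy address.
    static String getSingle(String raw) {
        return raw;
    }

    // Configuration#getStrings semantics: split on commas and trim, so the
    // AM filter can register every RM's proxy host:port.
    static List<String> getList(String raw) {
        List<String> out = new ArrayList<>();
        for (String part : raw.split(",")) {
            out.add(part.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String raw = "rm1:9088,rm2:9088";
        System.out.println(getSingle(raw));
        System.out.println(getList(raw));
    }
}
```

With the list form, the AM filter can accept a proxied request from whichever RM turned out to be the application's home cluster, which is why the single-valued read breaks under federation.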