[jira] [Comment Edited] (YARN-9854) RM jetty hang due to WebAppProxyServlet lacks of timeout while doing proxyLink

2022-04-22 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526243#comment-17526243
 ] 

Song Jiacheng edited comment on YARN-9854 at 4/22/22 7:20 AM:
--

Thanks for your patch, [~suxingfate].
We encountered this problem today, and the situation was exactly the same as
yours. I think this patch will be helpful.
Since no one seems to be following up on this issue, I will apply this patch in
our version. Thanks again for the patch!


> RM jetty hang due to WebAppProxyServlet lacks of timeout while doing proxyLink
> --
>
> Key: YARN-9854
> URL: https://issues.apache.org/jira/browse/YARN-9854
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: amrmproxy, resourcemanager, webapp
> Reporter: Wang, Xinglong
> Assignee: Wang, Xinglong
> Priority: Major
> Attachments: YARN-9854.001.patch
>
>
> The RM proxies requests for [http://rm:port/proxy/application_x] to the AM
> or the related history server.
> Recently we hit https://issues.apache.org/jira/browse/SPARK-26961, which
> can cause a Spark AM to hang forever.
> We also have a monitoring tool that accesses
> [http://rm:port/proxy/application_x] periodically. Every proxied connection
> to the hung Spark AM therefore also hangs forever, because
> WebAppProxyServlet does not set a socket connection timeout when it
> initializes the httpclient towards this Spark AM.
>
> The jetty server hosting the RM servlets has a limited number of threads.
> Each such request leaves one thread hanging while it waits for the Spark AM
> response. Eventually all jetty threads serving http traffic hang, and all RM
> web links become unresponsive.
>
> If we configure a timeout on the httpclient, this issue goes away.
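
The timeout idea in the description above can be sketched with the JDK's own HttpURLConnection. This is an illustration only: it is not the code from YARN-9854.001.patch and not necessarily the HTTP client library that WebAppProxyServlet uses. The class name, the helper methods and the timeout values are hypothetical. A local socket that accepts connections but never answers stands in for the hung Spark AM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch, not the WebAppProxyServlet implementation.
class ProxyTimeoutSketch {

  /**
   * Requests the URL and returns the HTTP status code. The call gives up if
   * the connection cannot be established within connectTimeoutMs or if no
   * response data arrives within readTimeoutMs.
   */
  static int fetchStatus(URL url, int connectTimeoutMs, int readTimeoutMs)
      throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
    conn.setConnectTimeout(connectTimeoutMs);
    conn.setReadTimeout(readTimeoutMs);
    try {
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }

  /** Returns true if a request to a server that never answers times out. */
  static boolean timesOutAgainstSilentServer() {
    // The listening socket lets the TCP handshake complete through its
    // backlog, but nothing ever reads the request or writes a response.
    try (ServerSocket silent = new ServerSocket(0)) {
      URL url = URI.create(
          "http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
      try {
        fetchStatus(url, 1000, 500);
        return false;
      } catch (SocketTimeoutException expected) {
        // The calling thread is released instead of waiting forever.
        return true;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("timed out: " + timesOutAgainstSilentServer());
  }
}
```

Without the two timeout setters, both values default to 0, which HttpURLConnection treats as an infinite wait, so the calling thread would block for as long as the remote side stays silent.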



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org




[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break

2021-07-21 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10868:
-
Description: 
In FairScheduler, removing an app attempt calls
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find non-runnable
apps and make them no longer pending. At the end, this method calls
updateAppsRunnability and passes appsNowMaybeRunnable.size() as the
"maxRunnableApps" parameter, as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
    appsNowMaybeRunnable.size());
{code}
updateAppsRunnability is shown below:

{code:java}
private void updateAppsRunnability(List<List<FSAppAttempt>>
    appsNowMaybeRunnable, int maxRunnableApps) {
  // Scan through and check whether this means that any apps are now runnable
  Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
      appsNowMaybeRunnable);
  FSAppAttempt prev = null;
  List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
  while (iter.hasNext()) {
    FSAppAttempt next = iter.next();
    if (next == prev) {
      continue;
    }

    if (canAppBeRunnable(next.getQueue(), next)) {
      trackRunnableApp(next);
      FSAppAttempt appSched = next;
      next.getQueue().addApp(appSched, true);
      noLongerPendingApps.add(appSched);

      if (noLongerPendingApps.size() >= maxRunnableApps) {
        break;
      }
    }

    prev = next;
  }
  ...
{code}

maxRunnableApps is meant to be the number of apps that can become runnable
after the removal of a previous attempt, and this method uses it to break out
of the loop. However, appsNowMaybeRunnable is a list of lists, so
appsNowMaybeRunnable.size() is the number of queues rather than the number of
apps. This is a bug.
I think the argument needs to be changed to 1, because only one attempt has
finished or been moved.
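
The difference between the two limits can be seen in a small standalone sketch. It is a simplification under stated assumptions, not the FairScheduler code: plain strings stand in for FSAppAttempt, every pending app is assumed to pass the canAppBeRunnable check, the queues are walked in order instead of by start time, and the class, method and app names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the loop in updateAppsRunnability.
class MaxRunnableAppsSketch {

  /**
   * Promotes pending apps until maxRunnableApps of them have been promoted,
   * then stops. Returns the promoted apps.
   */
  static List<String> promote(List<List<String>> appsNowMaybeRunnable,
      int maxRunnableApps) {
    List<String> noLongerPendingApps = new ArrayList<>();
    for (List<String> queueApps : appsNowMaybeRunnable) {
      for (String app : queueApps) {
        noLongerPendingApps.add(app);
        if (noLongerPendingApps.size() >= maxRunnableApps) {
          return noLongerPendingApps;
        }
      }
    }
    return noLongerPendingApps;
  }

  public static void main(String[] args) {
    // Three queues with pending apps; the outer list has one entry per queue.
    List<List<String>> pending = Arrays.asList(
        Arrays.asList("app_1", "app_2"),
        Arrays.asList("app_3"),
        Arrays.asList("app_4", "app_5"));

    // Limit taken from the outer list, which is the queue count (3).
    System.out.println(promote(pending, pending.size()));
    // prints [app_1, app_2, app_3]

    // Limit of 1, as proposed in the description.
    System.out.println(promote(pending, 1));
    // prints [app_1]
  }
}
```

With the outer list's size as the limit, the loop stops only after as many promotions as there are queues, even though a single attempt was removed.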




> FairScheduler: updateAppsRunnability never break
> 
>
> Key: YARN-10868
> URL: https://issues.apache.org/jira/browse/YARN-10868
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.2.1
> Reporter: Song Jiacheng
> Priority: Major
>

[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break

2021-07-21 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10868:
-
Description: 
In FairScheduler, removing an app attempt calls
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable
apps and make them not pending. That method calls updateAppsRunnability at
the end and passes appsNowMaybeRunnable.size() as the method parameter
"maxRunnableApps", as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
appsNowMaybeRunnable.size());
{code}
updateAppsRunnability is below:

{code:java}
private void updateAppsRunnability(List<List<FSAppAttempt>>
    appsNowMaybeRunnable, int maxRunnableApps) {
  // Scan through and check whether this means that any apps are now runnable
  Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
      appsNowMaybeRunnable);
  FSAppAttempt prev = null;
  List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
  while (iter.hasNext()) {
    FSAppAttempt next = iter.next();
    if (next == prev) {
      continue;
    }

    if (canAppBeRunnable(next.getQueue(), next)) {
      trackRunnableApp(next);
      FSAppAttempt appSched = next;
      next.getQueue().addApp(appSched, true);
      noLongerPendingApps.add(appSched);

      if (noLongerPendingApps.size() >= maxRunnableApps) {
        break;
      }
    }

    prev = next;
  }
...
{code}

maxRunnableApps is meant to be the number of apps that can become runnable
after the removal of previous attempts, but appsNowMaybeRunnable is actually a
list of lists, so appsNowMaybeRunnable.size() is the number of queues rather
than the number of apps. This is a bug.
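
To make the size mismatch concrete, here is a minimal standalone sketch in plain Java. It does not use any YARN classes; the class name RunnableLimitSketch, its two helper methods, and the app names are made up for illustration. It only shows that size() on a list of lists counts the inner lists, which can differ from the total number of entries they hold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of MaxRunningAppsEnforcer.
final class RunnableLimitSketch {

  // What List#size() reports on the outer list: the number of inner lists.
  static <T> int outerSize(List<List<T>> lists) {
    return lists.size();
  }

  // The total number of elements across all inner lists.
  static <T> int totalElements(List<List<T>> lists) {
    int total = 0;
    for (List<T> inner : lists) {
      total += inner.size();
    }
    return total;
  }

  public static void main(String[] args) {
    // Two "queues": one with three pending apps, one with a single app.
    List<List<String>> appsNowMaybeRunnable = new ArrayList<>();
    appsNowMaybeRunnable.add(Arrays.asList("app1", "app2", "app3"));
    appsNowMaybeRunnable.add(Arrays.asList("app4"));

    System.out.println("outer size: " + outerSize(appsNowMaybeRunnable));
    System.out.println("total apps: " + totalElements(appsNowMaybeRunnable));
  }
}
```

With the sample data above the outer size is 2 while the lists hold 4 apps in total, so a limit taken from the outer size is not a count of apps. Which value the limit should really be is a question for the patch itself, not something this sketch decides.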


  was:
In FairScheduler, remove a app attempt will call 
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable 
apps and make them not pending. This method will call updateAppsRunnability at 
the end, and set appsNowMaybeRunnable.size() as the method parameter 
"maxRunnableApps",as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
appsNowMaybeRunnable.size());
{code}
updateAppsRunnability is below:

{code:java}
private void updateAppsRunnability(List<List<FSAppAttempt>>
    appsNowMaybeRunnable, int maxRunnableApps) {
  // Scan through and check whether this means that any apps are now runnable
  Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
      appsNowMaybeRunnable);
  FSAppAttempt prev = null;
  List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
  while (iter.hasNext()) {
    FSAppAttempt next = iter.next();
    if (next == prev) {
      continue;
    }

    if (canAppBeRunnable(next.getQueue(), next)) {
      trackRunnableApp(next);
      FSAppAttempt appSched = next;
      next.getQueue().addApp(appSched, true);
      noLongerPendingApps.add(appSched);

      if (noLongerPendingApps.size() >= maxRunnableApps) {
        break;
      }
    }

    prev = next;
  }
...
{code}

maxRunnableApps is the number of apps which can be runnable because of the 
removal of previous attempts, but nowMaybeRunnable actually is a list of lists, 
and the size of nowMaybeRunnable is actually a size of queues, so this is a bug.



> FairScheduler: updateAppsRunnability never break
> 
>
> Key: YARN-10868
> URL: https://issues.apache.org/jira/browse/YARN-10868
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>
> In FairScheduler, removing an app attempt calls
> MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some
> non-runnable apps and make them not pending. That method calls
> updateAppsRunnability at the end and passes appsNowMaybeRunnable.size() as the
> method parameter "maxRunnableApps", as below:
> {code:java}
> updateAppsRunnability(appsNowMaybeRunnable,
> appsNowMaybeRunnable.size());
> {code}
> updateAppsRunnability is below:
> {code:java}
> private void updateAppsRunnability(List<List<FSAppAttempt>>
>     appsNowMaybeRunnable, int maxRunnableApps) {
>   // Scan through and check whether this means that any apps are now runnable
>   Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
>       appsNowMaybeRunnable);
>   FSAppAttempt prev = null;
>   List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
>   while (iter.hasNext()) {
>     FSAppAttempt next = iter.next();
>     if (next == prev) {
>       continue;
>     }
>     if (canAppBeRunnable(next.getQueue(), next)) {
>       trackRunnableApp(next);
>       FSAppAttempt appSched = next;
>       next.getQueue().addApp(appSched, true);
>       noLongerPendingApps.add(appSched);
>       if (noLongerPendingApps.size() >= maxRunnableApps) {
>         break;
>       }
>     }
>     prev = next;
>   }
> ...
> {code}
> maxRunnableApps is the number of apps 

[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break

2021-07-21 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10868:
-
Description: 
In FairScheduler, removing an app attempt calls
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable
apps and make them not pending. That method calls updateAppsRunnability at
the end and passes appsNowMaybeRunnable.size() as the method parameter
"maxRunnableApps", as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
appsNowMaybeRunnable.size());
{code}
updateAppsRunnability is below:

{code:java}
private void updateAppsRunnability(List<List<FSAppAttempt>>
    appsNowMaybeRunnable, int maxRunnableApps) {
  // Scan through and check whether this means that any apps are now runnable
  Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
      appsNowMaybeRunnable);
  FSAppAttempt prev = null;
  List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
  while (iter.hasNext()) {
    FSAppAttempt next = iter.next();
    if (next == prev) {
      continue;
    }

    if (canAppBeRunnable(next.getQueue(), next)) {
      trackRunnableApp(next);
      FSAppAttempt appSched = next;
      next.getQueue().addApp(appSched, true);
      noLongerPendingApps.add(appSched);

      if (noLongerPendingApps.size() >= maxRunnableApps) {
        break;
      }
    }

    prev = next;
  }
...
{code}

appsNowMaybeRunnable actually is a list of lists, the size of this


  was:
In FairScheduler, remove a app attempt will call 
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable 
apps and make them not pending. This method will call updateAppsRunnability at 
the end, and set appsNowMaybeRunnable.size() as the method parameter 
"maxRunnableApps",as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
appsNowMaybeRunnable.size());
{code}



> FairScheduler: updateAppsRunnability never break
> 
>
> Key: YARN-10868
> URL: https://issues.apache.org/jira/browse/YARN-10868
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>
> In FairScheduler, removing an app attempt calls
> MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some
> non-runnable apps and make them not pending. That method calls
> updateAppsRunnability at the end and passes appsNowMaybeRunnable.size() as the
> method parameter "maxRunnableApps", as below:
> {code:java}
> updateAppsRunnability(appsNowMaybeRunnable,
> appsNowMaybeRunnable.size());
> {code}
> updateAppsRunnability is below:
> {code:java}
> private void updateAppsRunnability(List<List<FSAppAttempt>>
>     appsNowMaybeRunnable, int maxRunnableApps) {
>   // Scan through and check whether this means that any apps are now runnable
>   Iterator<FSAppAttempt> iter = new MultiListStartTimeIterator(
>       appsNowMaybeRunnable);
>   FSAppAttempt prev = null;
>   List<FSAppAttempt> noLongerPendingApps = new ArrayList<FSAppAttempt>();
>   while (iter.hasNext()) {
>     FSAppAttempt next = iter.next();
>     if (next == prev) {
>       continue;
>     }
>     if (canAppBeRunnable(next.getQueue(), next)) {
>       trackRunnableApp(next);
>       FSAppAttempt appSched = next;
>       next.getQueue().addApp(appSched, true);
>       noLongerPendingApps.add(appSched);
>       if (noLongerPendingApps.size() >= maxRunnableApps) {
>         break;
>       }
>     }
>     prev = next;
>   }
> ...
> {code}
> appsNowMaybeRunnable actually is a list of lists, the size of this



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10868) FairScheduler: updateAppsRunnability never break

2021-07-21 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10868:
-
Description: 
In FairScheduler, removing an app attempt calls
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable
apps and make them not pending. That method calls updateAppsRunnability at
the end and passes appsNowMaybeRunnable.size() as the method parameter
"maxRunnableApps", as below:

{code:java}
updateAppsRunnability(appsNowMaybeRunnable,
appsNowMaybeRunnable.size());
{code}


  was:In FairScheduler, remove a app attempt will call 
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable 
apps and make them not pending. This method will call updateAppsRunnability at 
the end, and set appsNowMaybeRunnable.size() as the method parameter 
"maxRunnableApps"


> FairScheduler: updateAppsRunnability never break
> 
>
> Key: YARN-10868
> URL: https://issues.apache.org/jira/browse/YARN-10868
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>
> In FairScheduler, removing an app attempt calls
> MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some
> non-runnable apps and make them not pending. That method calls
> updateAppsRunnability at the end and passes appsNowMaybeRunnable.size() as the
> method parameter "maxRunnableApps", as below:
> {code:java}
> updateAppsRunnability(appsNowMaybeRunnable,
> appsNowMaybeRunnable.size());
> {code}






[jira] [Created] (YARN-10868) FairScheduler: updateAppsRunnability never break

2021-07-21 Thread Song Jiacheng (Jira)
Song Jiacheng created YARN-10868:


 Summary: FairScheduler: updateAppsRunnability never break
 Key: YARN-10868
 URL: https://issues.apache.org/jira/browse/YARN-10868
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.2.1
Reporter: Song Jiacheng


In FairScheduler, removing an app attempt calls
MaxRunningAppsEnforcer#updateRunnabilityOnAppRemoval to find some non-runnable
apps and make them not pending. That method calls updateAppsRunnability at
the end and passes appsNowMaybeRunnable.size() as the method parameter
"maxRunnableApps".






[jira] [Commented] (YARN-9643) Federation: Add subClusterID in nodes page of Router web

2021-06-24 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368729#comment-17368729
 ] 

Song Jiacheng commented on YARN-9643:
-

[~hunhun], thanks for the reply.

I have done this myself, but thank you all the same.

> Federation: Add subClusterID in nodes page of Router web
> 
>
> Key: YARN-9643
> URL: https://issues.apache.org/jira/browse/YARN-9643
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Attachments: nodes.png
>
>
> In the nodes page of the Router web UI, there is only node info, with no
> cluster ID corresponding to each node.
> [http://127.0.0.1:8089/cluster/nodes|http://192.168.169.72:8089/cluster/nodes]
> !nodes.png!






[jira] [Commented] (YARN-9643) Federation: Add subClusterID in nodes page of Router web

2021-06-23 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367928#comment-17367928
 ] 

Song Jiacheng commented on YARN-9643:
-

Hi, [~hunhun].

Thanks for reporting.

This would make managing the federation more convenient. Is there any progress so far?

> Federation: Add subClusterID in nodes page of Router web
> 
>
> Key: YARN-9643
> URL: https://issues.apache.org/jira/browse/YARN-9643
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: hunshenshi
>Assignee: hunshenshi
>Priority: Major
> Attachments: nodes.png
>
>
> In the nodes page of the Router web UI, there is only node info, with no
> cluster ID corresponding to each node.
> [http://127.0.0.1:8089/cluster/nodes|http://192.168.169.72:8089/cluster/nodes]
> !nodes.png!






[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-06-23 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367924#comment-17367924
 ] 

Song Jiacheng commented on YARN-10794:
--

This has been fixed by https://issues.apache.org/jira/browse/YARN-9693

Sorry for not seeing that; closing it.

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue already raised this problem, but it seems that I can't submit
> jobs through the federation client when using its patch.
> The root cause is that the NM sets a local AMRMToken for the AM if
> AMRMProxy is enabled, so the AM fails if it contacts the RM directly.
> This makes a rolling upgrade to federation impossible, because we
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; with it I can submit jobs both ways.
> My solution is to hold two tokens at the same time and choose the right one
> while building the RPC client.
> I tested this patch in situations such as AM recovery and NM recovery, and
> found no errors.
> But I still can't be sure this patch is good, so I wonder if there is a
> better solution.
>  
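
The "hold two tokens and choose one when building the RPC client" idea can be sketched roughly as below. This is only an illustration under assumptions: the description does not say how the right token is chosen, so the rule used here (compare the target address with the local proxy address) is a guess, and the class TokenChoiceSketch, the method chooseToken, and all addresses and token strings are hypothetical. It is not the code from the attached patches.

```java
// Hypothetical illustration only; not taken from YARN-10794.v1/v2.patch.
final class TokenChoiceSketch {

  // Picks one of two tokens held at the same time.
  // Assumption: the local (AMRMProxy) token is used only when the client
  // is being built against the local proxy address; otherwise the RM token.
  static String chooseToken(String targetAddress, String proxyAddress,
                            String localToken, String rmToken) {
    return targetAddress.equals(proxyAddress) ? localToken : rmToken;
  }

  public static void main(String[] args) {
    String proxy = "nm-host:8049"; // made-up local proxy address
    System.out.println(
        chooseToken("nm-host:8049", proxy, "local-token", "rm-token"));
    System.out.println(
        chooseToken("rm-host:8030", proxy, "local-token", "rm-token"));
  }
}
```

The point of the sketch is only that both tokens stay available and the decision is deferred until the target of the connection is known.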






[jira] [Comment Edited] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-06-23 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367924#comment-17367924
 ] 

Song Jiacheng edited comment on YARN-10794 at 6/23/21, 7:33 AM:


This has been fixed by https://issues.apache.org/jira/browse/YARN-10229

Sorry for not seeing that; closing it.


was (Author: song jiacheng):
This has been fixed by https://issues.apache.org/jira/browse/YARN-9693

Sorry for not seeing that, Closing it.

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue already raised this problem, but it seems that I can't submit
> jobs through the federation client when using its patch.
> The root cause is that the NM sets a local AMRMToken for the AM if
> AMRMProxy is enabled, so the AM fails if it contacts the RM directly.
> This makes a rolling upgrade to federation impossible, because we
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; with it I can submit jobs both ways.
> My solution is to hold two tokens at the same time and choose the right one
> while building the RPC client.
> I tested this patch in situations such as AM recovery and NM recovery, and
> found no errors.
> But I still can't be sure this patch is good, so I wonder if there is a
> better solution.
>  






[jira] [Issue Comment Deleted] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-06-23 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Comment: was deleted

(was: https://issues.apache.org/jira/browse/YARN-9693)

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue already raised this problem, but it seems that I can't submit
> jobs through the federation client when using its patch.
> The root cause is that the NM sets a local AMRMToken for the AM if
> AMRMProxy is enabled, so the AM fails if it contacts the RM directly.
> This makes a rolling upgrade to federation impossible, because we
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; with it I can submit jobs both ways.
> My solution is to hold two tokens at the same time and choose the right one
> while building the RPC client.
> I tested this patch in situations such as AM recovery and NM recovery, and
> found no errors.
> But I still can't be sure this patch is good, so I wonder if there is a
> better solution.
>  






[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-06-23 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367921#comment-17367921
 ] 

Song Jiacheng commented on YARN-10794:
--

https://issues.apache.org/jira/browse/YARN-9693

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue already raised this problem, but it seems that I can't submit
> jobs through the federation client when using its patch.
> The root cause is that the NM sets a local AMRMToken for the AM if
> AMRMProxy is enabled, so the AM fails if it contacts the RM directly.
> This makes a rolling upgrade to federation impossible, because we
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; with it I can submit jobs both ways.
> My solution is to hold two tokens at the same time and choose the right one
> while building the RPC client.
> I tested this patch in situations such as AM recovery and NM recovery, and
> found no errors.
> But I still can't be sure this patch is good, so I wonder if there is a
> better solution.
>  






[jira] [Commented] (YARN-9049) Add application submit data to state store

2021-06-09 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360505#comment-17360505
 ] 

Song Jiacheng commented on YARN-9049:
-

Hi, [~botong], [~bibinchundatt]

I have developed a patch that persists ApplicationSubmissionContext in
ZooKeeper, but I wonder whether it will put too much pressure on ZooKeeper,
because as we know every NM holds a connection to it. Moreover, queries for
ApplicationSubmissionContext go to ZooKeeper directly and bypass the cache.

> Add application submit data to state store
> --
>
> Key: YARN-9049
> URL: https://issues.apache.org/jira/browse/YARN-9049
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Bibin Chundatt
>Priority: Major
> Attachments: YARN-9049.001.path
>
>
> As per the discussion in YARN-8898 we need to persist trimmed
> ApplicationSubmissionContext details to the federation State Store.






[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during rolling upgrade from 2.6 to 3.2

2021-06-01 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Summary: Graceful decomission cause NPE during rolling upgrade from 2.6 to 
3.2   (was: Graceful decomission cause NPE during Rolling upgrade from 2.6 to 
3.2 )

> Graceful decomission cause NPE during rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, 
> image-2021-05-31-10-37-31-795.png
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this
> exception while upgrading the NMs.
> When we exclude a node and call refreshNodes gracefully, all the MR AMs
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The cause is that we gracefully decommission nodes while the jobs are still
> using 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state
> "DECOMMISSIONING".
> So I added a config to decide whether DECOMMISSIONING should be sent to AMs.
> I don't know if this needs to be fixed; I am just offering a solution for this situation.
> !image-2021-05-31-10-32-17-541.png!
> There are 2 nodes in the cluster, and the AM is deployed on node 44. I
> excluded 46, the other node in the cluster, and then ran refreshNodes; the
> error above occurred.
> As I said, I think the root cause is the compatibility of
> NodeStateProto.
> !image-2021-05-31-10-37-31-795.png!
> 2.6 MR cannot recognize DECOMMISSIONING and SHUTDOWN.
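
The workaround described above (a config that decides whether the decommissioning state is sent to AMs) could look roughly like the following standalone sketch. Everything in it is an assumption made for illustration: the State enum is a simplified stand-in rather than YARN's real NodeState, and the class name, method name, and boolean flag are hypothetical. It is not the code from YARN-10791.v1.patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the attached patch.
final class NodeStateFilterSketch {

  // Simplified stand-in for node states; not the real YARN enum.
  enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED, LOST }

  // Returns the node updates to report to an AM. When the (hypothetical)
  // flag is false, DECOMMISSIONING updates are dropped so that an old
  // client never sees a state it does not know about.
  static List<State> updatesForAm(List<State> updated,
                                  boolean sendDecommissioning) {
    List<State> out = new ArrayList<>();
    for (State s : updated) {
      if (s == State.DECOMMISSIONING && !sendDecommissioning) {
        continue;
      }
      out.add(s);
    }
    return out;
  }

  public static void main(String[] args) {
    List<State> updated =
        Arrays.asList(State.RUNNING, State.DECOMMISSIONING, State.LOST);
    System.out.println(updatesForAm(updated, true));
    System.out.println(updatesForAm(updated, false));
  }
}
```

Filtering on the sending side keeps old AMs unchanged, at the cost that they only learn about the node once it reaches a state they already understand.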






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-31 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue already raised this problem, but it seems that I can't submit
jobs through the federation client when using its patch.

The root cause is that the NM sets a local AMRMToken for the AM if
AMRMProxy is enabled, so the AM fails if it contacts the RM directly.

This makes a rolling upgrade to federation impossible, because we
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; with it I can submit jobs both ways.

My solution is to hold two tokens at the same time and choose the right one
while building the RPC client.

I tested this patch in situations such as AM recovery and NM recovery, and
found no errors.

But I still can't be sure this patch is good, so I wonder if there is a better
solution.

 

  was:
Sorry for not knowing how to quote a issue...

https://issues.apache.org/jira/browse/YARN-9693

This issue has already raised this problem, but it seems that I can't submit 
job by the federation client while using the patch.

This problem makes it impossible to rolling upgrade to federation, cause we 
can't upgrade all the NMs and clients at one moment

So I developed another patch, using this patch I can submit jobs via the both 
ways.

I tested this patch in some situations like AM recover, NM recover, no error 
found.

But still, I can't ensure this patch is good, so i wonder if there is a better 
solution.

 


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue already raised this problem, but it seems that I can't submit
> jobs through the federation client when using its patch.
> The root cause is that the NM sets a local AMRMToken for the AM if
> AMRMProxy is enabled, so the AM fails if it contacts the RM directly.
> This makes a rolling upgrade to federation impossible, because we
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; with it I can submit jobs both ways.
> My solution is to hold two tokens at the same time and choose the right one
> while building the RPC client.
> I tested this patch in situations such as AM recovery and NM recovery, and
> found no errors.
> But I still can't be sure this patch is good, so I wonder if there is a
> better solution.
>  






[jira] [Comment Edited] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-31 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354305#comment-17354305
 ] 

Song Jiacheng edited comment on YARN-10786 at 5/31/21, 8:45 AM:


[~zhengchenyu] Thanks for the comment.

I set yarn.web-proxy.address to all the subcluster webapp addresses, so that
all the subclusters can access the AM pages.

I have thought about other solutions, but all of them change a lot and may
break some other rules.


was (Author: song jiacheng):
[~zhengchenyu] Thanks for the comment.
{panel:title=My title}
On the other hand, if we have more than one subcluster, this approach may not be good.
{panel}
I set yarn.web-proxy.address to all the subcluster webapp addresses, so that
all the subclusters can access the AM pages.

I have thought about other solutions, but all of them change a lot and may
break some other rules.

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> The cause is that the AM gets the proxy URI from the config
> yarn.web-proxy.address, and if that is not set, it gets the URI from
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an
> application, so I made this fix:
> 1. Add this config to the yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Change the way the config is read from Configuration#get to
> Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects
> applications only.
> Before the fix, clicking the AM link in the RM or Router:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
>  And after the fix, we can access the AM page as normal...
>  
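
Step 2 of the fix above boils down to treating the property value as a comma-separated list instead of a single string. The sketch below shows that splitting in plain Java so it runs without Hadoop on the classpath; the class ProxyAddressSketch and the method splitAddresses are hypothetical names, and the code only approximates what reading a comma-delimited property gives you. It is not the code from YARN-10786.v1.patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; plain Java, no Hadoop Configuration involved.
final class ProxyAddressSketch {

  // Splits a comma-separated property value such as "rm1:9088,rm2:9088"
  // into individual host:port entries, trimming whitespace and skipping
  // empty pieces. A null value yields an empty list.
  static List<String> splitAddresses(String value) {
    List<String> out = new ArrayList<>();
    if (value == null) {
      return out;
    }
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        out.add(trimmed);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (String address : splitAddresses("rm1:9088,rm2:9088")) {
      System.out.println(address);
    }
  }
}
```

Reading the value as a single string would hand "rm1:9088,rm2:9088" to the AM filter as one host, which is why the list-style read matters once several RMs are configured.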






[jira] [Commented] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-31 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354305#comment-17354305
 ] 

Song Jiacheng commented on YARN-10786:
--

[~zhengchenyu] Thanks for the comment.
{panel:title=My title}
On the other hand, if we have more than one subcluster, this approach may not be good.
{panel}
I set yarn.web-proxy.address to all the subcluster webapp addresses, so that
all the subclusters can access the AM pages.

I have thought about other solutions, but all of them change a lot and may
break some other rules.

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> The cause is that the AM gets the proxy URI from the config
> yarn.web-proxy.address, and if that is not set, it gets the URI from
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an
> application, so I made this fix:
> 1. Add this config to the yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Change the way the config is read from Configuration#get to
> Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects
> applications only.
> Before the fix, clicking the AM link in the RM or Router:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
>  And after the fix, we can access the AM page as normal...
>  
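
The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the proxy address is treated as a comma-separated list, and the RM webapp address is used only when that list is empty. The names ProxyAddressSketch, parseProxyAddresses and proxyHostsAndPorts are hypothetical, and the plain string splitting stands in for Configuration#getStrings; it is not the code from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lookup described in the issue; not the real WebAppUtils code.
class ProxyAddressSketch {

    // Reads a comma-separated config value as a list of trimmed, non-empty entries.
    static List<String> parseProxyAddresses(String configured) {
        List<String> result = new ArrayList<>();
        if (configured == null) {
            return result;
        }
        for (String part : configured.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Uses the proxy address list if present, otherwise falls back to the RM webapp address.
    static List<String> proxyHostsAndPorts(String proxyAddress, String rmWebappAddress) {
        List<String> proxies = parseProxyAddresses(proxyAddress);
        if (proxies.isEmpty()) {
            proxies = parseProxyAddresses(rmWebappAddress);
        }
        return proxies;
    }

    public static void main(String[] args) {
        // Both subcluster RMs are accepted as proxies when the list is configured.
        System.out.println(proxyHostsAndPorts("rm1:9088,rm2:9088", "rm1:8088"));
        // Without the proxy config, only the single RM webapp address is used.
        System.out.println(proxyHostsAndPorts(null, "rm1:8088"));
    }
}
```

With a single-valued read, the string "rm1:9088,rm2:9088" would be treated as one address, which is why the issue switches to a list-valued read.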






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-31 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue has already raised this problem, but it seems that I can't submit 
jobs through the federation client while using the patch.

This problem makes a rolling upgrade to federation impossible, because we 
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; using this patch I can submit jobs both 
ways.

I tested this patch in situations like AM recovery and NM recovery, and found 
no errors.

But still, I can't be sure this patch is good, so I wonder if there is a better 
solution.

 

  was:
Sorry for not knowing how to quote a issue...

https://issues.apache.org/jira/browse/YARN-9693

This issue has already raised this problem, but it seems that I can't submit 
job by the federation client while using the patch.

This problem makes it impossible to rolling upgrade to federation, cause we 
can't upgrade all the NMs and clients at one moment

So I developed another patch, using this I can submit jobs via the both ways.

I tested this in some situations like AM recover, NM recover, no error found.

But still, I can't ensure this patch is good, so i wonder if there is a better 
solution.

 


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs through the federation client while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this patch I can submit jobs both 
> ways.
> I tested this patch in situations like AM recovery and NM recovery, and found 
> no errors.
> But still, I can't be sure this patch is good, so I wonder if there is a 
> better solution.
>  






[jira] [Commented] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-31 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354280#comment-17354280
 ] 

Song Jiacheng commented on YARN-10794:
--

I committed a patch based on the trunk.

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch, YARN-10794.v2.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs through the federation client while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this I can submit jobs both ways.
> I tested this in situations like AM recovery and NM recovery, and found no errors.
> But still, I can't be sure this patch is good, so I wonder if there is a 
> better solution.
>  






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue has already raised this problem, but it seems that I can't submit 
jobs through the federation client while using the patch.

This problem makes a rolling upgrade to federation impossible, because we 
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; using this I can submit jobs both ways.

I tested this in situations like AM recovery and NM recovery, and found no errors.

But still, I can't be sure this patch is good, so I wonder if there is a better 
solution.

 

  was:
Sorry for not knowing how to quote a issue...

https://issues.apache.org/jira/browse/YARN-9693

This issue has already raised this problem, but it seems that I can't submit 
job by the federation client while using the patch.

This problem makes it impossible to rolling upgrade to federation, cause we 
can't upgrade all the NMs and clients at one moment

So I developed another patch, using this I can submit jobs via the both ways.

I tested some situations like AM recover, NM recover, no error found.

But still, I can't ensure this patch is good, so i wonder if there is a better 
solution.

 


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs through the federation client while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this I can submit jobs both ways.
> I tested this in situations like AM recovery and NM recovery, and found no errors.
> But still, I can't be sure this patch is good, so I wonder if there is a 
> better solution.
>  






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue has already raised this problem, but it seems that I can't submit 
jobs through the federation client while using the patch.

This problem makes a rolling upgrade to federation impossible, because we 
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; using this I can submit jobs both ways.

I tested situations like AM recovery and NM recovery, and found no errors.

But still, I can't be sure this patch is good, so I wonder if there is a better 
solution.

 

  was:
Sorry for not knowing how to quote a issue...

https://issues.apache.org/jira/browse/YARN-9693

This issue has already raised this problem, but it seems that I can't submit 
job by the federation client while using the patch.

This problem makes it impossible to rolling upgrade to federation, cause we 
can't upgrade all the NMs and clients at one moment

So I developed another patch, using this I can submit jobs via the both ways.

I tested some situations like AM recover, NM recover, no error found.

Still, I can't ensure this patch is good, so i wonder if there is a better 
solution.

 


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs through the federation client while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this I can submit jobs both ways.
> I tested situations like AM recovery and NM recovery, and found no errors.
> But still, I can't be sure this patch is good, so I wonder if there is a 
> better solution.
>  






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue has already raised this problem, but it seems that I can't submit 
jobs through the federation client while using the patch.

This problem makes a rolling upgrade to federation impossible, because we 
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; using this I can submit jobs both ways.

I tested situations like AM recovery and NM recovery, and found no errors.

Still, I can't be sure this patch is good, so I wonder if there is a better 
solution.

 

  was:
Sorry for not knowing how to quote a issue...

https://issues.apache.org/jira/browse/YARN-9693

This issue has already raised this problem, but it seems that I can't submit 
job to federation while using the patch.

This problem makes it impossible to rolling upgrade to federation, cause we 
can't upgrade all the NMs and clients at one moment

So I developed another patch, using this I can submit jobs via the both ways.

I tested some situations like AM recover, NM recover, no error found.

Still, I can't ensure this patch is good, so i wonder if there is a better 
solution.

 


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
> Attachments: YARN-10794.v1.patch
>
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs through the federation client while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this I can submit jobs both ways.
> I tested situations like AM recovery and NM recovery, and found no errors.
> Still, I can't be sure this patch is good, so I wonder if there is a better 
> solution.
>  






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: 
Sorry for not knowing how to quote an issue...

https://issues.apache.org/jira/browse/YARN-9693

That issue has already raised this problem, but it seems that I can't submit 
jobs to federation while using the patch.

This problem makes a rolling upgrade to federation impossible, because we 
can't upgrade all the NMs and clients at the same moment.

So I developed another patch; using this I can submit jobs both ways.

I tested situations like AM recovery and NM recovery, and found no errors.

Still, I can't be sure this patch is good, so I wonder if there is a better 
solution.

 

  was:https://issues.apache.org/jira/browse/YARN-9693


> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>
> Sorry for not knowing how to quote an issue...
> https://issues.apache.org/jira/browse/YARN-9693
> That issue has already raised this problem, but it seems that I can't submit 
> jobs to federation while using the patch.
> This problem makes a rolling upgrade to federation impossible, because we 
> can't upgrade all the NMs and clients at the same moment.
> So I developed another patch; using this I can submit jobs both ways.
> I tested situations like AM recovery and NM recovery, and found no errors.
> Still, I can't be sure this patch is good, so I wonder if there is a better 
> solution.
>  






[jira] [Updated] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10794:
-
Description: https://issues.apache.org/jira/browse/YARN-9693

> Submitting jobs to a single subcluster will fail while AMRMProxy is enabled
> ---
>
> Key: YARN-10794
> URL: https://issues.apache.org/jira/browse/YARN-10794
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>
> https://issues.apache.org/jira/browse/YARN-9693






[jira] [Created] (YARN-10794) Submitting jobs to a single subcluster will fail while AMRMProxy is enabled

2021-05-30 Thread Song Jiacheng (Jira)
Song Jiacheng created YARN-10794:


 Summary: Submitting jobs to a single subcluster will fail while 
AMRMProxy is enabled
 Key: YARN-10794
 URL: https://issues.apache.org/jira/browse/YARN-10794
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.2.1
Reporter: Song Jiacheng









[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Description: 
We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
exception while upgrading the NMs.

When we exclude a node and run a graceful refreshNodes, all the MR AMs 
fail.

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
 java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

The reason is that we gracefully decommission nodes while using 2.6 MR.

handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING".

So I added a config to decide whether we should send DECOMMISSIONING to AMs.

I don't know if it needs to be fixed; I am just proposing a solution for this 
situation.

!image-2021-05-31-10-32-17-541.png!

There are 2 nodes in the cluster, and the AM is deployed on node 44. I excluded 
node 46, the other node in the cluster, and then ran refreshNodes; the error 
above occurred.

As I said, I think the root cause is the compatibility of 
NodeStateProto.

!image-2021-05-31-10-37-31-795.png!

2.6 MR cannot recognize DECOMMISSIONING and SHUTDOWN.
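
One plausible reading of the failure above can be sketched in plain Java. This is an illustration only, under the assumption that a node state unknown to the old client decodes to null and is then dereferenced. The enum OldNodeState, its value set, the wire number 6 and all method names are hypothetical, not the actual 2.6 MR code.

```java
// Hypothetical illustration of an old client meeting a newer node state; not Hadoop code.
class NodeStateCompatSketch {

    // Assumed "old client" view of node states: it predates the newer values.
    enum OldNodeState {
        NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED;

        boolean isUnusable() {
            return this == UNHEALTHY || this == DECOMMISSIONED || this == LOST;
        }
    }

    // Decodes a wire number the way an old client might: unknown numbers yield null.
    static OldNodeState decode(int wireNumber) {
        OldNodeState[] known = OldNodeState.values();
        return (wireNumber >= 0 && wireNumber < known.length) ? known[wireNumber] : null;
    }

    // Unguarded handling: dereferences the decoded state directly.
    static boolean unusableUnguarded(int wireNumber) {
        return decode(wireNumber).isUnusable(); // NullPointerException for an unknown state
    }

    // Guarded variant: an unknown state is simply treated as "not unusable".
    static boolean unusableGuarded(int wireNumber) {
        OldNodeState state = decode(wireNumber);
        return state != null && state.isUnusable();
    }

    public static void main(String[] args) {
        int newerState = 6; // hypothetical number for a state the old enum lacks
        System.out.println("guarded: " + unusableGuarded(newerState));
        try {
            unusableUnguarded(newerState);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }
    }
}
```

A null check in the old client would avoid the exception, but old AMs cannot be patched during a rolling upgrade, which is presumably why the fix is proposed on the sending side.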

  was:
We are rolling upgrading Yarn from 2.6.0 to 3.2.1, and we met this Exception 
while we upgrading NM.

When we exclude a node and call refreshNode gracefully, All the MR AMs will 
fail.  

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
 java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

The reason of this is because we gracefully decomission nodes while using 2.6MR.

handleUpdatedNodes of 2.6MR can not recognize the node state of "DECOMMISONING"

So I add a config to decide if we should send the DECOMMISONING to AMs

I don't know if it needs to be fixed, just raise a solution for this situation


> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, 
> image-2021-05-31-10-37-31-795.png
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The reason is that we gracefully decommission nodes while using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether we should send DECOMMISSIONING to AMs.
> I don't know if it needs to be fixed; I am just proposing a solution for this 
> situation.
> !image-2021-05-31-10-32-17-541.png!
> There are 2 nodes in the cluster, and the AM is deployed on node 44. I 
> excluded node 46, the other node in the cluster, and then ran refreshNodes; 
> the error above occurred.
> As I said, I think the root cause is the compatibility of 
> NodeStateProto.
> !image-2021-05-31-10-37-31-795.png!
> 2.6 MR  can not recognize 

[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Attachment: image-2021-05-31-10-37-31-795.png

> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, 
> image-2021-05-31-10-37-31-795.png
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The reason is that we gracefully decommission nodes while using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether we should send DECOMMISSIONING to AMs.
> I don't know if it needs to be fixed; I am just proposing a solution for this 
> situation.






[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-30 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Attachment: image-2021-05-31-10-32-17-541.png

> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The reason is that we gracefully decommission nodes while using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether we should send DECOMMISSIONING to AMs.
> I don't know if it needs to be fixed; I am just proposing a solution for this 
> situation.






[jira] [Commented] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-30 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354182#comment-17354182
 ] 

Song Jiacheng commented on YARN-10791:
--

[~epayne], thanks for the comment.

I understand what you mean, but this error was raised in all the MR AMs, not 
just on the node I excluded.

Maybe my description was not detailed enough; I'll add some details.

> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The reason is that we gracefully decommission nodes while using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether we should send DECOMMISSIONING to AMs.
> I don't know if it needs to be fixed; I am just proposing a solution for this 
> situation.
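
The config-based workaround described above could look roughly like the sketch below. It is a minimal illustration, assuming a boolean flag that controls whether node updates in the DECOMMISSIONING state are forwarded to AMs. The class, the NodeUpdate type and the sendDecommissioning flag are all hypothetical; the actual config name and code are in the attached patch and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of gating DECOMMISSIONING updates behind a flag; not the attached patch.
class DecommissioningFilterSketch {

    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    // Minimal stand-in for a node report sent to an AM.
    static final class NodeUpdate {
        final String host;
        final NodeState state;

        NodeUpdate(String host, NodeState state) {
            this.host = host;
            this.state = state;
        }
    }

    // Returns the updates to forward to AMs. When sendDecommissioning is false,
    // DECOMMISSIONING updates are dropped so older AMs never see that state.
    static List<NodeUpdate> updatesForAms(List<NodeUpdate> updates, boolean sendDecommissioning) {
        List<NodeUpdate> result = new ArrayList<>();
        for (NodeUpdate update : updates) {
            if (update.state == NodeState.DECOMMISSIONING && !sendDecommissioning) {
                continue;
            }
            result.add(update);
        }
        return result;
    }

    public static void main(String[] args) {
        List<NodeUpdate> updates = new ArrayList<>();
        updates.add(new NodeUpdate("node44", NodeState.RUNNING));
        updates.add(new NodeUpdate("node46", NodeState.DECOMMISSIONING));
        // Flag off: only the RUNNING update is forwarded.
        System.out.println(updatesForAms(updates, false).size());
        // Flag on: both updates are forwarded.
        System.out.println(updatesForAms(updates, true).size());
    }
}
```

Leaving the flag off during the rolling upgrade and turning it on once no 2.6 MR jobs remain would be one way to use such a switch.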






[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-28 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Description: 
We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
exception while upgrading the NMs.

When we exclude a node and run a graceful refreshNodes, all the MR AMs 
fail.

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
 java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

The reason is that we gracefully decommission nodes while using 2.6 MR.

handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING".

So I added a config to decide whether we should send DECOMMISSIONING to AMs.

I don't know if it needs to be fixed; I am just proposing a solution for this situation.

  was:
We are rolling upgrading Yarn from 2.6.0 to 3.2.1, and we met this Exception 
while we upgrading NM.

When we exclude a node and call refreshNode gracefully, All the MR AMs will 
fail.  

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
 java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

The reason of this is because we gracefully decomission nodes while using 2.6MR.

handleUpdatedNodes of 2.6MR can not recognize the node state of "DECOMMISONING"

So I add a config to decide if we should send the DECOMMISONING to AMs

I don't know if it is a bug, just raise a solution for this situation


> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN 
> CONTACTING RM. 
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> The reason of this is because we gracefully decomission nodes while using 
> 2.6MR.
> handleUpdatedNodes of 2.6MR can not recognize the node state of 
> "DECOMMISONING"
> So I add a config to decide if we should send the DECOMMISONING to AMs
> I don't know if it needs to be fixed, just raise a solution for this situation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-28 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Description: 
We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
exception while upgrading the NMs.

When we exclude a node and call refreshNodes gracefully, all the MR AMs fail.

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

This happens because we gracefully decommission nodes while still using 2.6 MR.

handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING".

So I added a config to decide whether the RM sends DECOMMISSIONING to the AMs.

I don't know if it is a bug; I am just proposing a solution for this situation.
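
The config described above could be sketched roughly as follows. This is only an illustration and not the attached patch: NodeState, NodeUpdate, UpdatedNodesFilter and the sendDecommissioning flag are hypothetical stand-ins, not the real Hadoop classes or a real property name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the YARN node state enum; not the real Hadoop type.
enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED, LOST }

// Hypothetical stand-in for one entry of the "updated nodes" list sent to an AM.
final class NodeUpdate {
  final String nodeId;
  final NodeState state;

  NodeUpdate(String nodeId, NodeState state) {
    this.nodeId = nodeId;
    this.state = state;
  }
}

final class UpdatedNodesFilter {
  /**
   * Returns the updates that would be handed to an AM. When
   * sendDecommissioning is false, DECOMMISSIONING updates are dropped so an
   * older AM never receives a state it does not know about.
   */
  static List<NodeUpdate> forAm(List<NodeUpdate> updates,
                                boolean sendDecommissioning) {
    List<NodeUpdate> out = new ArrayList<>();
    for (NodeUpdate u : updates) {
      if (!sendDecommissioning && u.state == NodeState.DECOMMISSIONING) {
        continue; // hidden from the AM by configuration
      }
      out.add(u);
    }
    return out;
  }

  public static void main(String[] args) {
    List<NodeUpdate> updates = Arrays.asList(
        new NodeUpdate("nm1:8041", NodeState.RUNNING),
        new NodeUpdate("nm2:8041", NodeState.DECOMMISSIONING));
    System.out.println("flag off: " + forAm(updates, false).size());
    System.out.println("flag on:  " + forAm(updates, true).size());
  }
}
```

With the flag off during the rolling upgrade, 2.6 AMs would only see states they already understand; once all AMs are on the new version the flag could be turned back on.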

  was:
We are rolling upgrading Yarn from 2.6.0 to 3.2.1, and we met this Exception 
while we upgrading NM.

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING 
RM. 
java.lang.NullPointerException
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

The reason of this is because we gracefully decomission nodes while using 2.6MR.

handleUpdatedNodes of 2.6MR can not recognize the node state of "DECOMMISONING"



So I add a config to decide if we should send the DECOMMISONING to AMs

I don't know if it is a bug, just raise a solution for this situation


> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
> Attachments: YARN-10791.v1.patch
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> When we exclude a node and call refreshNodes gracefully, all the MR AMs 
> fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> This happens because we gracefully decommission nodes while still using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether the RM sends DECOMMISSIONING to the AMs.
> I don't know if it is a bug; I am just proposing a solution for this 
> situation.






[jira] [Updated] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-27 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10791:
-
Priority: Minor  (was: Major)

> Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 
> --
>
> Key: YARN-10791
> URL: https://issues.apache.org/jira/browse/YARN-10791
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Minor
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
> exception while upgrading the NMs.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>  at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>  at java.lang.Thread.run(Thread.java:745)
> This happens because we gracefully decommission nodes while still using 
> 2.6 MR.
> handleUpdatedNodes in 2.6 MR cannot recognize the node state 
> "DECOMMISSIONING".
> So I added a config to decide whether the RM sends DECOMMISSIONING to the AMs.
> I don't know if it is a bug; I am just proposing a solution for this 
> situation.






[jira] [Created] (YARN-10791) Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2

2021-05-27 Thread Song Jiacheng (Jira)
Song Jiacheng created YARN-10791:


 Summary: Graceful decomission cause NPE during Rolling upgrade 
from 2.6 to 3.2 
 Key: YARN-10791
 URL: https://issues.apache.org/jira/browse/YARN-10791
 Project: Hadoop YARN
  Issue Type: Bug
  Components: RM
Affects Versions: 3.2.1
Reporter: Song Jiacheng


We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this 
exception while upgrading the NMs.

2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
 at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
 at java.lang.Thread.run(Thread.java:745)

This happens because we gracefully decommission nodes while still using 2.6 MR.

handleUpdatedNodes in 2.6 MR cannot recognize the node state "DECOMMISSIONING".

So I added a config to decide whether the RM sends DECOMMISSIONING to the AMs.

I don't know if it is a bug; I am just proposing a solution for this situation.
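
One possible reading of the trace, stated here as an assumption and not confirmed in this issue, is that the older AM decodes a state it does not know as null and then dereferences it. The sketch below models only that pattern; OldNodeState, its decode method and OldAmSketch are hypothetical and do not reproduce the real 2.6 code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of an older client that predates DECOMMISSIONING.
enum OldNodeState {
  RUNNING, DECOMMISSIONED, LOST;

  boolean isUnusable() {
    return this != RUNNING;
  }

  // Assumed decoding behaviour: a name this version does not know becomes null.
  static OldNodeState decode(String wireName) {
    for (OldNodeState s : values()) {
      if (s.name().equals(wireName)) {
        return s;
      }
    }
    return null;
  }
}

final class OldAmSketch {
  // Failing pattern: the decoded state is dereferenced unconditionally.
  static int countUnusableUnsafe(List<String> wireStates) {
    int n = 0;
    for (String w : wireStates) {
      if (OldNodeState.decode(w).isUnusable()) { // NPE when decode returns null
        n++;
      }
    }
    return n;
  }

  // Null-safe variant: states the client cannot decode are skipped.
  static int countUnusableSafe(List<String> wireStates) {
    int n = 0;
    for (String w : wireStates) {
      OldNodeState s = OldNodeState.decode(w);
      if (s != null && s.isUnusable()) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    List<String> states = Arrays.asList("RUNNING", "DECOMMISSIONING", "LOST");
    System.out.println("safe count: " + countUnusableSafe(states));
    try {
      countUnusableUnsafe(states);
    } catch (NullPointerException e) {
      System.out.println("unsafe variant threw NullPointerException");
    }
  }
}
```

A guard like this would have to ship in the AM, which is not possible for jobs already running on 2.6, so an RM-side switch is the option available during the upgrade.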






[jira] [Commented] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351007#comment-17351007
 ] 

Song Jiacheng commented on YARN-10786:
--

[~zhuqi], thanks for the review!

:D

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
This happens because the AM gets the proxy URI from the config 
yarn.web-proxy.address and, if that is not set, falls back to 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:

1. Add this config to yarn-site.xml on the client.

<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>

2. Read the config with Configuration#getStrings instead of 
Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.

With that, I can access the AM page now.
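
The difference between the two lookups can be sketched without Hadoop on the classpath. Configuration#get returns the raw string, while Configuration#getStrings returns the comma-delimited values as an array. The ProxyAddressList class below is a hypothetical stand-in that mimics the splitting with plain Java; it is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class ProxyAddressList {
  /**
   * Splits a comma-separated "host:port" list into its entries, trimming
   * whitespace and skipping empty items. A null input yields an empty list.
   */
  static List<String> parse(String raw) {
    List<String> out = new ArrayList<>();
    if (raw == null) {
      return out;
    }
    for (String part : raw.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        out.add(trimmed);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Treated as one string, "rm1:9088,rm2:9088" is not a valid host:port.
    // Treated as a list, each RM becomes its own proxy address.
    for (String address : parse("rm1:9088, rm2:9088")) {
      System.out.println(address);
    }
  }
}
```

Reading the value as a list lets the AM filter accept requests proxied by either RM, which matters because the home cluster is not known in advance.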

This config needs to be added on the client side, so it affects applications 
only.

Before the fix, clicking the AM link in the RM or Router shows:

!v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!

After the fix, we can access the AM page as normal.

 

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and if it does not exist, it will get the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:

1. Add this config in the yarn-site.xml on client.


 yarn.web-proxy.address
 rm1:9088,rm2:9088
 

2. Change the way to get the config from Configuration#get to 
Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter.

So that I can access the AM page now.

This config need to be added in the client side, so it will affect application 
only.

Before fixing, click the AM link in RM or Router:

!v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!

 And after the fix, we can access the AM page as normal...

 


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Attachment: n_v25156273211c049f8b396dcf15fcd9a84.png

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
This happens because the AM gets the proxy URI from the config 
yarn.web-proxy.address and, if that is not set, falls back to 
yarn.resourcemanager.webapp.address.
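
The lookup order described above can be sketched as follows. A plain Map stands in for the Hadoop Configuration object, and ProxyAddressLookup and its resolve method are hypothetical names used only for illustration; the two property keys are the ones named in the description.

```java
import java.util.HashMap;
import java.util.Map;

final class ProxyAddressLookup {
  static final String PROXY_KEY = "yarn.web-proxy.address";
  static final String RM_WEBAPP_KEY = "yarn.resourcemanager.webapp.address";

  /**
   * Returns the proxy address if it is set and non-blank, otherwise the RM
   * webapp address. Returns null when neither key is present.
   */
  static String resolve(Map<String, String> conf) {
    String proxy = conf.get(PROXY_KEY);
    if (proxy != null && !proxy.trim().isEmpty()) {
      return proxy.trim();
    }
    return conf.get(RM_WEBAPP_KEY);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(RM_WEBAPP_KEY, "rm1:8088");
    System.out.println("without proxy key: " + resolve(conf));

    conf.put(PROXY_KEY, "rm1:9088,rm2:9088");
    System.out.println("with proxy key:    " + resolve(conf));
  }
}
```

In a federated setup the fallback names a single RM, which may not be the application's home cluster, so setting the proxy key explicitly on the client avoids relying on it.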

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:

1. Add this config to yarn-site.xml on the client.

<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>

2. Read the config with Configuration#getStrings instead of 
Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.

With that, I can access the AM page now.

This config needs to be added on the client side, so it affects applications 
only.

Before the fix, clicking the AM link in the RM or Router shows:

!v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!

After the fix, we can access the AM page as normal.

 

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and if it does not exist, it will get the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:

1. Add this config in the yarn-site.xml on client.


 yarn.web-proxy.address
 rm1:9088,rm2:9088
 

2. Change the way to get the config from Configuration#get to 
Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter.

So that I can access the AM page now.

This config need to be added in the client side, so it will affect application 
only.

Before fixing, click the AM link in RM or Router:

!v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!

 And after the fix, we can access the AM page as normal...


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> n_v25156273211c049f8b396dcf15fcd9a84.png, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Commented] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350910#comment-17350910
 ] 

Song Jiacheng commented on YARN-10786:
--

Hi [~zhuqi], I have attached the error page as it appeared before the fix.

After the fix, we can access the AM page as normal.

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
This happens because the AM gets the proxy URI from the config 
yarn.web-proxy.address and, if that is not set, falls back to 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:

1. Add this config to yarn-site.xml on the client.

<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>

2. Read the config with Configuration#getStrings instead of 
Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.

With that, I can access the AM page now.

This config needs to be added on the client side, so it affects applications 
only.

Before the fix, clicking the AM link in the RM or Router shows:

!v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!

After the fix, we can access the AM page as normal.

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and if it does not exist, it will get the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:

1. Add this config in the yarn-site.xml on client.


 yarn.web-proxy.address
 rm1:9088,rm2:9088
 

2. Change the way to get the config from Configuration#get to 
Configuration#getStrings in WebAppUtils#getProxyHostsAndPortsForAmFilter.

So that I can access the AM page now.

This config need to be added in the client side, so it will affect application 
only.


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.
> Before the fix, clicking the AM link in the RM or Router shows:
> !v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png!
> After the fix, we can access the AM page as normal.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Attachment: 
v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png

> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch, 
> v1.1_dataplat_Hadoop平台_HDP3_2_1版本升级_YARN_26_Federation严重BUG--无法查看AM_WebHome_1621584160759-478.png
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
This happens because the AM gets the proxy URI from the config 
yarn.web-proxy.address and, if that is not set, falls back to 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:

1. Add this config to yarn-site.xml on the client.

<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>

2. Read the config with Configuration#getStrings instead of 
Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.

With that, I can access the AM page now.

This config needs to be added on the client side, so it affects applications 
only.

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and if it does not exist, it will get the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:


 yarn.web-proxy.address
 rm1:9088,rm2:9088
 

And then gets the config with Configuration#getStrings.

So that I can access the AM page now.

This config need to be added in the client side, so it will affect application 
only.


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
> Fix For: 3.2.1
>
> Attachments: YARN-10786.v1.patch
>
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 1. Add this config to yarn-site.xml on the client.
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> 2. Read the config with Configuration#getStrings instead of 
> Configuration#get in WebAppUtils#getProxyHostsAndPortsForAmFilter.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
This happens because the AM gets the proxy URI from the config 
yarn.web-proxy.address and, if that is not set, falls back to 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:

<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>

And then read the config with Configuration#getStrings.

With that, I can access the AM page now.

This config needs to be added on the client side, so it affects applications 
only.

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and if it does not exist, it will get the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:


 yarn.web-proxy.address
 rm1:9088,rm2:9088
 

And then gets the config with Configuration#getStrings.

So that I can access the AM page now


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
>
> This happens because the AM gets the proxy URI from the config 
> yarn.web-proxy.address and, if that is not set, falls back to 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
> And then read the config with Configuration#getStrings.
> With that, I can access the AM page now.
> This config needs to be added on the client side, so it affects 
> applications only.






[jira] [Updated] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated YARN-10786:
-
Description: 
The reason for this is that the AM gets the proxy URI from the config 
yarn.web-proxy.address, and if it does not exist, it gets the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:


<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>
 

The config is then read with Configuration#getStrings, so I can now 
access the AM page.

  was:
The reason of this is that AM gets the proxy URI from config 
yarn.web-proxy.address, and it does not exist, it will gets the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I do this fix:


 yarn.web-proxy.address
 rm1:9088,rm2:9088


And then gets the config with Configuration#getStrings.

So that I can access the AM page now


> Federation:We can't access the AM page while using federation
> -
>
> Key: YARN-10786
> URL: https://issues.apache.org/jira/browse/YARN-10786
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.2.1
>Reporter: Song Jiacheng
>Priority: Major
>  Labels: federation
>
> The reason for this is that the AM gets the proxy URI from the config 
> yarn.web-proxy.address, and if it does not exist, it gets the URI from 
> yarn.resourcemanager.webapp.address.
> But in federation, we don't know which RM will be the home cluster of an 
> application, so I made this fix:
> 
> <property>
>  <name>yarn.web-proxy.address</name>
>  <value>rm1:9088,rm2:9088</value>
> </property>
>  
> The config is then read with Configuration#getStrings, so I can now 
> access the AM page.






[jira] [Created] (YARN-10786) Federation:We can't access the AM page while using federation

2021-05-25 Thread Song Jiacheng (Jira)
Song Jiacheng created YARN-10786:


 Summary: Federation:We can't access the AM page while using 
federation
 Key: YARN-10786
 URL: https://issues.apache.org/jira/browse/YARN-10786
 Project: Hadoop YARN
  Issue Type: Bug
  Components: federation
Affects Versions: 3.2.1
Reporter: Song Jiacheng


The reason for this is that the AM gets the proxy URI from the config 
yarn.web-proxy.address, and if it does not exist, it gets the URI from 
yarn.resourcemanager.webapp.address.

But in federation, we don't know which RM will be the home cluster of an 
application, so I made this fix:


<property>
 <name>yarn.web-proxy.address</name>
 <value>rm1:9088,rm2:9088</value>
</property>


The config is then read with Configuration#getStrings, so I can now 
access the AM page.
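
The lookup described above can be sketched roughly as follows. This is only
an illustration, not the actual patch: it uses a plain Map as a stand-in for
Hadoop's Configuration so that it runs without Hadoop on the classpath, and
the class name ProxyAddressSketch and the helpers splitAddresses and
resolveProxyAddresses are hypothetical. Trimming whitespace around each entry
is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the lookup; not the YARN-10786 patch itself.
public class ProxyAddressSketch {
    static final String PROXY_KEY = "yarn.web-proxy.address";
    static final String RM_WEBAPP_KEY = "yarn.resourcemanager.webapp.address";

    // Split a comma-separated value into trimmed, non-empty entries,
    // roughly what reading the key as a string array would give.
    static List<String> splitAddresses(String raw) {
        List<String> out = new ArrayList<>();
        if (raw == null) {
            return out;
        }
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // Prefer the proxy addresses; fall back to the RM webapp address
    // when the proxy key is missing or empty.
    static List<String> resolveProxyAddresses(Map<String, String> conf) {
        List<String> proxies = splitAddresses(conf.get(PROXY_KEY));
        return proxies.isEmpty()
            ? splitAddresses(conf.get(RM_WEBAPP_KEY))
            : proxies;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PROXY_KEY, "rm1:9088,rm2:9088");
        // Prints [rm1:9088, rm2:9088]
        System.out.println(resolveProxyAddresses(conf));
    }
}
```

With one entry per RM, every RM that might become the home cluster of the
application appears in the resolved list.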


