[jira] [Created] (YARN-10299) TimeLine Service V1.5 use levelDB as backend storage will crash when data scale amount to 100GB

2020-06-01 Thread aimahou (Jira)
aimahou created YARN-10299:
--

 Summary: TimeLine Service V1.5 use levelDB as backend storage will 
crash when data scale amount to 100GB
 Key: YARN-10299
 URL: https://issues.apache.org/jira/browse/YARN-10299
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineservice
Affects Versions: 3.1.1
Reporter: aimahou


h2. Issue:

Timeline Service v1.5 crashes when it uses LevelDB as backend storage and the 
data volume reaches 100 GB.
h2. *Specific exception:*

2020-04-24 16:06:59,914 INFO  applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore (ApplicationHistoryManagerOnTimelineStore.java:generateApplicationReport(691)) - No application attempt found for application_1587696012637_1143. Use a placeholder for its latest attempt id.
2020-04-24 16:06:59,914 INFO  applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore (ApplicationHistoryManagerOnTimelineStore.java:generateApplicationReport(691)) - No application attempt found for application_1587696012637_1143. Use a placeholder for its latest attempt id.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1587696012637_1143_01 doesn't exist in the timeline store
 at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:183)
 at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:677)
 at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplications(ApplicationHistoryManagerOnTimelineStore.java:128)
 at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getApplications(ApplicationHistoryClientService.java:195)
 at org.apache.hadoop.yarn.server.webapp.AppsBlock.getApplicationReport(AppsBlock.java:129)
 at org.apache.hadoop.yarn.server.webapp.AppsBlock.fetchData(AppsBlock.java:114)
 at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:137)
 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
 at org.apache.hadoop.yarn.webapp.View.render(View.java:243)
 at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
 at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
 at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
 at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
 at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
 at org.apache.hadoop.yarn.webapp.Dispatcher.render(Dispatcher.java:206)
 at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:165)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
 at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
 at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
 at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
 at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
 at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
 at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
 at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
 at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
 at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
 at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
 at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
 at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
 at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
 at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
 at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
 at 

[jira] [Created] (YARN-10298) TimeLine entity information only stored in one region when use apache HBase as backend storage

2020-06-01 Thread aimahou (Jira)
aimahou created YARN-10298:
--

 Summary: TimeLine entity information only stored in one region 
when use apache HBase as backend storage
 Key: YARN-10298
 URL: https://issues.apache.org/jira/browse/YARN-10298
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: ATSv2, timelineservice
Affects Versions: 3.1.1
Reporter: aimahou


h2. Issue

TimeLine entity information only stored in one region when use apache HBase as 
backend storage
h2. Probable cause

We found in the source code that, when the HBase timeline writer stores 
timeline entity info, the rowKey is composed of clusterId, userId, flowName, 
flowRunId and appId. Because HBase sorts rows in lexicographic (dictionary) 
order of the rowKey, entities that share these leading fields sort next to 
each other, which is probably why timeline entities end up in only one region 
or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java

public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {

...

boolean isApplication = ApplicationEntity.isApplicationEntity(te);
byte[] rowKey;
if (isApplication) {
 ApplicationRowKey applicationRowKey =
 new ApplicationRowKey(clusterId, userId, flowName, flowRunId,
 appId);
 rowKey = applicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);
} else {
 EntityRowKey entityRowKey =
 new EntityRowKey(clusterId, userId, flowName, flowRunId, appId,
 te.getType(), te.getIdPrefix(), te.getId());
 rowKey = entityRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.ENTITY_TABLE);
}

if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) {
 SubApplicationRowKey subApplicationRowKey =
 new SubApplicationRowKey(subApplicationUser, clusterId,
 te.getType(), te.getIdPrefix(), te.getId(), userId);
 rowKey = subApplicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);
}

...

}
h2. Suggestion

We could use a hash of the original rowKey as the stored rowKey when writing 
and reading timeline entity data.
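
As a rough illustration of the suggestion, the sketch below prefixes the original rowKey with a one-byte bucket derived from a hash of its content, instead of replacing the key outright, so that the original key stays recoverable. This is a variation on the idea, not the proposed patch: SaltedRowKey, bucketOf, salt and unsalt are hypothetical names, not part of Hadoop or HBase, and the bucket count is an arbitrary assumption.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Hadoop or HBase. */
final class SaltedRowKey {
  private SaltedRowKey() {}

  /** Returns a bucket in [0, buckets) computed from the key's bytes. */
  static byte bucketOf(byte[] originalKey, int buckets) {
    if (buckets < 1 || buckets > 128) {
      // Keep the bucket within a non-negative signed byte.
      throw new IllegalArgumentException("buckets must be in [1, 128]");
    }
    int hash = Arrays.hashCode(originalKey);
    return (byte) Math.floorMod(hash, buckets);
  }

  /** Prepends the bucket byte so that similar keys spread over key ranges. */
  static byte[] salt(byte[] originalKey, int buckets) {
    byte[] salted = new byte[originalKey.length + 1];
    salted[0] = bucketOf(originalKey, buckets);
    System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
    return salted;
  }

  /** Strips the bucket byte and returns the original key. */
  static byte[] unsalt(byte[] saltedKey) {
    return Arrays.copyOfRange(saltedKey, 1, saltedKey.length);
  }
}
```

With a scheme like this, a point read can recompute the bucket from the original key, but a prefix scan (for example, all entities of one flow) would have to be issued once per bucket. A full hash of the key would remove prefix scans altogether, so the trade-off would need to be weighed against how the timeline reader queries the tables.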



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10298) TimeLine entity information only stored in one region when use apache HBase as backend storage

2020-06-01 Thread aimahou (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aimahou updated YARN-10298:
---
Description: 
h2. Issue

TimeLine entity information only stored in one region when use apache HBase as 
backend storage
h2. Probable cause

We found in the source code that, when the HBase timeline writer stores 
timeline entity info, the rowKey is composed of clusterId, userId, flowName, 
flowRunId and appId. Because HBase sorts rows in lexicographic (dictionary) 
order of the rowKey, entities that share these leading fields sort next to 
each other, which is probably why timeline entities end up in only one region 
or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java
{quote}public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {

...

boolean isApplication = ApplicationEntity.isApplicationEntity(te);
byte[] rowKey;
if (isApplication) {
 ApplicationRowKey applicationRowKey =
 new ApplicationRowKey(clusterId, userId, flowName, flowRunId,
 appId);
 rowKey = applicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);
} else {
 EntityRowKey entityRowKey =
 new EntityRowKey(clusterId, userId, flowName, flowRunId, appId,
 te.getType(), te.getIdPrefix(), te.getId());
 rowKey = entityRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.ENTITY_TABLE);
}

if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) {
 SubApplicationRowKey subApplicationRowKey =
 new SubApplicationRowKey(subApplicationUser, clusterId,
 te.getType(), te.getIdPrefix(), te.getId(), userId);
 rowKey = subApplicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);
}

...

}
{quote}
h2. Suggestion

We could use a hash of the original rowKey as the stored rowKey when writing 
and reading timeline entity data.

  was:
h2. Issue

TimeLine entity information only stored in one region when use apache HBase as 
backend storage
h2. Probable cause

We found in the source code that, when the HBase timeline writer stores 
timeline entity info, the rowKey is composed of clusterId, userId, flowName, 
flowRunId and appId. Because HBase sorts rows in lexicographic (dictionary) 
order of the rowKey, entities that share these leading fields sort next to 
each other, which is probably why timeline entities end up in only one region 
or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java

public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {

...

boolean isApplication = ApplicationEntity.isApplicationEntity(te);
byte[] rowKey;
if (isApplication) {
 ApplicationRowKey applicationRowKey =
 new ApplicationRowKey(clusterId, userId, flowName, flowRunId,
 appId);
 rowKey = applicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);
} else {
 EntityRowKey entityRowKey =
 new EntityRowKey(clusterId, userId, flowName, flowRunId, appId,
 te.getType(), te.getIdPrefix(), te.getId());
 rowKey = entityRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.ENTITY_TABLE);
}

if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) {
 SubApplicationRowKey subApplicationRowKey =
 new SubApplicationRowKey(subApplicationUser, clusterId,
 te.getType(), te.getIdPrefix(), te.getId(), userId);
 rowKey = subApplicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);
}

...

}
h2. Suggestion

We could use a hash of the original rowKey as the stored rowKey when writing 
and reading timeline entity data.


> TimeLine entity information only stored in one region when use apache HBase 
> as backend storage
> --
>
> Key: YARN-10298
> URL: https://issues.apache.org/jira/browse/YARN-10298
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, timelineservice
>Affects Versions: 3.1.1
>Reporter: aimahou
>Priority: Major
>
> h2. Issue
> TimeLine entity information only stored in one region when use apache HBase 
> as backend storage
> h2. Probable cause
> We found in the source code that, when the HBase timeline writer stores 
> timeline entity info, the rowKey is composed of clusterId, userId, flowName, 
> flowRunId and appId. Because HBase sorts rows in lexicographic (dictionary) 
> order of the rowKey, entities that share these leading fields sort next to 
> each other, which is probably why timeline entities end up in only one 
> region or a few adjacent regions.
> h2. Related code snippet
> HBaseTimelineWriterImpl.java
> {quote}public TimelineWriteResponse write(TimelineCollectorContext context,
>  TimelineEntities data, UserGroupInformation callerUgi)
>  throws IOException {
> ...
> boolean isApplication = ApplicationEntity.isApplicationEntity(te);
>  byte[] rowKey;
>  if (isApplication)
> {
> ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, 
> userId, flowName, flowRunId, appId);
> rowKey = applicationRowKey.getRowKey();
> 

[jira] [Created] (YARN-10303) One yarn rest api example of yarn document is error

2020-06-01 Thread bright.zhou (Jira)
bright.zhou created YARN-10303:
--

 Summary: One yarn rest api example of yarn document is error
 Key: YARN-10303
 URL: https://issues.apache.org/jira/browse/YARN-10303
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 3.2.1, 3.1.1
Reporter: bright.zhou
 Attachments: image-2020-06-02-10-27-35-020.png

The deSelects value in the documented REST API example should be 
resourceRequests.

!image-2020-06-02-10-27-35-020.png!
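
For reference, a corrected request might look like the sketch below. It assumes the example in question is the ResourceManager Cluster Applications endpoint (/ws/v1/cluster/apps); the class name and the rm-host:8088 address are placeholders, not taken from the issue.

```java
import java.net.URI;

/** Hypothetical helper for illustration only. */
final class DeSelectsExample {
  private DeSelectsExample() {}

  /** Builds an apps query that asks the RM to omit resourceRequests from each app. */
  static URI appsWithoutResourceRequests(String rmHttpAddress) {
    return URI.create("http://" + rmHttpAddress
        + "/ws/v1/cluster/apps?deSelects=resourceRequests");
  }
}
```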






[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-01 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121484#comment-17121484
 ] 

Zhankun Tang commented on YARN-10302:
-

[~billgraham], thanks for the contribution. Could you please generate a patch 
"git diff trunk...HEAD > YARN-10302-trunk.001.patch", upload it and click 
"submitPatch" to trigger the CI?

> Support custom packing algorithm for FairScheduler
> --
>
> Key: YARN-10302
> URL: https://issues.apache.org/jira/browse/YARN-10302
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: William W. Graham Jr
>Priority: Major
>
> The {{FairScheduler}} class allocates containers to nodes based on the node 
> with the most available memory[0]. Create the ability to instead configure a 
> custom packing algorithm with different logic. For instance for effective 
> auto scaling, a bin packing algorithm might be a better choice.
> 0 - 
> https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043
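
The difference between the two placement strategies mentioned in the description can be sketched as follows. This is an illustration only: Node, PackingPolicy, MOST_AVAILABLE and BEST_FIT are hypothetical names, not FairScheduler classes, and memory is the only resource considered.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a pluggable node-selection policy; not YARN code. */
final class PackingSketch {
  private PackingSketch() {}

  /** Simplified stand-in for a cluster node. */
  static final class Node {
    final String name;
    final long availableMb;

    Node(String name, long availableMb) {
      this.name = name;
      this.availableMb = availableMb;
    }
  }

  /** Picks a node for a request of the given size, if any node can hold it. */
  interface PackingPolicy {
    Optional<Node> pick(List<Node> nodes, long requestMb);
  }

  /** Spreads load: prefers the node with the most available memory. */
  static final PackingPolicy MOST_AVAILABLE = (nodes, requestMb) ->
      nodes.stream()
          .filter(n -> n.availableMb >= requestMb)
          .max(Comparator.comparingLong((Node n) -> n.availableMb));

  /** Bin packing (best fit): prefers the fullest node that still fits. */
  static final PackingPolicy BEST_FIT = (nodes, requestMb) ->
      nodes.stream()
          .filter(n -> n.availableMb >= requestMb)
          .min(Comparator.comparingLong((Node n) -> n.availableMb));
}
```

With nodes offering 8192, 2048 and 4096 MB and a 2048 MB request, MOST_AVAILABLE picks the 8192 MB node while BEST_FIT picks the 2048 MB node, which tends to leave whole nodes empty and therefore easier to scale in.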






[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-01 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121495#comment-17121495
 ] 

Hadoop QA commented on YARN-10300:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 24s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  1s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 43s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 48s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 46s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 13s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 31s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}158m 48s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler |
|   | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26095/artifact/out/Dockerfile |
| JIRA Issue | YARN-10300 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004526/YARN-10300.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux aa1ebd84c7c1 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 9fe4c37c25b |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | 

[jira] [Commented] (YARN-9964) Queue metrics turn negative when relabeling a node with running containers to default partition

2020-06-01 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123338#comment-17123338
 ] 

Manikandan R commented on YARN-9964:


[~jhung] 
The YARN-6492 patch covers this fix too. Can we close this?

> Queue metrics turn negative when relabeling a node with running containers to 
> default partition 
> 
>
> Key: YARN-9964
> URL: https://issues.apache.org/jira/browse/YARN-9964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Priority: Major
>
> YARN-6467 changed the queue metrics logic to update certain metrics only for 
> the default partition. But if an app runs containers on a labeled node, the 
> node is then moved to the default partition, and the container is then 
> released, the container's resources are never added to the queue's allocated 
> resources but are still subtracted from them.
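
The sequence in the description can be replayed with a small model. This is a simplified sketch, not the actual QueueMetrics code: QueueMetricsSketch and its methods are hypothetical, and it assumes the default partition is represented by an empty label.

```java
/** Hypothetical, simplified model of one queue's allocated-memory metric; not YARN code. */
final class QueueMetricsSketch {
  /** Assumption: the default partition is represented by an empty label. */
  static final String DEFAULT_PARTITION = "";

  private long allocatedMb;

  /** Counts an allocation only if the node is in the default partition right now. */
  void onAllocate(String nodePartition, long mb) {
    if (DEFAULT_PARTITION.equals(nodePartition)) {
      allocatedMb += mb;
    }
  }

  /** Counts a release only if the node is in the default partition right now. */
  void onRelease(String nodePartition, long mb) {
    if (DEFAULT_PARTITION.equals(nodePartition)) {
      allocatedMb -= mb;
    }
  }

  long allocatedMb() {
    return allocatedMb;
  }

  /** Replays allocate-on-labeled-node, relabel to default, release. */
  static long replayRelabelScenario() {
    QueueMetricsSketch queue = new QueueMetricsSketch();
    String nodePartition = "x";               // node starts in labeled partition "x"
    queue.onAllocate(nodePartition, 1024);    // skipped: not the default partition
    nodePartition = DEFAULT_PARTITION;        // node is relabeled to the default partition
    queue.onRelease(nodePartition, 1024);     // applied: metric is decremented
    return queue.allocatedMb();               // -1024
  }
}
```

One way to avoid the asymmetry in a model like this would be to remember the partition the container was allocated in and use that same value on release, rather than the node's current label.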






[jira] [Updated] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs

2020-06-01 Thread YCozy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YCozy updated YARN-10301:
-
Description: 
We observed the "Mismatched response." error in the RM's log when an NM gets 
network-partitioned after an RM failover. Here's how it happens:

Initially, we have a sleeper YARN service running in a cluster with two RMs (an 
active RM1 and a standby RM2) and one NM. At some point, we perform an RM 
failover from RM1 to RM2.

RM1's log:
{noformat}
2020-06-01 16:29:20,387 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
standby state{noformat}
RM2's log:
{noformat}
2020-06-01 16:29:27,818 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
active state{noformat}
 

After the RM failover, the NM encounters a network partition and fails to 
register with RM2. In other words, there's no "NodeManager from node *** 
registered" in RM2's log.

 

This does not affect the sleeper YARN service. The sleeper service successfully 
recovers after the RM failover. We can see in RM2's log: 
{noformat}
2020-06-01 16:30:06,703 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = 
REGISTERED{noformat}
 

Then, we stop the sleeper service. In RM2's log, we can see that:
{noformat}
2020-06-01 16:30:12,157 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: 
Successfully stopped service sleeper1{noformat}
And in AM's log, we can see that: 
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO  service.ServiceMaster - 
SHUTDOWN_MSG:{noformat}
 

Some time later, we observe the "Mismatched response" in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
  at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
  at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
  at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
  at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
  at org.apache.hadoop.ipc.Client.call(Client.java:1483)
  at org.apache.hadoop.ipc.Client.call(Client.java:1436)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:354)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 

[jira] [Created] (YARN-10301) "DIGEST-MD5: digest response format violation. Mismatched response." when network partition occurs

2020-06-01 Thread YCozy (Jira)
YCozy created YARN-10301:


 Summary: "DIGEST-MD5: digest response format violation. Mismatched 
response." when network partition occurs
 Key: YARN-10301
 URL: https://issues.apache.org/jira/browse/YARN-10301
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: YCozy


We observed the "Mismatched response." error in the RM's log when an NM gets 
network-partitioned after an RM failover. Here's how it happens:

Initially, we have a sleeper YARN service running in a cluster with two RMs (an 
active RM1 and a standby RM2) and one NM. At some point, we perform an RM 
failover from RM1 to RM2.

RM1's log:

 
{noformat}
2020-06-01 16:29:20,387 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
standby state{noformat}
RM2's log:

 

 
{noformat}
2020-06-01 16:29:27,818 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
active state{noformat}
 

After the RM failover, the NM encounters a network partition and fails to 
register with RM2. In other words, there's no "NodeManager from node *** 
registered" in RM2's log.

 

This does not affect the sleeper YARN service. The sleeper service successfully 
recovers after the RM failover. We can see in RM2's log:

 
{noformat}
2020-06-01 16:30:06,703 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_6_0001_01 State change from LAUNCHED to RUNNING on event = 
REGISTERED{noformat}
 

Then, we stop the sleeper service. In RM2's log, we can see that:

 
{noformat}
2020-06-01 16:30:12,157 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
application_6_0001 unregistered successfully.
...
2020-06-01 16:31:09,861 INFO org.apache.hadoop.yarn.service.webapp.ApiServer: 
Successfully stopped service sleeper1{noformat}
And in AM's log, we can see that:

 

 
{noformat}
2020-06-01 16:30:12,651 [shutdown-hook-0] INFO  service.ServiceMaster - 
SHUTDOWN_MSG:{noformat}
 

Some time later, we observe the "Mismatched response" in RM2's log:
{noformat}
2020-06-01 16:43:20,699 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
  at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:376)
  at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:623)
  at org.apache.hadoop.ipc.Client$Connection.access$2400(Client.java:414)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:827)
  at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:823)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:823)
  at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:414)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1667)
  at org.apache.hadoop.ipc.Client.call(Client.java:1483)
  at org.apache.hadoop.ipc.Client.call(Client.java:1436)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy102.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:147)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy103.stopContainers(Unknown Source)
  at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:153)
  at 

[jira] [Updated] (YARN-6492) Generate queue metrics for each partition

2020-06-01 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated YARN-6492:

Fix Version/s: 2.10.1
   2.9.3

> Generate queue metrics for each partition
> -
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
> Fix For: 2.9.3, 3.2.2, 2.10.1, 3.4.0, 3.3.1, 3.1.5
>
> Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492-branch-2.10.016.patch, YARN-6492-branch-2.10.019.patch, 
> YARN-6492-branch-2.8.014.patch, YARN-6492-branch-2.9.015.patch, 
> YARN-6492-branch-3.1.018.patch, YARN-6492-branch-3.2.017.patch, 
> YARN-6492-junits.patch, YARN-6492.001.patch, YARN-6492.002.patch, 
> YARN-6492.003.patch, YARN-6492.004.patch, YARN-6492.005.WIP.patch, 
> YARN-6492.006.WIP.patch, YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, 
> YARN-6492.009.WIP.patch, YARN-6492.010.WIP.patch, YARN-6492.011.WIP.patch, 
> YARN-6492.012.WIP.patch, YARN-6492.013.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-01 Thread Eric Badger (Jira)
Eric Badger created YARN-10300:
--

 Summary: appMasterHost not set in RM ApplicationSummary when AM 
fails before first heartbeat
 Key: YARN-10300
 URL: https://issues.apache.org/jira/browse/YARN-10300
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Eric Badger
Assignee: Eric Badger


{noformat}
2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=,applicationType=MAPREDUCE
{noformat}

{{appMasterHost=N/A}} should have the AM hostname instead of N/A
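
To find affected applications in existing RM logs, the summary line can be checked mechanically. The following is a minimal sketch, assuming the summary is a comma-separated list of key=value pairs as in the log line above and that no value contains a comma. The class {{AppSummaryCheck}} and its methods are hypothetical helpers for illustration, not part of YARN.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of YARN: inspects an ApplicationSummary line.
class AppSummaryCheck {

  // Splits "k1=v1,k2=v2,..." into an ordered map.
  // Assumption: values never contain a comma.
  static Map<String, String> parse(String summary) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String pair : summary.split(",")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        fields.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return fields;
  }

  // True when the summary carries no usable AM host (field absent or "N/A").
  static boolean isAmHostMissing(String summary) {
    String host = parse(summary).get("appMasterHost");
    return host == null || host.equals("N/A");
  }

  public static void main(String[] args) {
    String line = "appId=application_1586003420099_12444961,state=FAILED,"
        + "appMasterHost=N/A,finalStatus=FAILED";
    System.out.println(isAmHostMissing(line));
  }
}
```

Only the key=value portion of the log line should be passed in; the timestamp and logger prefix would otherwise end up in the first key.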






[jira] [Updated] (YARN-10251) Show extended resources on legacy RM UI.

2020-06-01 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10251:
--
Attachment: YARN-10251.branch-2.10.003.patch

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.branch-2.10.001.patch, 
> YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.






[jira] [Resolved] (YARN-10290) Resourcemanager recover failed when fair scheduler queue acl changed

2020-06-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10290.
--
Resolution: Duplicate

This issue is fixed in YARN-7913.
That change fixes a number of issues around restores that fail.

The change was not backported to Hadoop 2.x

> Resourcemanager recover failed when fair scheduler queue acl changed
> 
>
> Key: YARN-10290
> URL: https://issues.apache.org/jira/browse/YARN-10290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: yehuanhuan
>Priority: Blocker
>
> ResourceManager recovery fails when a fair scheduler queue ACL has changed. 
> Because the queue ACL changed, the application is rejected when it is 
> recovered (addApplication() in FairScheduler). Recovering the application 
> attempt (addApplicationAttempt() in FairScheduler) then finds that the 
> application is null. This leaves the two RMs in standby. To reproduce:
>  
> # A user runs a long-running application.
> # Change the queue ACL (aclSubmitApps) so that the user no longer has permission.
> # Restart the RM.
> {code:java}
> 2020-05-25 16:04:06,191 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1590393162216_0005 with final state: FAILED
> 2020-05-25 16:04:06,192 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:663)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1246)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1072)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1036)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:845)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:102)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:897)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:850)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:723)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:322)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:427)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1173)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:584)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:980)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1021)
> at 
> 

[jira] [Commented] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-06-01 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120998#comment-17120998
 ] 

Prabhu Joseph commented on YARN-10259:
--

Have cherry-picked to branch-3.3.1. Thanks.

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement
> ---
>
> Key: YARN-10259
> URL: https://issues.apache.org/jira/browse/YARN-10259
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
> Attachments: YARN-10259-001.patch, YARN-10259-002.patch, 
> YARN-10259-003.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes 
> in CandidateNodeSet in MultiNodePlacement. 
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets 
> placed in h2.
> 4. Submit app3 AM which is reserved in h1
> 5. Kill app2 which frees space in h2.
> 6. app3 AM never gets ALLOCATED
> RM logs show the YARN-8127 fix rejecting the allocation proposal for app3 AM 
> on h2, because it expects the assignment to be on the same node where the 
> reservation was made.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] 
> scheduler.SchedulerApplicationAttempt 
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt 
> appattempt_1588684773609_0003_01 reserved container 
> container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 
> available= used=. This attempt 
> currently has 1 reserved containers at priority 0; currentReservation 
> 
> 2020-05-05 18:49:37,264 INFO  [AsyncDispatcher event handler] 
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved 
> container=container_1588684773609_0003_01_01, on node=host: h1:1234 
> #containers=1 available= used= 
> with resource=
>RESERVED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h1:1234; Resource=)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test] 
> allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: 
> node=h2 application=application_1588684773609_0003 priority=0 
> pendingAsk=,repeat=1> 
> type=OFF_SWITCH
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate 
> from reserved container container_1588684773609_0003_01_01, but node is 
> not reserved
>ALLOCATED=[(Application=appattempt_1588684773609_0003_01; 
> Node=h2:1234; Resource=)]
> {code}
> Attached testcase which reproduces the issue.
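
The repro in the description above can be walked through with a toy model. This is only a sketch: the class {{ReservationModel}} and its methods are hypothetical and do not exist in YARN, resources are reduced to memory in GB, and the size of the app3 AM (5GB) is an assumption since the description does not state it. The "relaxed" rule is one possible alternative shown for contrast, not necessarily what the attached patches do.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the repro; none of these names exist in YARN.
class ReservationModel {
  final Map<String, Integer> freeGb = new HashMap<>();
  String reservedNode; // node holding app3's reservation, or null

  // Strict rule from the description: a reserved container may only be
  // allocated on the node where the reservation was made.
  boolean strictAccept(String node, int askGb) {
    return node.equals(reservedNode) && freeGb.get(node) >= askGb;
  }

  // Relaxed rule (an assumption for illustration): any candidate node
  // with enough free space may satisfy the reserved request.
  boolean relaxedAccept(String node, int askGb) {
    return freeGb.get(node) >= askGb;
  }

  public static void main(String[] args) {
    ReservationModel m = new ReservationModel();
    m.freeGb.put("h1", 8 - 5); // app1 AM (5GB) runs on h1
    m.freeGb.put("h2", 8 - 5); // app2 AM (5GB) runs on h2
    m.reservedNode = "h1";     // app3 AM (assumed 5GB) is reserved on h1
    m.freeGb.put("h2", 8);     // app2 is killed, h2 is free again

    System.out.println("strict h1:  " + m.strictAccept("h1", 5));
    System.out.println("strict h2:  " + m.strictAccept("h2", 5));
    System.out.println("relaxed h2: " + m.relaxedAccept("h2", 5));
  }
}
```

Under the strict rule neither node accepts the request: h1 holds the reservation but lacks space, and h2 has space but holds no reservation, which matches step 6 of the repro.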






[jira] [Updated] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement

2020-06-01 Thread Prabhu Joseph (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10259:
-
Fix Version/s: 3.3.1







[jira] [Updated] (YARN-10298) TimeLine entity information only stored in one region when use apache HBase as backend storage

2020-06-01 Thread aimahou (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aimahou updated YARN-10298:
---
Description: 
h2. Issue

TimeLine entity information is stored in only one region when Apache HBase is 
used as backend storage.
h2. Probable cause

We found in the source code that the rowKey is composed of clusterId, userId, 
flowName, flowRunId and appId when the HBase timeline writer stores timeline 
entity info. Since HBase sorts rowKeys in lexicographic (dictionary) order, 
this probably places the entities next to each other, so timeline entities may 
be stored in only one region or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java
{quote} 
{code:java}
public TimelineWriteResponse write(TimelineCollectorContext context,
    TimelineEntities data, UserGroupInformation callerUgi)
    throws IOException {
  ...
  boolean isApplication = ApplicationEntity.isApplicationEntity(te);
  byte[] rowKey;
  if (isApplication) {
    ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId,
        userId, flowName, flowRunId, appId);
    rowKey = applicationRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);
  } else {
    EntityRowKey entityRowKey = new EntityRowKey(clusterId, userId, flowName,
        flowRunId, appId, te.getType(), te.getIdPrefix(), te.getId());
    rowKey = entityRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.ENTITY_TABLE);
  }
  if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) {
    SubApplicationRowKey subApplicationRowKey = new SubApplicationRowKey(
        subApplicationUser, clusterId, te.getType(), te.getIdPrefix(),
        te.getId(), userId);
    rowKey = subApplicationRowKey.getRowKey();
    store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);
  }
  ...
}
{code}
 
{quote}
h2. Suggestion

We can use the hash code of original rowKey as the rowKey to store and read 
timeline entity data.
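
A minimal sketch of what this suggestion could look like, assuming the hash is an MD5 digest of the original rowKey bytes. The class {{RowKeyHasher}} and the sample key strings are hypothetical and not part of the timeline service code; the real rowKey encoding is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the timeline service code.
class RowKeyHasher {

  // Returns the 16-byte MD5 digest of the original row key.
  static byte[] hashedRowKey(byte[] originalRowKey) {
    try {
      return MessageDigest.getInstance("MD5").digest(originalRowKey);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Made-up keys that share a long common prefix.
    byte[] a = "cluster1!user1!flow1!1!app_0001".getBytes(StandardCharsets.UTF_8);
    byte[] b = "cluster1!user1!flow1!1!app_0002".getBytes(StandardCharsets.UTF_8);
    // Print the first byte of each hashed key; adjacent inputs usually diverge here.
    System.out.printf("%02x %02x%n", hashedRowKey(a)[0], hashedRowKey(b)[0]);
  }
}
```

One trade-off to keep in mind: a fully hashed key spreads writes across regions, but keys that shared a prefix are no longer contiguous, so any reader that scans by key prefix would need a different lookup strategy.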

  was:
h2. Issue

TimeLine entity information is stored in only one region when Apache HBase is 
used as backend storage.
h2. Probable cause

We found in the source code that the rowKey is composed of clusterId, userId, 
flowName, flowRunId and appId when the HBase timeline writer stores timeline 
entity info. Since HBase sorts rowKeys in lexicographic (dictionary) order, 
this probably places the entities next to each other, so timeline entities may 
be stored in only one region or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java
{quote} 
{code:java}
 else
public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {
 ...
 boolean isApplication = ApplicationEntity.isApplicationEntity(te);
 byte[] rowKey;
 if (isApplication){ 
 ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, userId, 
flowName, flowRunId, appId); rowKey = applicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE); 
 }else { 
 EntityRowKey entityRowKey = new EntityRowKey(clusterId, userId, flowName, 
flowRunId, appId, te.getType(), te.getIdPrefix(), te.getId()); 
 rowKey = entityRowKey.getRowKey(); 
 store(rowKey, te, flowVersion, Tables.ENTITY_TABLE); 
 }
 if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) { 
 SubApplicationRowKey subApplicationRowKey = new 
SubApplicationRowKey(subApplicationUser, clusterId, te.getType(), 
te.getIdPrefix(), te.getId(), userId);
 rowKey = subApplicationRowKey.getRowKey(); 
 store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE); }
...
}
{code}
 
{quote}
h2. Suggestion

We can use the hash code of original rowKey as the rowKey to store and read 
timeline entity data.


> TimeLine entity information only stored in one region when use apache HBase 
> as backend storage
> --
>
> Key: YARN-10298
> URL: https://issues.apache.org/jira/browse/YARN-10298
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, timelineservice
>Affects Versions: 3.1.1
>Reporter: aimahou
>Priority: Major
>
> h2. Issue
> TimeLine entity information is stored in only one region when Apache HBase 
> is used as backend storage.
> h2. Probable cause
> We found in the source code that the rowKey is composed of clusterId, 
> userId, flowName, flowRunId and appId when the HBase timeline writer stores 
> timeline entity info. Since HBase sorts rowKeys in lexicographic 
> (dictionary) order, this probably places the entities next to each other, so 
> timeline entities may be stored in only one region or a few adjacent regions.
> h2. Related code snippet
> HBaseTimelineWriterImpl.java
> {quote} 
> {code:java}
> public TimelineWriteResponse write(TimelineCollectorContext context,
>  TimelineEntities data, UserGroupInformation callerUgi)
>  throws IOException {
>  ...
>  boolean isApplication = ApplicationEntity.isApplicationEntity(te);
>  byte[] rowKey;
>  if (isApplication){ 
>  ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, 

[jira] [Updated] (YARN-10298) TimeLine entity information only stored in one region when use apache HBase as backend storage

2020-06-01 Thread aimahou (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

aimahou updated YARN-10298:
---
Description: 
h2. Issue

TimeLine entity information is stored in only one region when Apache HBase is 
used as backend storage.
h2. Probable cause

We found in the source code that the rowKey is composed of clusterId, userId, 
flowName, flowRunId and appId when the HBase timeline writer stores timeline 
entity info. Since HBase sorts rowKeys in lexicographic (dictionary) order, 
this probably places the entities next to each other, so timeline entities may 
be stored in only one region or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java
{quote} 
{code:java}
 else
public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {
 ...
 boolean isApplication = ApplicationEntity.isApplicationEntity(te);
 byte[] rowKey;
 if (isApplication){ 
 ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, userId, 
flowName, flowRunId, appId); rowKey = applicationRowKey.getRowKey();
 store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE); 
 }else { 
 EntityRowKey entityRowKey = new EntityRowKey(clusterId, userId, flowName, 
flowRunId, appId, te.getType(), te.getIdPrefix(), te.getId()); 
 rowKey = entityRowKey.getRowKey(); 
 store(rowKey, te, flowVersion, Tables.ENTITY_TABLE); 
 }
 if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te)) { 
 SubApplicationRowKey subApplicationRowKey = new 
SubApplicationRowKey(subApplicationUser, clusterId, te.getType(), 
te.getIdPrefix(), te.getId(), userId);
 rowKey = subApplicationRowKey.getRowKey(); 
 store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE); }
...
}
{code}
 
{quote}
h2. Suggestion

We can use the hash code of original rowKey as the rowKey to store and read 
timeline entity data.

  was:
h2. Issue

TimeLine entity information is stored in only one region when Apache HBase is 
used as backend storage.
h2. Probable cause

We found in the source code that the rowKey is composed of clusterId, userId, 
flowName, flowRunId and appId when the HBase timeline writer stores timeline 
entity info. Since HBase sorts rowKeys in lexicographic (dictionary) order, 
this probably places the entities next to each other, so timeline entities may 
be stored in only one region or a few adjacent regions.
h2. Related code snippet

HBaseTimelineWriterImpl.java
{quote}public TimelineWriteResponse write(TimelineCollectorContext context,
 TimelineEntities data, UserGroupInformation callerUgi)
 throws IOException {

...

boolean isApplication = ApplicationEntity.isApplicationEntity(te);
 byte[] rowKey;
 if (isApplication)

{

ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, userId, 
flowName, flowRunId, appId);

rowKey = applicationRowKey.getRowKey();

store(rowKey, te, flowVersion, Tables.APPLICATION_TABLE);

}

else

{

EntityRowKey entityRowKey = new EntityRowKey(clusterId, userId, flowName, 
flowRunId, appId, te.getType(), te.getIdPrefix(), te.getId());

rowKey = entityRowKey.getRowKey();

store(rowKey, te, flowVersion, Tables.ENTITY_TABLE); }

if (!isApplication && SubApplicationEntity.isSubApplicationEntity(te))

{

SubApplicationRowKey subApplicationRowKey = new 
SubApplicationRowKey(subApplicationUser, clusterId, te.getType(), 
te.getIdPrefix(), te.getId(), userId);

rowKey = subApplicationRowKey.getRowKey();

store(rowKey, te, flowVersion, Tables.SUBAPPLICATION_TABLE);

}

...

}
{quote}
h2. Suggestion

We can use the hash code of original rowKey as the rowKey to store and read 
timeline entity data.


> TimeLine entity information only stored in one region when use apache HBase 
> as backend storage
> --
>
> Key: YARN-10298
> URL: https://issues.apache.org/jira/browse/YARN-10298
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, timelineservice
>Affects Versions: 3.1.1
>Reporter: aimahou
>Priority: Major
>
> h2. Issue
> TimeLine entity information is stored in only one region when Apache HBase 
> is used as backend storage.
> h2. Probable cause
> We found in the source code that the rowKey is composed of clusterId, 
> userId, flowName, flowRunId and appId when the HBase timeline writer stores 
> timeline entity info. Since HBase sorts rowKeys in lexicographic 
> (dictionary) order, this probably places the entities next to each other, so 
> timeline entities may be stored in only one region or a few adjacent regions.
> h2. Related code snippet
> HBaseTimelineWriterImpl.java
> {quote} 
> {code:java}
>  else
> public TimelineWriteResponse write(TimelineCollectorContext context,
>  TimelineEntities data, UserGroupInformation callerUgi)
>  throws IOException {
>  ...
>  boolean isApplication = ApplicationEntity.isApplicationEntity(te);
>  byte[] rowKey;
>  if (isApplication){ 
>  ApplicationRowKey applicationRowKey = new ApplicationRowKey(clusterId, 
> userId, 

[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-01 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120982#comment-17120982
 ] 

Prabhu Joseph commented on YARN-10293:
--

Thanks [~wangda] for reviewing.  

The older single-node allocation path skips scheduling on a node when the node 
has a reserved container or no available containers.
  
{code}
   if (calculator.computeAvailableContainers(Resources
.add(node.getUnallocatedResource(), 
node.getTotalKillableResources()),
minimumAllocation) <= 0) {
{code}

Multi-node placement checks the used partition capacity, which includes the 
reserved capacity. But there can still be nodes with available containers, and 
these are ignored (as per the JIRA description).
  
{code}
  if (getRootQueue().getQueueCapacities().getUsedCapacity(
  candidates.getPartition()) >= 1.0f
  && preemptionManager.getKillableResource(
{code}
  
This condition can be removed; I don't see any impact. [~Tao Yang], can you 
confirm?

The other approaches are the one in the patch, or adding an extra check for 
available containers on any node in the candidate set, in addition to the 
checks above.
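
The last alternative, an extra check for available containers on any candidate node, could look roughly like the toy model below. It is a sketch under stated assumptions: {{MultiNodeSkipCheck}}, {{Node}} and {{canSkipScheduling}} are hypothetical names rather than CapacityScheduler APIs, and resources are reduced to memory in MB.

```java
import java.util.Arrays;
import java.util.List;

// Toy model; these names are illustrative and do not exist in YARN.
class MultiNodeSkipCheck {

  static class Node {
    final long unallocatedMb;
    final long killableMb;

    Node(long unallocatedMb, long killableMb) {
      this.unallocatedMb = unallocatedMb;
      this.killableMb = killableMb;
    }
  }

  // Number of minimum-size containers that fit on a node,
  // counting killable resources as available.
  static long availableContainers(Node n, long minAllocMb) {
    return (n.unallocatedMb + n.killableMb) / minAllocMb;
  }

  // Skip scheduling only if the partition looks full AND
  // no candidate node can fit at least one container.
  static boolean canSkipScheduling(float usedCapacity, List<Node> candidates,
      long minAllocMb) {
    if (usedCapacity < 1.0f) {
      return false;
    }
    for (Node n : candidates) {
      if (availableContainers(n, minAllocMb) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Used capacity is 1.0f because of a reservation, yet one node has 1GB free.
    List<Node> candidates = Arrays.asList(
        new Node(0, 0), new Node(1024, 0), new Node(0, 0));
    System.out.println(canSkipScheduling(1.0f, candidates, 1024));
  }
}
```

With the per-node check in place, a used capacity of 1.0f that stems from a reservation no longer hides a node that still has room for a container.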
  
  
  
  



> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> One more bug was found in the CapacityScheduler.java code, which causes the 
> same issue with a slightly different repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores -  8GB. 8cores
> Node2 -  8GB, 8vcores - 8GB. 8cores
> Node3 -  8GB, 8vcores - 8GB. 8cores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA is submitted to queue A and uses the full cluster (24GB and 24 vcores)
> 2. JobB is submitted to queue B with an AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and the used capacity drops below 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why RegularContainerAllocator reserved the container when the used capacity 
> is <= 1.0f ?*
> {code}
> The reason is even though the container is preempted - nodemanager has to 
> stop the container and heartbeat and update the available and unallocated 
> resources to ResourceManager.
> {code}
> 5. Now no new allocation happens and the reserved container stays reserved.
> After the reservation the used capacity becomes 1.0f; the flow below then 
> loops and no new allocation or reservation happens. The reserved container 
> cannot be allocated because the reserved node does not have space. node2 has 
> space for 1GB, 1 vcore, but CapacityScheduler#allocateOrReserveNewContainers 
> is not called, causing the hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> 

[jira] [Resolved] (YARN-10289) spark on yarn execption

2020-06-01 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved YARN-10289.
---
Resolution: Invalid

> spark on yarn execption 
> 
>
> Key: YARN-10289
> URL: https://issues.apache.org/jira/browse/YARN-10289
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.3
> Environment: hadoop 3.0.0
>Reporter: huang xin
>Priority: Major
>
> I ran Spark on YARN and got an error like this:
> stderr96Error: Could not find or load main class 
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> prelaunch.out70Setting up env variables2_03? 
> Setting up job resources
> Launching container
> stderr96Error: Could not find or load main class 
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> stdout0(_1590115508504_0033_02_01Ωcontainer-localizer-syslog1842020-05-24
>  15:39:20,867 INFO [main] 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer:
>  Disk Validator: yarn.nodemanager.disk-validator is loaded.
> prelaunch.out70Setting up env variables
> Setting up job resources
> Launching container
> stderr333ERROR StatusLogger No log4j2 configuration file found. Using default 
> configuration: logging only errors to the console. Set system property 
> 'org.apache.logging.log4j.simplelog.StatusLogger.level' to TRACE to show 
> Log4j2 internal initialization logging.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> prelaunch.out70Setting up env variables1_05? 
> Setting up job resources
> Launching container
> stderr96Error: Could not find or load main class 
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> prelaunch.out70Setting up env variables1_04? 
> Setting up job resources
> Launching container
> stderr96Error: Could not find or load main class 
> org.apache.spark.executor.CoarseGrainedExecutorBackend
> stdout0 
>  VERSION*(_1590115508504_0033_01_0none??data:BCFile.indexnone?






[jira] [Commented] (YARN-10289) spark on yarn execption

2020-06-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121009#comment-17121009
 ] 

Steve Loughran commented on YARN-10289:
---

# Looks more like a Spark error.
# And a config one, so not a bug in their code. Check your classpath.

Take it up on the Spark mailing lists.







[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-06-01 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121458#comment-17121458
 ] 

Hadoop QA commented on YARN-10251:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m 
38s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 20s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
39s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
45s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
25s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
35s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 54s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 4 new + 
34 unchanged - 0 fixed = 38 total (was 34) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 40s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
44s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
37s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 31s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}181m 34s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
|  |  Possible null pointer dereference of rmApp 

[jira] [Commented] (YARN-10251) Show extended resources on legacy RM UI.

2020-06-01 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121359#comment-17121359
 ] 

Hadoop QA commented on YARN-10251:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 17m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.10 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
12s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 12m 
23s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
44s{color} | {color:green} branch-2.10 passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
15s{color} | {color:green} branch-2.10 passed with JDK Private 
Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} branch-2.10 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} branch-2.10 passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
17s{color} | {color:red} hadoop-yarn-server-common in branch-2.10 failed with 
JDK Oracle Corporation-1.7.0_95-b00. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
18s{color} | {color:red} hadoop-yarn-server-resourcemanager in branch-2.10 
failed with JDK Oracle Corporation-1.7.0_95-b00. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} branch-2.10 passed with JDK Private 
Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 
29s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
33s{color} | {color:green} branch-2.10 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
17s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
40s{color} | {color:green} the patch passed with JDK Oracle 
Corporation-1.7.0_95-b00 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_252-8u252-b09-1~16.04-b09 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 38s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 4 new + 
43 unchanged - 1 fixed = 47 total (was 44) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
29s{color} | {color:red} 
hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkOracleCorporation-1.7.0_95-b00
 with JDK Oracle Corporation-1.7.0_95-b00 generated 4 new + 0 unchanged - 0 
fixed = 4 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
19s{color} | {color:green} hadoop-yarn-server-common in the patch passed with 
JDK Private 

[jira] [Created] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-01 Thread William W. Graham Jr (Jira)
William W. Graham Jr created YARN-10302:
---

 Summary: Support custom packing algorithm for FairScheduler
 Key: YARN-10302
 URL: https://issues.apache.org/jira/browse/YARN-10302
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: William W. Graham Jr


The {{FairScheduler}} class allocates containers to nodes based on the node 
with the most available memory[0]. Create the ability to instead configure a 
custom packing algorithm with different logic. For instance, for effective 
autoscaling, a bin-packing algorithm might be a better choice.

0 - 
https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043
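
To make the contrast between the two placement policies concrete, here is a minimal sketch. It assumes a toy model in which a node is described only by its free memory; the `Node` class, the `NodeSelector` interface and both selector constants are hypothetical names invented for this illustration and are not part of FairScheduler or of the linked pull request.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in Hadoop YARN.
final class PackingSketch {

  /** Minimal stand-in for a cluster node; only free memory is modelled. */
  static final class Node {
    final String name;
    final long availableMb;

    Node(String name, long availableMb) {
      this.name = name;
      this.availableMb = availableMb;
    }
  }

  /** Pluggable policy: pick a node for a request of the given size. */
  interface NodeSelector {
    Optional<Node> select(List<Node> nodes, long requestMb);
  }

  /** Spreads load: picks the fitting node with the most free memory. */
  static final NodeSelector MOST_AVAILABLE = (nodes, requestMb) ->
      nodes.stream()
          .filter(n -> n.availableMb >= requestMb)
          .max(Comparator.comparingLong((Node n) -> n.availableMb));

  /** Best-fit bin packing: picks the fitting node with the least free memory. */
  static final NodeSelector BEST_FIT = (nodes, requestMb) ->
      nodes.stream()
          .filter(n -> n.availableMb >= requestMb)
          .min(Comparator.comparingLong((Node n) -> n.availableMb));

  /** Convenience wrapper returning the chosen node name, or "none". */
  static String pick(NodeSelector selector, List<Node> nodes, long requestMb) {
    return selector.select(nodes, requestMb).map(n -> n.name).orElse("none");
  }

  /** Three nodes with 8 GB, 2 GB and 4 GB free. */
  static List<Node> sampleNodes() {
    return Arrays.asList(
        new Node("node1", 8192), new Node("node2", 2048), new Node("node3", 4096));
  }
}
```

With the sample nodes, a 2048 MB request lands on `node1` under `MOST_AVAILABLE` and on `node2` under `BEST_FIT`. Best fit tends to fill partly used nodes first, which may leave other nodes idle for an autoscaler to release; whether that trade-off is desirable depends on the workload.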






[jira] [Updated] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-01 Thread William W. Graham Jr (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William W. Graham Jr updated YARN-10302:


https://github.com/apache/hadoop/pull/2044

> Support custom packing algorithm for FairScheduler
> --
>
> Key: YARN-10302
> URL: https://issues.apache.org/jira/browse/YARN-10302
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: William W. Graham Jr
>Priority: Major
>
> The {{FairScheduler}} class allocates containers to nodes based on the node 
> with the most available memory[0]. Create the ability to instead configure a 
> custom packing algorithm with different logic. For instance, for effective 
> autoscaling, a bin-packing algorithm might be a better choice.
> 0 - 
> https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043






[jira] [Updated] (YARN-10251) Show extended resources on legacy RM UI.

2020-06-01 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10251:
--
Attachment: YARN-10251.003.patch

> Show extended resources on legacy RM UI.
> 
>
> Key: YARN-10251
> URL: https://issues.apache.org/jira/browse/YARN-10251
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: Legacy RM UI With Not All Resources Shown.png, Updated 
> NodesPage UI With GPU columns.png, Updated RM UI With All Resources 
> Shown.png.png, YARN-10251.003.patch, YARN-10251.branch-2.10.001.patch, 
> YARN-10251.branch-2.10.002.patch, YARN-10251.branch-2.10.003.patch
>
>
> It would be great to update the legacy RM UI to include GPU resources in the 
> overview and in the per-app sections.






[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)

2020-06-01 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121419#comment-17121419
 ] 

Wangda Tan commented on YARN-10293:
---

[~prabhujoseph], I agree with you. I think the entire {{if}} check is helpful 
when the cluster is full: we won't go into the allocation phase, which saves 
some CPU cycles.  

However, it won't matter much if the cluster is full, since we cannot get a 
container allocation in any case. I suggest simplifying this logic by removing 
the {{if}} check; it sounds dangerous to me. If we see it cause a performance 
issue, we can solve it in a different way (like increasing the wait time if 
nothing can be allocated or reserved).
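
One possible reading of "increasing the wait time if nothing can be allocated or reserved" is a capped exponential backoff between scheduling passes. The sketch below only illustrates that idea; the class name, the doubling factor and the method are assumptions made for this example and do not correspond to any CapacityScheduler code or configuration.

```java
// Hypothetical sketch: not CapacityScheduler code.
final class SchedulingBackoffSketch {
  private final long baseWaitMs;
  private final long maxWaitMs;
  private long currentWaitMs;

  SchedulingBackoffSketch(long baseWaitMs, long maxWaitMs) {
    this.baseWaitMs = baseWaitMs;
    this.maxWaitMs = maxWaitMs;
    this.currentWaitMs = baseWaitMs;
  }

  /**
   * Records the outcome of one scheduling pass and returns how long to wait
   * before the next one. A productive pass resets the wait to the base value;
   * an unproductive pass doubles it, up to the configured maximum.
   */
  long nextWaitMs(boolean allocatedOrReserved) {
    if (allocatedOrReserved) {
      currentWaitMs = baseWaitMs;
    } else {
      currentWaitMs = Math.min(maxWaitMs, currentWaitMs * 2);
    }
    return currentWaitMs;
  }
}
```

On a full cluster the wait grows quickly to the cap, so the passes that cannot succeed cost little CPU, while the first successful allocation or reservation restores the normal pace.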

> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> 
>
> Key: YARN-10293
> URL: https://issues.apache.org/jira/browse/YARN-10293
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10293-001.patch, YARN-10293-002.patch
>
>
> Reserved Containers not allocated from available space of other nodes in 
> CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues 
> related to it 
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> Have found one more bug in the CapacityScheduler.java code which causes the 
> same issue with slight difference in the repro.
> *Repro:*
> *Nodes :   Available : Used*
> Node1 -  8GB, 8vcores - 8GB, 8vcores
> Node2 -  8GB, 8vcores - 8GB, 8vcores
> Node3 -  8GB, 8vcores - 8GB, 8vcores
> Queues -> A and B both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA is submitted to queue A and uses the full cluster: 24GB and 24 vcores
> 2. JobB is submitted to queue B with an AM size of 1GB
> {code}
> 2020-05-21 12:12:27,313 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  
> IP=172.27.160.139   OPERATION=Submit Application Request
> TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005  
>   CALLERCONTEXT=CLI   QUEUENAME=dummy
> {code}
> 3. Preemption happens and the used capacity is less than 1.0f
> {code}
> 2020-05-21 12:12:48,222 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics:
>  Non-AM container preempted, current 
> appAttemptId=appattempt_1590046667304_0004_01, 
> containerId=container_e09_1590046667304_0004_01_24, 
> resource=
> {code}
> 4. JobB gets a Reserved Container as part of 
> CapacityScheduler#allocateOrReserveNewContainer
> {code}
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to 
> RESERVED
> 2020-05-21 12:12:48,226 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved container=container_e09_1590046667304_0005_01_01, on node=host: 
> tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 
> available= used= with 
> resource=
> {code}
> *Why did RegularContainerAllocator reserve the container when the used 
> capacity is <= 1.0f?*
> {code}
> The reason is that even though the container is preempted, the nodemanager 
> still has to stop the container, heartbeat, and report the updated available 
> and unallocated resources to the ResourceManager.
> {code}
> 5. Now no new allocation happens and the reserved container stays reserved.
> After the reservation the used capacity becomes 1.0f, so the scheduler keeps 
> running the loop below and no new allocation or reservation happens. The 
> reserved container cannot be allocated because the reserved node does not 
> have space. node2 has space for 1GB, 1vcore, but 
> CapacityScheduler#allocateOrReserveNewContainers is not called, causing the 
> hang.
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> 
> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container 
> on node*
> {code}
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Trying to fulfill reservation for application application_1590046667304_0005 
> on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> assignContainers: partition= #applications=1
> 2020-05-21 12:13:33,242 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
>  Reserved 

[jira] [Updated] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat

2020-06-01 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10300:
---
Attachment: YARN-10300.001.patch

> appMasterHost not set in RM ApplicationSummary when AM fails before first 
> heartbeat
> ---
>
> Key: YARN-10300
> URL: https://issues.apache.org/jira/browse/YARN-10300
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-10300.001.patch
>
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: 
> appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https
>  
> ://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=  vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A






[jira] [Assigned] (YARN-9964) Queue metrics turn negative when relabeling a node with running containers to default partition

2020-06-01 Thread Manikandan R (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R reassigned YARN-9964:
--

Assignee: Manikandan R

> Queue metrics turn negative when relabeling a node with running containers to 
> default partition 
> 
>
> Key: YARN-9964
> URL: https://issues.apache.org/jira/browse/YARN-9964
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Manikandan R
>Priority: Major
>
> YARN-6467 changed the queue metrics logic to update certain metrics only for 
> the default partition. But if an app runs containers on a labeled node, the 
> node is then moved to the default partition, and the container is then 
> released, the container's resource was never added to the queue's allocated 
> resource but is subtracted from it on release.






[jira] [Resolved] (YARN-9767) PartitionQueueMetrics Issues

2020-06-01 Thread Manikandan R (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R resolved YARN-9767.

Resolution: Fixed

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of this Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately, for ease of tracking.
> Observations:
> Please refer to 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Partition info is extracted from both the request and the node, which 
> causes a problem. For example, 
>  
> Node N has been mapped to Label X (non-exclusive). Queue A has been 
> configured with the ANY node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reason. During the 
> AbstractCSQueue#allocateResource call, the node partition (from SchedulerNode) 
> is used for the calculation. Let's say the allocate call has been fired for 3 
> containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. The app request was fired without any label specification, 
> which is how metric #a was derived. After allocation is over, pending 
> resources are usually decreased. When this happens, the node partition info is 
> used, which is how metric #b was derived. 
>  
> Given this kind of situation, we will need to put some thought into computing 
> the metrics correctly.
>  
> 2. Though the intent of this Jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from the Jira's discussion).
> With this patch and the YARN-9596 patch, the queue metrics (for queues) would 
> be overridden with either some specific partition's values or the default 
> partition's values. It could be vice versa as well. For example, after a queue 
> (say queue A) has been initialised with some min and max cap and also with a 
> node label's min and max cap, the queue metrics (availableMB) for queue A 
> return values based on the node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and 
> availableVcores are correct (please refer to observation #2 above). Added more 
> asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the fix 
> for #2 is working properly.
> One more thing to note: user metrics for availableMB and availableVcores 
> at the root queue were not there even before, and the same behaviour is 
> retained. User metrics for availableMB and availableVcores are available only 
> at the child queue level and also with partitions.
>  
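
The arithmetic behind observation 1 can be replayed with a toy model. The sketch below assumes pending memory is tracked in a per-partition map, incremented under the partition named in the request and decremented under the partition of the node that ran the container. The class and method names are hypothetical and chosen only for this illustration; this is not the QueueMetrics implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a toy model, not the QueueMetrics implementation.
final class PendingMetricsSketch {
  private final Map<String, Long> pendingMbByPartition = new HashMap<>();

  /** A request arrives; pending memory is added under the given partition. */
  void onRequest(String partition, long mb) {
    pendingMbByPartition.merge(partition, mb, Long::sum);
  }

  /** A container is allocated; pending memory is reduced under the given partition. */
  void onAllocate(String partition, long mb) {
    pendingMbByPartition.merge(partition, -mb, Long::sum);
  }

  /** Pending memory currently recorded for a partition (0 if never touched). */
  long pendingMb(String partition) {
    return pendingMbByPartition.getOrDefault(partition, 0L);
  }

  /** Replays the scenario from the description: 3 containers of 1 GB each. */
  static PendingMetricsSketch mismatchedPartitions() {
    PendingMetricsSketch m = new PendingMetricsSketch();
    for (int i = 0; i < 3; i++) {
      m.onRequest("default", 1024); // the request carries no label
    }
    for (int i = 0; i < 3; i++) {
      m.onAllocate("X", 1024);      // the decrement uses the node's partition
    }
    return m;
  }
}
```

After `mismatchedPartitions()`, the default partition still shows 3072 MB pending and partition `X` shows -3072 MB, matching outcomes a and b in the description. Using the same partition key on both sides of the update would keep each counter at zero in this model.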






[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2020-06-01 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123337#comment-17123337
 ] 

Manikandan R commented on YARN-9767:


The YARN-6492 patch covered these fixes too, hence closing this.







[jira] [Updated] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-01 Thread Adam Antal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Antal updated YARN-10284:
--
Attachment: YARN-10284.004.patch

> Add lazy initialization of LogAggregationFileControllerFactory in LogServlet
> 
>
> Key: YARN-10284
> URL: https://issues.apache.org/jira/browse/YARN-10284
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-10284.001.patch, YARN-10284.002.patch, 
> YARN-10284.003.patch, YARN-10284.004.patch
>
>
> Suppose the {{mapred}} user has no access to the remote folder. Pinging the 
> JHS every few seconds to check whether it is online will produce the 
> following entry in the log:
> {noformat}
> 2020-05-19 00:17:20,331 WARN 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController:
>  Unable to determine if the filesystem supports append operation
> java.nio.file.AccessDeniedException: test-bucket: 
> org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: There is no mapped role 
> for the group(s) associated with the authenticated user. (user: mapred)
>   at 
> org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:204)
> [...]
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:513)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.getRollOverLogMaxSize(LogAggregationIndexedFileController.java:1157)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initInternal(LogAggregationIndexedFileController.java:149)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileController.initialize(LogAggregationFileController.java:135)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileControllerFactory.<init>(LogAggregationFileControllerFactory.java:139)
>   at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.<init>(LogServlet.java:66)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.<init>(HsWebServices.java:99)
>   at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices$$FastClassByGuice$$1eb8d5d6.newInstance()
>   at 
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> [...]
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> We should create the {{LogAggregationFileControllerFactory}} instance only 
> when we actually need it, not every time a {{LogServlet}} object is 
> instantiated (so definitely not in the constructor). This avoids putting 
> pressure on the S3A auth side, especially when the authentication request is 
> a costly operation.
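
The lazy-initialization idea described above can be sketched as follows. This is a minimal illustration and not the actual YARN-10284 patch: the names Lazy, ExpensiveFactory and LazyServlet are hypothetical stand-ins for the real Hadoop classes, and the counter exists only to make the effect observable.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Memoizing supplier: the delegate is invoked on the first get(), not at construction. */
final class Lazy<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private T value; // guarded by "this"

  Lazy(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized T get() {
    if (value == null) {
      // If the delegate throws, value stays null and the next call retries.
      value = delegate.get();
    }
    return value;
  }
}

/** Hypothetical stand-in for a factory whose construction is costly (e.g. remote auth). */
final class ExpensiveFactory {
  static final AtomicInteger CREATED = new AtomicInteger();

  ExpensiveFactory() {
    CREATED.incrementAndGet(); // counts how many times the costly path ran
  }

  String describe() {
    return "ready";
  }
}

/** Hypothetical servlet-like class: its constructor does not build the factory. */
final class LazyServlet {
  private final Lazy<ExpensiveFactory> factory = new Lazy<>(ExpensiveFactory::new);

  String handleRequest() {
    return factory.get().describe(); // the factory is created here, on first use
  }
}

final class LazyInitSketch {
  public static void main(String[] args) {
    LazyServlet servlet = new LazyServlet();
    System.out.println("after construction: " + ExpensiveFactory.CREATED.get());
    servlet.handleRequest();
    servlet.handleRequest();
    System.out.println("after two requests: " + ExpensiveFactory.CREATED.get());
  }
}
```

With this shape, instantiating the servlet-like object (for example on a health-check ping) never touches the expensive path; only a request that actually needs the factory pays for it, and it pays once per instance.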



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10284) Add lazy initialization of LogAggregationFileControllerFactory in LogServlet

2020-06-01 Thread Adam Antal (Jira)


[ https://issues.apache.org/jira/browse/YARN-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123361#comment-17123361 ]

Adam Antal commented on YARN-10284:
---

Fixed the last remaining checkstyle issue in patch v4.



