[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:48 AM:
-

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NM may be in an unusable state; 
continuously resending completed containers does not seem like a bad idea (or we 
could introduce a max retry count)?

 


was (Author: yuanbo):
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing, which is not correct. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NM may be in an unusable state; 
continuously resending completed containers does not seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Yuanbo Liu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194592#comment-17194592
 ] 

Yuanbo Liu commented on YARN-10393:
---

[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date 
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments remind me that I was wrong; I was thinking about 
this situation:
{quote} [heartbeat:1, container:1]->rm

nm does not get the response

[heartbeat:1, container:1, container2]->rm

rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went 
missing. Your patch makes sense to me.

 
{quote}Another question is how would these two proposals behave if the NM 
misses multiple heartbeats in a row?
{quote}
I believe if that happens, the cluster or NM may be in an unusable state; 
continuously resending completed containers does not seem like a bad idea?

 

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[jira] [Commented] (YARN-10159) TimelineConnector does not destroy the jersey client

2020-09-11 Thread Andras Bokor (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194520#comment-17194520
 ] 

Andras Bokor commented on YARN-10159:
-

For git greppers, the commit message went in with the wrong JIRA id:
{{YARN-10156. Destroy Jersey Client in TimelineConnector.}}

> TimelineConnector does not destroy the jersey client
> 
>
> Key: YARN-10159
> URL: https://issues.apache.org/jira/browse/YARN-10159
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10159-001.patch, YARN-10159-002.patch, 
> YARN-10159-branch-2.8.001.patch
>
>
> TimelineConnector does not destroy the jersey client. destroy() must be 
> called when there are no responses pending, otherwise undefined behavior will 
> occur.
> http://javadox.com/com.sun.jersey/jersey-client/1.8/com/sun/jersey/api/client/Client.html#destroy()
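
For illustration, a minimal sketch of the shutdown pattern that javadoc asks for (not the TimelineConnector code itself; the timeline URL below is only a placeholder):
{code:java}
import com.sun.jersey.api.client.Client;

public class JerseyClientShutdownSketch {
  public static void main(String[] args) {
    Client client = Client.create();
    try {
      // Issue whatever requests are needed; the URL is a made-up placeholder.
      String body = client.resource("http://localhost:8188/ws/v1/timeline/")
          .get(String.class);
      System.out.println(body.length());
    } finally {
      // Must be called once no responses are pending, per the javadoc above.
      client.destroy();
    }
  }
}
{code}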



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-950) Ability to limit or avoid aggregating logs beyond a certain size

2020-09-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194507#comment-17194507
 ] 

Eric Payne commented on YARN-950:
-

[~adam.antal], are you still planning on working on this JIRA?

bq. Ran into another case where a user filled a disk with a large 
stdout/stderr, and the NM took forever to recover the disk
bq. +1 on the idea of limiting the size of log aggregation. We should have 
some way to truncate (front & tail) the log contents if it is really large.

It seems that there are a couple of requirements for this feature.
# Prevent logs for any container from becoming larger than a configurable size.
# Truncate everything from the middle of large log files, leaving a 
configurable amount of the head and the tail.

Internally, we have implemented the former by allowing system admins to specify 
a configurable max log size per container.


> Ability to limit or avoid aggregating logs beyond a certain size
> 
>
> Key: YARN-950
> URL: https://issues.apache.org/jira/browse/YARN-950
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager
>Affects Versions: 0.23.9, 2.6.0
>Reporter: Jason Darrell Lowe
>Assignee: Adam Antal
>Priority: Major
>
> It would be nice if ops could configure a cluster such that any container log 
> beyond a configured size would either only have a portion of the log 
> aggregated or not aggregated at all.  This would help speed up the recovery 
> path for cases where a container creates an enormous log and fills a disk, as 
> currently it tries to aggregate the entire, enormous log rather than only 
> aggregating a small portion or simply deleting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-09-11 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194474#comment-17194474
 ] 

Hadoop QA commented on YARN-10421:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
5s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
30s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
46s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
46s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
56s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
28s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
33s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
50s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 
 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
14s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 
fixed = 104 total (was 236) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 15m 
26s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | 

[jira] [Created] (YARN-10436) Create RM rest api endpoint to get list of cluster configurations

2020-09-11 Thread Akhil PB (Jira)
Akhil PB created YARN-10436:
---

 Summary: Create RM rest api endpoint to get list of cluster 
configurations
 Key: YARN-10436
 URL: https://issues.apache.org/jira/browse/YARN-10436
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: yarn
Reporter: Akhil PB
Assignee: Akhil PB


The current /conf endpoint returns one config at a time, so fetching several 
configurations would require multiple api calls. This jira intends to fix this 
by adding a new rest api endpoint, /cluster-conf, where the user can send a 
queryparam like name=config_1,config_2 to get multiple configurations.
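
For example (illustrative only; the final endpoint path and parameter handling are up to the patch), a single request could fetch several properties at once:
{code}
GET /ws/v1/cluster/cluster-conf?name=yarn.resourcemanager.address,yarn.scheduler.maximum-allocation-mb
{code}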



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10436) Create RM rest api endpoint to get list of cluster configurations

2020-09-11 Thread Akhil PB (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akhil PB updated YARN-10436:

Description: 
The current /conf endpoint returns one config at a time, so fetching several 
configurations would require multiple api calls. This jira intends to fix this 
by adding a new rest api endpoint, /cluster-conf, where the user can send a 
queryparam like name=config_1,config_2 to get multiple configurations.



cc: [~sunilg]

  was:The current /conf endpoint returns one config at a time, so fetching 
several configurations would require multiple api calls. This jira intends to 
fix this by adding a new rest api endpoint, /cluster-conf, where the user can 
send a queryparam like name=config_1,config_2 to get multiple configurations.


> Create RM rest api endpoint to get list of cluster configurations
> -
>
> Key: YARN-10436
> URL: https://issues.apache.org/jira/browse/YARN-10436
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Major
>
> The current /conf endpoint returns one config at a time, so fetching several 
> configurations would require multiple api calls. This jira intends to fix this 
> by adding a new rest api endpoint, /cluster-conf, where the user can send a 
> queryparam like name=config_1,config_2 to get multiple configurations.
> cc: [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194402#comment-17194402
 ] 

Jim Brennan commented on YARN-10393:


[~yuanbo] thanks for uploading your patch and restarting the conversation.

There are two ways we can miss a heartbeat: the RM never gets it, or the NM 
doesn't get the response. Existing code already handles the first case, because 
pendingCompletedContainers doesn't get cleared due to the exception. This 
Jira is dealing with the second case, where the heartbeat after a missed 
heartbeat is seen as a duplicate by the RM, so any newly completed containers 
never get processed by the RM (because the NM doesn't know it was treated as a 
duplicate by the RM, and we clear pendingCompletedContainers).

I believe your draft patch will work and it improves on the original PR by 
limiting the change to only completed container statuses.  My one issue with it 
is that it will not return the most up to date status for containers on the 
heartbeat after the missed one.   In the case described by this Jira, where the 
RM got the request but the NM failed to get the response, this has no impact 
because this one will be treated as a duplicate by the RM.  But if the RM never 
got the last request, then we are deferring the most recent container statuses 
until the next heartbeat.  We are also deferring adding newly completed 
containers to recentlyStoppedContainers.

I put up another patch, draft.2, to show what I was proposing. The main 
downside to the draft.2 patch is that if the heartbeat after the missed one is 
not seen as a duplicate by the RM, it will resend this heartbeat's completed 
containers on the next one.

Another question is how would these two proposals behave if the NM misses 
multiple heartbeats in a row?
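
To make the failure mode concrete for readers following along, here is a deliberately simplified sketch of the exchange being discussed. This is not the real NodeStatusUpdaterImpl/ResourceTrackerService code; the responseId bookkeeping and names are reduced to their essence.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified model of the NM/RM heartbeat exchange discussed above.
class RmSide {
  int lastResponseId = 0;

  int nodeHeartbeat(int nmResponseId, List<String> completedContainers) throws IOException {
    if (nmResponseId + 1 == lastResponseId) {
      // NM never saw our last response: treat as a duplicate and ignore the
      // payload, including any newly completed containers piggybacked on it.
      return lastResponseId;
    }
    // ... process completedContainers, update scheduler state, etc. ...
    lastResponseId = nmResponseId + 1;
    return lastResponseId;
  }
}

class NmSide {
  int lastSeenResponseId = 0;
  final List<String> pendingCompletedContainers = new ArrayList<>();

  void heartbeatOnce(RmSide rm) {
    List<String> toReport = new ArrayList<>(pendingCompletedContainers);
    try {
      lastSeenResponseId = rm.nodeHeartbeat(lastSeenResponseId, toReport);
      // Cleared on apparent success -- but if the RM took this heartbeat as a
      // duplicate, the completed containers in toReport were silently dropped.
      pendingCompletedContainers.clear();
    } catch (IOException e) {
      // Request or response lost with an exception: keep the pending list so
      // the containers are resent next time (the case the existing code covers).
    }
  }
}
{code}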

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10393:
---
Attachment: YARN-10393.draft.2.patch

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application master 
> failed to preempt the reducer, thus causing the job to be stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in assigned mappers. The node manager failed to report the 
> completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the resource 
> manager; however, the node manager failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. According to our current 
> configuration, the next heartbeat will be 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at 

[jira] [Commented] (YARN-4971) RM fails to re-bind to wildcard IP after failover in multi homed clusters

2020-09-11 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194378#comment-17194378
 ] 

Wangda Tan commented on YARN-4971:
--

I think we should revisit the patch based on the comment from Karthik: 
https://issues.apache.org/jira/browse/YARN-4971?focusedCommentId=15281097=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15281097

I also don't quite understand why the following methods of ClientRMService are 
different: 

One is:
{code:java}
  InetSocketAddress getBindAddress(Configuration conf) {
    return conf.getSocketAddr(
        YarnConfiguration.RM_BIND_HOST,
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_PORT);
  }
{code}
 

And another one is: 
{code:java}
  clientBindAddress = conf.updateConnectAddr(YarnConfiguration.RM_BIND_HOST,
      YarnConfiguration.RM_ADDRESS,
      YarnConfiguration.DEFAULT_RM_ADDRESS,
      server.getListenerAddress());
{code}
 

Basically, serviceInit and serviceStart resolve the RM address differently. 
Is that a potential root cause of the problem? [~wilfreds], [~shuzirra]

> RM fails to re-bind to wildcard IP after failover in multi homed clusters
> -
>
> Key: YARN-4971
> URL: https://issues.apache.org/jira/browse/YARN-4971
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-4971.1.patch
>
>
> If the RM has {{yarn.resourcemanager.bind-host}} set to 0.0.0.0, binding to the 
> wildcard works as expected the first time the service becomes active. If the 
> service has transitioned from active to standby and then becomes active again 
> after failovers, the service only binds to one of the IP addresses.
> There is a difference between the services inside the RM: it only seems to 
> happen for the services listening on ports 8030 and 8032.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-09-11 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10421:
-
Attachment: YARN-10421.003.patch

> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch, 
> YARN-10421.003.patch
>
>
> YarnDiagnosticsServlet should run inside the ResourceManager daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc. script (a minimal 
> sketch of this pattern follows the list below). Based on the use-cases listed 
> below, the script collects information, bundles it, and sends it to UI2. The 
> diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
> _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
> (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the 
> available diagnostic options (_/common-issue/list_), and one for calling a 
> selected option with the user provided parameters (_/common-issue/collect_). 
> The service should be transparent to the script changes to help with the 
> (on-the-fly) extensibility of the diagnostic tool. To split the changes to 
> smaller chunks the implementation behind _collect_ endpoint is to be provided 
> in YARN-10433.
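
As referenced above, a minimal sketch of the "fork a script and bundle its output" pattern described in this issue. The script path and argument are hypothetical, and the real servlet/endpoint wiring is left to the patches on this Jira.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiagnosticScriptRunnerSketch {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Hypothetical diagnostic script and option name.
    ProcessBuilder pb = new ProcessBuilder(
        "/opt/yarn-diagnostics/collect.sh", "application-hanging");
    pb.redirectErrorStream(true);                 // fold stderr into stdout
    Process process = pb.start();
    byte[] output = process.getInputStream().readAllBytes();
    int exitCode = process.waitFor();

    // Bundle the collected output somewhere the UI (UI2) could later fetch it.
    Path bundle = Files.createTempFile("yarn-diagnostics-", ".txt");
    Files.write(bundle, output);
    System.out.println("script exit=" + exitCode + ", bundle written to " + bundle);
  }
}
{code}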



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10435) Upgrade ui2 framework ember 2.x to ember 3.20

2020-09-11 Thread Akhil PB (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akhil PB updated YARN-10435:

Summary: Upgrade ui2 framework ember 2.x to ember 3.20  (was: Upgrade ember 
2.x to ember 3.20)

> Upgrade ui2 framework ember 2.x to ember 3.20
> -
>
> Key: YARN-10435
> URL: https://issues.apache.org/jira/browse/YARN-10435
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-ui-v2
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Major
>
> This is the full rewrite of the current YARN UI2 framework ember 2.x to ember 
> 3.20. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-09-11 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194282#comment-17194282
 ] 

Hadoop QA commented on YARN-10421:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
29s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
49s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 36s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
45s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
31s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
29s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
19s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
44s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
44s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 57s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server: The patch generated 2 new + 
16 unchanged - 0 fixed = 18 total (was 16) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 
 0s{color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 
15s{color} | {color:green} The patch generated 0 new + 104 unchanged - 132 
fixed = 104 total (was 236) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red} 15m 
41s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed with JDK 

[jira] [Commented] (YARN-10390) LeafQueue: retain user limits cache across assignContainers() calls

2020-09-11 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194279#comment-17194279
 ] 

Eric Payne commented on YARN-10390:
---

Thanks [~samkhan] for this important performance enhancement.
 +1. LGTM. I'll be committing this shortly.

> LeafQueue: retain user limits cache across assignContainers() calls
> ---
>
> Key: YARN-10390
> URL: https://issues.apache.org/jira/browse/YARN-10390
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Muhammad Samir Khan
>Assignee: Muhammad Samir Khan
>Priority: Major
> Attachments: YARN-10390-branch-3.1.002.patch, 
> YARN-10390-branch-3.2.002.patch, YARN-10390.002.patch, user limit caching 
> profile.pdf
>
>
> Currently, user limits are cached locally in leafQueue.assignContainers call 
> to avoid repeating some steps. This cache can be retained across the calls.
> Will put up a PR soon. Profiling was done using the proposed changes in 
> TestCapacitySchedulerPerf.
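
As a rough illustration of the idea (simplified; this is not the LeafQueue code, and the cache key and invalidation details here are assumptions), retaining the computed limit in a member map rather than a per-call local might look like:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of keeping a computed per-user limit across assignContainers() calls
// instead of recomputing it on every call. All names are illustrative only.
class UserLimitCacheSketch {
  private final Map<String, Long> cachedUserLimit = new ConcurrentHashMap<>();

  long getUserLimit(String user) {
    // Reuses a previously computed value across calls; a real implementation
    // must also invalidate when the limit can change (apps added/removed,
    // queue capacity or user-limit-factor updates, node additions, etc.).
    return cachedUserLimit.computeIfAbsent(user, this::computeUserLimitExpensively);
  }

  void invalidateAll() {
    cachedUserLimit.clear();
  }

  private long computeUserLimitExpensively(String user) {
    // Stand-in for the expensive limit computation done by the scheduler.
    return 1024L;
  }
}
{code}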



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10421) Create YarnDiagnosticsService to serve diagnostic queries

2020-09-11 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10421:
-
Attachment: YARN-10421.002.patch

> Create YarnDiagnosticsService to serve diagnostic queries 
> --
>
> Key: YARN-10421
> URL: https://issues.apache.org/jira/browse/YARN-10421
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10421.001.patch, YARN-10421.002.patch
>
>
> YarnDiagnosticsServlet should run inside the ResourceManager daemon. The servlet 
> forks a separate process, which executes a shell/Python/etc. script. Based on 
> the use-cases listed below, the script collects information, bundles it, and 
> sends it to UI2. The diagnostic options are the following:
>  # Application hanging: 
>  ** Application logs
>  ** Find the hanging container and get multiple Jstacks
>  ** ResourceManager logs during job lifecycle
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL
>  # Application failed: 
>  ** Application logs
>  ** ResourceManager logs during job lifecycle.
>  ** NodeManager logs from NodeManager where the hanging containers of the 
> jobs ran
>  ** Job Configuration from MapReduce HistoryServer, Spark HistoryServer, Tez 
> History URL.
>  ** Job related metrics like container, attempts.
>  # Scheduler related issue:
>  ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes.
>  ** Multiple Jstacks of ResourceManager
>  ** YARN and Scheduler Configuration
>  ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API 
> _/ws/v1/cluster/nodes response_
>  ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response 
> (YARN-10319)
>  # ResourceManager / NodeManager daemon fails to start:
>  ** ResourceManager and NodeManager out and log file
>  ** YARN and Scheduler Configuration
> Two new endpoints should be added to the RM web service: one for listing the 
> available diagnostic options (_/common-issue/list_), and one for calling a 
> selected option with the user provided parameters (_/common-issue/collect_). 
> The service should be transparent to the script changes to help with the 
> (on-the-fly) extensibility of the diagnostic tool. To split the changes to 
> smaller chunks the implementation behind _collect_ endpoint is to be provided 
> in YARN-10433.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9333) TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes fails intermittent

2020-09-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194046#comment-17194046
 ] 

Peter Bacsko commented on YARN-9333:


[~adam.antal], [~snemeth], [~prabhujoseph] can we agree that the (too low) update 
interval was the root cause, and that raising it is the fix?

I don't have hard proof, because the logs can't be analyzed if it's set to 
10ms (Jenkins truncates the output) and I can't repro it locally. But since 
this value was raised, the problem seems to have disappeared.

IMO even if we don't have a 100% proof, I think we can commit this and see what 
happens.

> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent
> --
>
> Key: YARN-9333
> URL: https://issues.apache.org/jira/browse/YARN-9333
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9333-001.patch, YARN-9333-002.patch, 
> YARN-9333-003.patch, YARN-9333-debug1.patch
>
>
> TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes
>  fails intermittent - observed in YARN-9311.
> {code}
> [ERROR] 
> testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes[MinSharePreemptionWithDRF](org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption)
>   Time elapsed: 11.056 s  <<< FAILURE!
> java.lang.AssertionError: Incorrect # of containers on the greedy app 
> expected:<6> but was:<4>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyPreemption(TestFairSchedulerPreemption.java:296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.verifyRelaxLocalityPreemption(TestFairSchedulerPreemption.java:537)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption.testRelaxLocalityPreemptionWithNoLessAMInRemainingNodes(TestFairSchedulerPreemption.java:473)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
>