[jira] [Commented] (YARN-10616) Nodemanagers cannot detect GPU failures

2021-03-15 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302164#comment-17302164
 ] 

Zhankun Tang commented on YARN-10616:
-

[~ebadger], Thanks for picking this up. YARN-8823 considered this 
long ago. My gut feeling is that we should go with the heartbeat approach.
For the "updateNodeResource" issue, one question is whether it is a frequently 
used operation. I'm not aware of a scenario where we use it often.

> Nodemanagers cannot detect GPU failures
> ---
>
> Key: YARN-10616
> URL: https://issues.apache.org/jira/browse/YARN-10616
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>
> As stated above, the bug is that GPUs can fail, but the NM doesn't notice the 
> failure. The NM will continue to schedule tasks onto the failed GPU, but the 
> GPU won't actually work and so the container will likely fail or run very 
> slowly on the CPU. 
> My initial thought on solving this is to add NM resource capabilities to the 
> NM-RM heartbeat and have the RM update its view of the NM's resource 
> capabilities on each heartbeat. This would be a fairly trivial change, but 
> comes with the unfortunate side effect that it completely undermines {{yarn 
> rmadmin -updateNodeResource}}. When you run {{-updateNodeResource}}, the 
> assumption is that the node will retain these new resource capabilities until 
> either the NM or RM is restarted. But with a heartbeat interaction constantly 
> updating those resource capabilities from the NM's perspective, the explicit 
> changes via {{-updateNodeResource}} would be lost on the next heartbeat. We 
> could potentially add a flag to ignore the heartbeat updates for any node that 
> has had {{-updateNodeResource}} called on it (until a re-registration). But 
> in that case, the node would no longer get resource capability updates until 
> the NM or RM restarted. If {{-updateNodeResource}} is used a decent amount, 
> then that could lead to unexpected behavior with respect to nodes 
> properly auto-detecting failures.
> Another idea is to add a GPU monitor thread on the NM to periodically run 
> {{nvidia-smi}} and detect changes in the number of healthy GPUs. If that 
> number decreased, the node would hook into the health check status and mark 
> itself as unhealthy. The downside of this approach is that a single failed 
> GPU would mean taking out an entire node (e.g. 8 GPUs).
> I would really like to go with the NM-RM heartbeat approach, but the 
> {{-updateNodeResource}} issue bothers me. The second approach is OK, I guess, 
> but I also don't like taking down whole GPU nodes when only a single GPU is 
> bad. I would like to hear others' thoughts on how best to approach this.
> [~jhung], [~leftnoteasy], [~sunilg], [~epayne], [~Jim_Brennan]
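
To make the second option a bit more concrete, here is a minimal, hypothetical sketch of an NM-side monitor thread that periodically counts healthy GPUs via {{nvidia-smi -L}} and flips an unhealthy flag when the count drops. The class, wiring, and how it would hook into the NM health check are illustrative only (not an existing NM API), and it assumes nvidia-smi is on the PATH.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: poll nvidia-smi and report unhealthy if GPUs disappear.
public class GpuHealthMonitor implements Runnable {
  private final int expectedGpus;
  private final long intervalMs;
  private final AtomicBoolean healthy = new AtomicBoolean(true);

  public GpuHealthMonitor(int expectedGpus, long intervalMs) {
    this.expectedGpus = expectedGpus;
    this.intervalMs = intervalMs;
  }

  public boolean isHealthy() {
    return healthy.get();
  }

  private int countHealthyGpus() throws Exception {
    // "nvidia-smi -L" prints one line per visible GPU.
    Process p = new ProcessBuilder("nvidia-smi", "-L").start();
    int count = 0;
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream()))) {
      while (r.readLine() != null) {
        count++;
      }
    }
    p.waitFor();
    return count;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Mark the node unhealthy once fewer GPUs than expected are reported.
        healthy.set(countHealthyGpus() >= expectedGpus);
        Thread.sleep(intervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      } catch (Exception e) {
        healthy.set(false);
      }
    }
  }
}
{code}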



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9650) Set thread names for CapacityScheduler AsyncScheduleThread

2021-02-08 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-9650.

Fix Version/s: 3.4.0
   Resolution: Fixed

> Set thread names for CapacityScheduler AsyncScheduleThread
> --
>
> Key: YARN-9650
> URL: https://issues.apache.org/jira/browse/YARN-9650
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Amogh Rajesh Desai
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9650) Set thread names for CapacityScheduler AsyncScheduleThread

2021-02-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281495#comment-17281495
 ] 

Zhankun Tang commented on YARN-9650:


[~zhuqi], Thanks for the review. [~amoghdesai], thanks for the contribution. 
Merged to trunk.

> Set thread names for CapacityScheduler AsyncScheduleThread
> --
>
> Key: YARN-9650
> URL: https://issues.apache.org/jira/browse/YARN-9650
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Amogh Rajesh Desai
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10610) Add queuePath to restful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-04 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279391#comment-17279391
 ] 

Zhankun Tang commented on YARN-10610:
-

Thanks for the contribution, [~Qi Zhu]. Please fix the new checkstyle issues. 
Apart from that, it LGTM. +1

> Add queuePath to restful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9650) Set thread names for CapacityScheduler AsyncScheduleThread

2021-02-04 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279304#comment-17279304
 ] 

Zhankun Tang commented on YARN-9650:


[~amoghdesai]
Thanks for the contribution. It looks good to me. +1
[~bibinchundatt], could you please take a look at this? Thanks!
https://github.com/apache/hadoop/pull/2665

> Set thread names for CapacityScheduler AsyncScheduleThread
> --
>
> Key: YARN-9650
> URL: https://issues.apache.org/jira/browse/YARN-9650
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin Chundatt
>Assignee: Amogh Rajesh Desai
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10589) Improve logic of multi-node allocation

2021-02-02 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276973#comment-17276973
 ] 

Zhankun Tang edited comment on YARN-10589 at 2/2/21, 10:02 AM:
---

[~zhuqi], Thanks a lot for the review!
[~tanu.ajmera], are we sure that PARTITION_SKIPPED only represents a 
partition mismatch? It seems the reason could be a placement-rule mismatch too. 
See "NODE_DO_NOT_MATCH_PARTITION_OR_PLACEMENT_CONSTRAINTS":

{code:java}

if (!appInfo.precheckNode(schedulerKey, node, schedulingMode, dcOpt)) {
  ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
      activitiesManager, node, application, schedulerKey,
      ActivityDiagnosticConstant.
          NODE_DO_NOT_MATCH_PARTITION_OR_PLACEMENT_CONSTRAINTS
          + ActivitiesManager.getDiagnostics(dcOpt),
      ActivityLevel.NODE);
  return ContainerAllocation.PARTITION_SKIPPED;
}
{code}



was (Author: ztang):
[~zhuqi], Thanks a lot for the review!
[~tanu.ajmera], I'm not very clear on what we are doing now. When we change 
PRIORITY_SKIPPED to PARTITION_SKIPPED, what's the difference compared to using 
PRIORITY_SKIPPED to skip the node iteration?

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch, YARN-10589-002.patch, 
> YARN-10589-003.patch
>
>
> {code:java}
> for (String partititon : partitions) {
>  if (current++ > start) {
>  break;
>  }
>  CandidateNodeSet candidates =
>  cs.getCandidateNodeSet(partititon);
>  if (candidates == null) {
>  continue;
>  }
>  cs.allocateContainersToNode(candidates, false);
> }{code}
> In the above logic, if we have thousands of nodes in one partition, we will still 
> repeatedly access all nodes of the partition thousands of times. There is no 
> break point: if the partition does not match for the first node, it should 
> stop checking the other nodes in that partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10589) Improve logic of multi-node allocation

2021-02-02 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276973#comment-17276973
 ] 

Zhankun Tang commented on YARN-10589:
-

[~zhuqi], Thanks a lot for the review!
[~tanu.ajmera], I'm not very clear on what we are doing now. When we change 
PRIORITY_SKIPPED to PARTITION_SKIPPED, what's the difference compared to using 
PRIORITY_SKIPPED to skip the node iteration?

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch, YARN-10589-002.patch
>
>
> {code:java}
> for (String partititon : partitions) {
>  if (current++ > start) {
>  break;
>  }
>  CandidateNodeSet candidates =
>  cs.getCandidateNodeSet(partititon);
>  if (candidates == null) {
>  continue;
>  }
>  cs.allocateContainersToNode(candidates, false);
> }{code}
> In the above logic, if we have thousands of nodes in one partition, we will still 
> repeatedly access all nodes of the partition thousands of times. There is no 
> break point: if the partition does not match for the first node, it should 
> stop checking the other nodes in that partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10589) Improve logic of multi-node allocation

2021-02-01 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276170#comment-17276170
 ] 

Zhankun Tang commented on YARN-10589:
-

[~zhuqi], could you please review Tanu's patch too?

> Improve logic of multi-node allocation
> --
>
> Key: YARN-10589
> URL: https://issues.apache.org/jira/browse/YARN-10589
> Project: Hadoop YARN
>  Issue Type: Task
>Affects Versions: 3.3.0
>Reporter: Tanu Ajmera
>Assignee: Tanu Ajmera
>Priority: Major
> Attachments: YARN-10589-001.patch
>
>
> {code:java}
> for (String partititon : partitions) {
>  if (current++ > start) {
>  break;
>  }
>  CandidateNodeSet candidates =
>  cs.getCandidateNodeSet(partititon);
>  if (candidates == null) {
>  continue;
>  }
>  cs.allocateContainersToNode(candidates, false);
> }{code}
> In the above logic, if we have thousands of nodes in one partition, we will still 
> repeatedly access all nodes of the partition thousands of times. There is no 
> break point: if the partition does not match for the first node, it should 
> stop checking the other nodes in that partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement

2021-01-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17274196#comment-17274196
 ] 

Zhankun Tang commented on YARN-10352:
-

Sorry for the late reply. Thanks for the contribution [~prabhujoseph] and 
[~zhuqi].
And thanks for the review! [~bibinchundatt]

The "filteringIterator" costs me quite a while to understand. Especially the 
confusing "cache" filed member. It's better to call it "nextObject".

 But that's a minor suggestion. +1 from me.
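
For what it's worth, the naming suggestion above is about a look-ahead iterator like the following standalone sketch (illustrative only, not the actual multi-node sorter code), where the pre-fetched element is held in a field named {{nextObject}} rather than {{cache}}:

{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Standalone sketch of a filtering iterator; "nextObject" holds the
// pre-fetched element that already passed the filter.
public class FilteringIterator<T> implements Iterator<T> {
  private final Iterator<T> delegate;
  private final Predicate<T> filter;
  private T nextObject;

  public FilteringIterator(Iterator<T> delegate, Predicate<T> filter) {
    this.delegate = delegate;
    this.filter = filter;
  }

  @Override
  public boolean hasNext() {
    while (nextObject == null && delegate.hasNext()) {
      T candidate = delegate.next();
      if (filter.test(candidate)) {
        nextObject = candidate;
      }
    }
    return nextObject != null;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = nextObject;
    nextObject = null;
    return result;
  }
}
{code}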

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --
>
> Key: YARN-10352
> URL: https://issues.apache.org/jira/browse/YARN-10352
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>  Labels: capacityscheduler, multi-node-placement
> Attachments: YARN-10352-001.patch, YARN-10352-002.patch, 
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, 
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, 
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM won't unregister it from the RM, so 
> the RM's active node list will still contain those stopped nodes until the NM Liveliness 
> Monitor expires them after the configured timeout 
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 minutes, 
> Multi Node Placement assigns containers to those nodes. It needs to 
> exclude nodes which have not heartbeated within the configured heartbeat interval 
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to the 
> Asynchronous Capacity Scheduler threads 
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement 
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery 
> Enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job. The containers will time out because they are assigned to the 
> stopped NM worker0.
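
A minimal sketch of the skip condition described above, assuming a node that exposes its last heartbeat timestamp (this is not the exact {{CapacityScheduler#shouldSkipNodeSchedule}} code, just the idea):

{code:java}
// Illustrative only: skip nodes whose last heartbeat is older than the
// configured NM heartbeat interval, so multi-node placement ignores them.
public final class HeartbeatStaleness {
  private HeartbeatStaleness() {
  }

  public static boolean shouldSkipNode(long lastHeartbeatTimeMs,
                                       long heartbeatIntervalMs,
                                       long nowMs) {
    return nowMs - lastHeartbeatTimeMs > heartbeatIntervalMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // A node that last heartbeated 10 minutes ago with a 1s interval is skipped.
    System.out.println(shouldSkipNode(now - 600_000L, 1_000L, now));
  }
}
{code}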



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-12-20 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252569#comment-17252569
 ] 

Zhankun Tang commented on YARN-10463:
-

[~BilwaST] Thanks for the review.
[~zhuqi] Thanks for the contribution.
I've merged this into trunk.

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch, YARN-10463.005.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-12-20 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-10463:

Fix Version/s: 3.4.0

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch, YARN-10463.005.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-12-17 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251497#comment-17251497
 ] 

Zhankun Tang commented on YARN-10463:
-

[~zhuqi], I triggered a new CI run and it failed. I guess it needs a rebase onto the 
latest trunk. Could you please rebase it and trigger the CI again?

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-12-10 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247657#comment-17247657
 ] 

Zhankun Tang commented on YARN-10463:
-

[~zhuqi], Thanks for the contribution.
[~BilwaST], I can help push this if that's OK with you.

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-09 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246473#comment-17246473
 ] 

Zhankun Tang commented on YARN-10380:
-

[~jiwq] Thanks for the review!
[~zhuqi] Thanks for the contribution! I've merged the patch from Github.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
>  Labels: pull-request-available
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, 
> async-scheduling (single node), and async-scheduling (multi-node) 
> different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  
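
To make the partition-first proposal above a bit more concrete, here is a standalone sketch. The {{Scheduler}} interface below is a stand-in for illustration only (not the real CapacityScheduler API), and {{allocateContainersOnMultiNodes}} simply mirrors the pseudo code:

{code:java}
import java.util.List;

// Standalone illustration of a partition-first multi-node entry point.
public class PartitionFirstScheduling {
  // Stand-in interface; not the real CapacityScheduler API.
  interface Scheduler {
    List<String> getCandidateNodes(String partition);
    void allocateContainersOnMultiNodes(List<String> candidates);
  }

  static void schedule(Scheduler cs, Iterable<String> partitions) {
    for (String partition : partitions) {
      List<String> candidates = cs.getCandidateNodes(partition);
      if (candidates == null || candidates.isEmpty()) {
        continue;
      }
      // One allocation pass per partition's candidate set, instead of one
      // pass per node of every partition.
      cs.allocateContainersOnMultiNodes(candidates);
    }
  }
}
{code}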



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-01 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241360#comment-17241360
 ] 

Zhankun Tang commented on YARN-10380:
-

[~zhuqi], it should be no problem to merge it if you've tested it manually through 
logging or some other means.
If not, we could write a basic unit test for this.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, 
> async-scheduling (single node), and async-scheduling (multi-node) 
> different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-30 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241243#comment-17241243
 ] 

Zhankun Tang commented on YARN-10380:
-

[~zhuqi], Thanks a lot for the contribution! It looks good to me. Could you 
please fix the checkstyle issues as described here?
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2494/2/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt

And one question: since we don't have a unit test here, did we verify 
the new call path when multi-node is enabled?

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, 
> async-scheduling (single node), and async-scheduling (multi-node) 
> different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path

2020-07-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153405#comment-17153405
 ] 

Zhankun Tang commented on YARN-10333:
-

It LGTM. +1. Thanks for your contribution! [~prabhujoseph], [~sunilg]

> YarnClient obtain Delegation Token for Log Aggregation Path
> ---
>
> Key: YARN-10333
> URL: https://issues.apache.org/jira/browse/YARN-10333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10333-001.patch, YARN-10333-002.patch, 
> YARN-10333-003.patch
>
>
> There are use cases where the YARN log aggregation path is configured on a 
> FileSystem, such as S3 or ABFS, different from what is configured in fs.defaultFS 
> (HDFS). Log aggregation fails because the client has a token only for fs.defaultFS 
> and not for the log aggregation path.
> This Jira is to improve YarnClient by obtaining a delegation token for the log 
> aggregation path and adding it to the Credentials of the Container Launch Context, 
> similar to how it does for the Timeline Delegation Token.
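
Roughly, the improvement described above amounts to something like the client-side sketch below. This is a hedged sketch, not the actual patch: where it lands in YarnClientImpl, the renewer handling, and merging with credentials already attached to the launch context may all differ.

{code:java}
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: obtain a delegation token for the remote log-aggregation FileSystem
// (which may differ from fs.defaultFS) and attach it to the AM launch context.
public class LogAggregationTokenHelper {
  public static void addLogAggregationToken(Configuration conf,
      ContainerLaunchContext amContainer, String renewer) throws Exception {
    Credentials credentials = new Credentials();

    // Resolve the configured remote log dir and its backing FileSystem
    // (e.g. S3A or ABFS instead of HDFS).
    Path remoteLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR));
    FileSystem logFs = remoteLogDir.getFileSystem(conf);
    logFs.addDelegationTokens(renewer, credentials);

    // Serialize the credentials into the launch context; real client code
    // would merge these with any tokens already set on the context.
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    amContainer.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
{code}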



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist

2020-06-04 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126357#comment-17126357
 ] 

Zhankun Tang commented on YARN-10307:
-

[~appleyuchi], IIRC, I don't think "Hive on Tez" depends on the timeline 
service. It seems more like an installation issue.

> /leveldb-timeline-store.ldb/LOCK not exist
> --
>
> Key: YARN-10307
> URL: https://issues.apache.org/jira/browse/YARN-10307
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>Reporter: appleyuchi
>Priority: Blocker
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>  
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>  
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
> /home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:
>  No such file or directory
>  at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>  at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>  at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,525 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state 
> INITED
>  java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
>  at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
>  at 
> org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  2020-06-04 17:48:21,526 INFO [main] service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>  failed in state INITED
>  org.apache.hadoop.service.ServiceStateException: 
> java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>  at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
>  at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
>  Caused by: java.io.FileNotFoundException: Source 
> 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb'
>  does not exist
>  at 

[jira] [Commented] (YARN-10302) Support custom packing algorithm for FairScheduler

2020-06-01 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121484#comment-17121484
 ] 

Zhankun Tang commented on YARN-10302:
-

[~billgraham], thanks for the contribution. Could you please generate a patch 
"git diff trunk...HEAD > YARN-10302-trunk.001.patch", upload it and click 
"submitPatch" to trigger the CI?

> Support custom packing algorithm for FairScheduler
> --
>
> Key: YARN-10302
> URL: https://issues.apache.org/jira/browse/YARN-10302
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: William W. Graham Jr
>Priority: Major
>
> The {{FairScheduler}} class allocates containers to the node with the most 
> available memory[0]. Create the ability to instead configure a 
> custom packing algorithm with different logic. For instance, for effective 
> auto scaling, a bin packing algorithm might be a better choice.
> 0 - 
> https://github.com/apache/hadoop/blob/56b7571131b0af03b32bf1c5673c32634652df21/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1034-L1043
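
To illustrate the kind of pluggable policy being asked for, here is a standalone sketch over a stand-in node type (not the real {{FSSchedulerNode}} wiring): a bin-packing choice prefers the node with the least available memory that still fits the request, whereas the current behavior prefers the node with the most available memory.

{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Stand-in node type for illustration only; not the real FSSchedulerNode.
public class PackingPolicyExample {
  static class Node {
    final String name;
    final long availableMemoryMb;

    Node(String name, long availableMemoryMb) {
      this.name = name;
      this.availableMemoryMb = availableMemoryMb;
    }
  }

  // Bin packing: among the nodes that can fit the request, choose the
  // tightest fit so lightly used nodes stay empty and can be scaled down.
  static Optional<Node> pickNode(List<Node> nodes, long requestMb) {
    return nodes.stream()
        .filter(n -> n.availableMemoryMb >= requestMb)
        .min(Comparator.comparingLong((Node n) -> n.availableMemoryMb));
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("a", 4096), new Node("b", 1024), new Node("c", 8192));
    // Prints "b": the smallest node that still fits a 1 GB request.
    System.out.println(pickNode(nodes, 1024).map(n -> n.name).orElse("none"));
  }
}
{code}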



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10248) when config allowed-gpu-devices , excluded GPUs still be visible to containers

2020-05-12 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105933#comment-17105933
 ] 

Zhankun Tang commented on YARN-10248:
-

[~jasstionzyf], do you mean the existing test case 
"testAllocationWithoutAllowedGpus" fails but is not related to our changes?

> when config allowed-gpu-devices , excluded GPUs still be visible to containers
> --
>
> Key: YARN-10248
> URL: https://issues.apache.org/jira/browse/YARN-10248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: zhao yufei
>Assignee: zhao yufei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.2.1
>
> Attachments: YARN-10248-branch-3.2.001.path, 
> YARN-10248-branch-3.2.001.path
>
>
> I have a server with two GPUs, and I want to use only one of them within the YARN 
> cluster.
> According to the Hadoop documentation, I set these configs:
> {code:java}
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
>   <value>0:1</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
>   <value>/etc/alternatives/x86_64-linux-gnu_nvidia_smi</value>
> </property>
> {code}
> Then I ran the following command to test:
> {code:java}
> yarn jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar \
>  -jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar  
> -shell_command ' nvidia-smi & sleep 3  ' \
>  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=1  \
>  -num_containers 1 -queue yufei -node_label_expression slaves
> {code}
> I expected the GPU with minor number 0 to not be visible to the container, but in the 
> launched container, nvidia-smi printed information for both GPUs.
> I checked the related source code and found it is a bug.
> The problem is:
> when you specify allowed-gpu-devices, GpuDiscoverer will populate the usable GPUs 
> from it, 
> then, when some of the GPUs are assigned to a container, it will set the denied GPUs for 
> the container,
> but it never considers the excluded GPUs of the host. 
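
One way to read the fix described above is the set arithmetic in the standalone sketch below (GPU minor numbers only; illustrative, not the actual GPU resource-handler code): the devices denied to a container must be computed from all GPUs present on the host, not only from the allowed list.

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone illustration of the denied-device computation by GPU minor number.
public class DeniedGpusExample {
  static Set<Integer> deniedGpus(Set<Integer> allOnHost,
                                 Set<Integer> assignedToContainer) {
    // Deny everything the host has except what was assigned to the container,
    // so GPUs excluded from allowed-gpu-devices are hidden as well.
    Set<Integer> denied = new HashSet<>(allOnHost);
    denied.removeAll(assignedToContainer);
    return denied;
  }

  public static void main(String[] args) {
    Set<Integer> allOnHost = new HashSet<>(Arrays.asList(0, 1));
    Set<Integer> assigned = new HashSet<>(Arrays.asList(1));
    // Prints [0]: minor number 0 stays invisible to the container.
    System.out.println(deniedGpus(allOnHost, assigned));
  }
}
{code}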



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10248) when config allowed-gpu-devices , excluded GPUs still be visible to containers

2020-04-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095035#comment-17095035
 ] 

Zhankun Tang commented on YARN-10248:
-

[~jasstionzyf], Thanks for the contribution! Hadoop's GitHub integration is not 
good enough yet because of the CI/CD setup.

Could you please generate a patch using "git diff branch-3.2...HEAD > 
YARN-10248-branch-3.2.001.path" and upload it here and click "submitPatch" to 
trigger the CI/CD?

> when config allowed-gpu-devices , excluded GPUs still be visible to containers
> --
>
> Key: YARN-10248
> URL: https://issues.apache.org/jira/browse/YARN-10248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: zhao yufei
>Assignee: zhao yufei
>Priority: Minor
>  Labels: pull-request-available
>
> I have a server with two GPUs, and I want to use only one of them within the YARN 
> cluster.
> According to the Hadoop documentation, I set these configs:
> {code:java}
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
>   <value>0:1</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
>   <value>/etc/alternatives/x86_64-linux-gnu_nvidia_smi</value>
> </property>
> {code}
> Then I ran the following command to test:
> {code:java}
> yarn jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar \
>  -jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar  
> -shell_command ' nvidia-smi & sleep 3  ' \
>  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=1  \
>  -num_containers 1 -queue yufei -node_label_expression slaves
> {code}
> I expected the GPU with minor number 0 to not be visible to the container, but in the 
> launched container, nvidia-smi printed information for both GPUs.
> I checked the related source code and found it is a bug.
> The problem is:
> when you specify allowed-gpu-devices, GpuDiscoverer will populate the usable GPUs 
> from it, 
> then, when some of the GPUs are assigned to a container, it will set the denied GPUs for 
> the container,
> but it never considers the excluded GPUs of the host. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10248) when config allowed-gpu-devices , excluded GPUs still be visible to containers

2020-04-28 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-10248:
---

Assignee: zhao yufei

> when config allowed-gpu-devices , excluded GPUs still be visible to containers
> --
>
> Key: YARN-10248
> URL: https://issues.apache.org/jira/browse/YARN-10248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.1
>Reporter: zhao yufei
>Assignee: zhao yufei
>Priority: Minor
>  Labels: pull-request-available
>
> I have a server with two GPUs, and I want to use only one of them within the YARN 
> cluster.
> According to the Hadoop documentation, I set these configs:
> {code:java}
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
>   <value>0:1</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
>   <value>/etc/alternatives/x86_64-linux-gnu_nvidia_smi</value>
> </property>
> {code}
> Then I ran the following command to test:
> {code:java}
> yarn jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar \
>  -jar 
> ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.2.1.jar  
> -shell_command ' nvidia-smi & sleep 3  ' \
>  -container_resources memory-mb=3072,vcores=1,yarn.io/gpu=1  \
>  -num_containers 1 -queue yufei -node_label_expression slaves
> {code}
> I expected the GPU with minor number 0 to not be visible to the container, but in the 
> launched container, nvidia-smi printed information for both GPUs.
> I checked the related source code and found it is a bug.
> The problem is:
> when you specify allowed-gpu-devices, GpuDiscoverer will populate the usable GPUs 
> from it, 
> then, when some of the GPUs are assigned to a container, it will set the denied GPUs for 
> the container,
> but it never considers the excluded GPUs of the host. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10225) Support of AMD ROCm GPUs in Yarn

2020-04-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078340#comment-17078340
 ] 

Zhankun Tang commented on YARN-10225:
-

Not sure if YARN-8851 can help here. You can try to write a plugin for AMD GPU.

> Support of AMD ROCm GPUs in Yarn
> 
>
> Key: YARN-10225
> URL: https://issues.apache.org/jira/browse/YARN-10225
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Luca Toscano
>Priority: Major
>
> Hi!
> I just watched [1] and it seems that Hops supports AMD GPUs natively in YARN, 
> so I am wondering if there are any plans for Hadoop to do the same. I work at the 
> Wikimedia Foundation and we are currently using AMD GPUs; it would be really 
> great to have support for them in Hadoop 3.x. 
> [1][ 
> https://databricks.com/session/rocm-and-distributed-deep-learning-on-spark-and-tensorflow|https://databricks.com/session/rocm-and-distributed-deep-learning-on-spark-and-tensorflow]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-24 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066316#comment-17066316
 ] 

Zhankun Tang commented on YARN-10200:
-

[~jhung], Thanks for the update. Looks better now. +1.

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch, 
> YARN-10200.003.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10200) Add number of containers to RMAppManager summary

2020-03-24 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065399#comment-17065399
 ] 

Zhankun Tang commented on YARN-10200:
-

[~jhung], Thanks for the patch. +1 from me. Just one minor doubt. Is it better 
to rename the variable "numTotalContainers" to "totalAllocatedContainers" in 
case we have more statistics of the containers?

> Add number of containers to RMAppManager summary
> 
>
> Key: YARN-10200
> URL: https://issues.apache.org/jira/browse/YARN-10200
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10200.001.patch, YARN-10200.002.patch
>
>
> It would be useful to persist this so we can track containers processed by RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA

2020-01-21 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020046#comment-17020046
 ] 

Zhankun Tang commented on YARN-9605:


[~cane], let me trigger again. Yeah. It seems the cc WARNING is not related.

> Add ZkConfiguredFailoverProxyProvider for RM HA
> ---
>
> Key: YARN-9605
> URL: https://issues.apache.org/jira/browse/YARN-9605
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-9605.001.patch, YARN-9605.002.patch, 
> YARN-9605.003.patch, YARN-9605.004.patch, YARN-9605.005.patch, 
> YARN-9605.006.patch
>
>
> In this issue, i will track a new feature to support 
> ZkConfiguredFailoverProxyProvider for RM HA



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8851) [Umbrella] A pluggable device plugin framework to ease vendor plugin development

2020-01-08 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8851.

Fix Version/s: 3.3.0
   Resolution: Fixed

> [Umbrella] A pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, 
> YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch, 
> YARN-8851-WIP7-trunk.001.patch, YARN-8851-WIP8-trunk.001.patch, 
> YARN-8851-WIP9-trunk.001.patch, YARN-8851-trunk.001.patch, 
> YARN-8851-trunk.002.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-4.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals. And this puts a 
> burden on the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.
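
As a flavour of what "ease vendor plugin development" means here, a vendor-facing surface could look roughly like the standalone sketch below. The interface and method names are illustrative only; the real plugin API is defined by the patches attached to this umbrella and may differ in names and signatures.

{code:java}
import java.util.Set;

// Illustrative-only sketch of a vendor device plugin surface.
public interface VendorDevicePlugin {
  /** A resource name such as "vendor.com/gpu" that the NM will advertise. */
  String getResourceName();

  /** Discover the devices present on this host, e.g. by shelling out to a
   *  vendor CLI; identified here simply by device minor numbers. */
  Set<Integer> discoverDevices();

  /** Called when devices are assigned to a container, so the plugin can
   *  prepare isolation (cgroup entries, environment variables, mounts). */
  void onDevicesAllocated(String containerId, Set<Integer> deviceMinors);

  /** Called when the container finishes and the devices are returned. */
  void onDevicesReleased(String containerId, Set<Integer> deviceMinors);
}
{code}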



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8851) [Umbrella] A pluggable device plugin framework to ease vendor plugin development

2020-01-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010470#comment-17010470
 ] 

Zhankun Tang commented on YARN-8851:


[~brahmareddy], thanks for planning the 3.3.0 release. Yeah. Let me close this 
Jira and move the remaining JIRAs out.

> [Umbrella] A pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, 
> YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch, 
> YARN-8851-WIP7-trunk.001.patch, YARN-8851-WIP8-trunk.001.patch, 
> YARN-8851-WIP9-trunk.001.patch, YARN-8851-trunk.001.patch, 
> YARN-8851-trunk.002.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-4.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals. And this puts a 
> burden on the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10048) NodeManager fails to start after mounting CGroup

2019-12-19 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000594#comment-17000594
 ] 

Zhankun Tang commented on YARN-10048:
-

[~Sen Zhao], thanks for catching this. Let me make sure I understand: there's a 
mismatch between the found controller path and the configured value when there are 
multiple paths under the cpu subsystem?

And could you please also show the error message when the NM fails to start? Thanks!

> NodeManager fails to start after mounting CGroup
> 
>
> Key: YARN-10048
> URL: https://issues.apache.org/jira/browse/YARN-10048
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Sen Zhao
>Assignee: Sen Zhao
>Priority: Major
> Attachments: YARN-10048.001.patch, YARN-10048.002.patch
>
>
> After manually mounting the cgroup, the NodeManager fails to start.
> If the cpu controller has multiple mount paths, only the first mount path will 
> be returned. This can cause the return value to not be the actual cpu 
> controller mount path.
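
A standalone sketch of the selection problem described above (illustrative; not the actual cgroups-handler code): when /proc/mounts contains several cgroup entries with the cpu controller, the entry matching the configured mount path should be chosen rather than whichever comes first.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Standalone illustration: pick the cpu cgroup mount that matches the
// configured path instead of blindly taking the first cpu entry.
public class CpuControllerMountExample {
  static Optional<String> findCpuMount(List<String> procMountsLines,
                                       String configuredMountPath) {
    return procMountsLines.stream()
        .map(line -> line.split("\\s+"))
        // /proc/mounts format: device mountpoint fstype options ...
        .filter(f -> f.length >= 4 && "cgroup".equals(f[2])
            && Arrays.asList(f[3].split(",")).contains("cpu"))
        .map(f -> f[1])
        .filter(path -> path.startsWith(configuredMountPath))
        .findFirst();
  }

  public static void main(String[] args) {
    List<String> mounts = Arrays.asList(
        "cgroup /some/other/cpu cgroup rw,cpu,cpuacct 0 0",
        "cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,cpu,cpuacct 0 0");
    // Prints the configured mount, not the first cpu mount found.
    System.out.println(findCpuMount(mounts, "/sys/fs/cgroup"));
  }
}
{code}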



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10042) Upgrade grpc-xxx dependencies to 1.26.0

2019-12-19 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000578#comment-17000578
 ] 

Zhankun Tang commented on YARN-10042:
-

[~cheersyang], thanks for the review. Committed to trunk. Thanks [~seanlau] for 
the contribution!

> Upgrade grpc-xxx dependencies to 1.26.0
> --
>
> Key: YARN-10042
> URL: https://issues.apache.org/jira/browse/YARN-10042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: liusheng
>Priority: Major
> Attachments: YARN-10042.001.patch, 
> hadoop_build_aarch64_grpc_1.26.0.log, hadoop_build_x86_64_grpc_1.26.0.log, 
> yarn_csi_tests_aarch64_grpc_1.26.0.log, yarn_csi_tests_x86_64_grpc_1.26.0.log
>
>
> For now, Hadoop YARN uses grpc-context, grpc-core, grpc-netty, grpc-protobuf, 
> grpc-protobuf-lite, grpc-stub and protoc-gen-grpc-java at version 1.15.1, but 
> "protoc-gen-grpc-java" is not supported on the aarch64 platform. Now the 
> grpc-java repo supports the aarch64 platform, released as 1.26.0 on Maven 
> Central.
> See:
> [https://github.com/grpc/grpc-java/pull/6496]
> [https://search.maven.org/search?q=g:io.grpc]
>  It would be better to upgrade the grpc-xxx dependencies to version 1.26.0. 
> Both x86_64 and aarch64 servers build OK according to my 
> testing; please see the attachments: the log of building on aarch64, the log 
> of building on x86_64, the log of running the YARN CSI tests on aarch64, and the 
> log of running the YARN CSI tests on x86_64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10042) Upgrade grpc-xxx dependencies to 1.26.0

2019-12-19 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-10042:

Fix Version/s: 3.3.0

> Upgrade grpc-xxx dependencies to 1.26.0
> --
>
> Key: YARN-10042
> URL: https://issues.apache.org/jira/browse/YARN-10042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: liusheng
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-10042.001.patch, 
> hadoop_build_aarch64_grpc_1.26.0.log, hadoop_build_x86_64_grpc_1.26.0.log, 
> yarn_csi_tests_aarch64_grpc_1.26.0.log, yarn_csi_tests_x86_64_grpc_1.26.0.log
>
>
> For now, Hadoop YARN uses grpc-context, grpc-core, grpc-netty, grpc-protobuf, 
> grpc-protobuf-lite, grpc-stub and protoc-gen-grpc-java at version 1.15.1, but 
> "protoc-gen-grpc-java" is not supported on the aarch64 platform. Now the 
> grpc-java repo supports the aarch64 platform, released as 1.26.0 on Maven 
> Central.
> See:
> [https://github.com/grpc/grpc-java/pull/6496]
> [https://search.maven.org/search?q=g:io.grpc]
>  It would be better to upgrade the grpc-xxx dependencies to version 1.26.0. 
> Both x86_64 and aarch64 servers build OK according to my 
> testing; please see the attachments: the log of building on aarch64, the log 
> of building on x86_64, the log of running the YARN CSI tests on aarch64, and the 
> log of running the YARN CSI tests on x86_64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10041) Should not use AbstractPath to create unix domain socket

2019-12-19 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000569#comment-17000569
 ] 

Zhankun Tang commented on YARN-10041:
-

[~bzhaoopenstack], [~liusheng], could you please upload a patch file as in 
YARN-10042 and click the "Submit Patch" button to trigger the CI?

> Should not use AbstractPath to create unix domain socket
> 
>
> Key: YARN-10041
> URL: https://issues.apache.org/jira/browse/YARN-10041
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: X86/ARM
> OS: ubuntu 1804
> java: java8
>Reporter: zhao bo
>Priority: Major
>
> This issue was hit in a very coincidental scenario. It happened when we tested on 
> ARM.
> The test case is:
> org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService
>  
> The steps are:
> If we put the Hadoop source code in a very deep directory path, this case 
> passes on the first run, but always fails on the following 
> tries.
> The official Jenkins doesn't cover this, because it runs in a Docker container 
> and only runs the test once. So it looks like it always passes.
>  
> The key point is that the UNIX domain socket path exceeds the limit of 
> UNIX_PATH_MAX (108). Please see [1].
>  
> This issue is very difficult to locate, as the test will always report a binding 
> failure when we execute it.
>  
> Also, I saw that the Hadoop code in the trunk branch uses the absolute path to 
> create the UNIX domain socket file. The source code is [2]. So it cannot 
> avoid hitting this issue. It would be good to provide a second way to set the 
> socket path to '/tmp' or any other place when executing this test.
> [1] 
> [https://serverfault.com/questions/641347/check-if-a-path-exceeds-maximum-for-unix-domain-socket]
> [2] 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-csi/src/test/java/org/apache/hadoop/yarn/csi/client/TestCsiClient.java#L48]
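
A small standalone sketch of the workaround suggested above (illustrative only; the constant 108 is the typical Linux UNIX_PATH_MAX): if the socket path derived from the build directory would exceed the limit, fall back to a short temporary directory such as one under /tmp.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Standalone sketch: keep a unix domain socket path under UNIX_PATH_MAX.
public class SocketPathExample {
  // Typical Linux limit for sun_path, including the trailing NUL byte.
  private static final int UNIX_PATH_MAX = 108;

  static Path chooseSocketPath(Path preferred) throws IOException {
    if (preferred.toAbsolutePath().toString().length() < UNIX_PATH_MAX) {
      return preferred;
    }
    // Too long (e.g. a deeply nested build dir): fall back to a short tmp dir.
    return Files.createTempDirectory("csi-").resolve("csi.sock");
  }

  public static void main(String[] args) throws IOException {
    StringBuilder deepDir = new StringBuilder();
    for (int i = 0; i < 10; i++) {
      deepDir.append("/very-deep-build-directory");
    }
    Path deep = Paths.get(deepDir.toString(), "csi.sock");
    System.out.println(chooseSocketPath(deep));
  }
}
{code}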



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10042) Upgrade grpc-xxx dependencies to 1.26.0

2019-12-19 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1658#comment-1658
 ] 

Zhankun Tang commented on YARN-10042:
-

[~seanlau], Thanks for catching this. The patch looks good to me. The failing 
test cases seem unrelated: the "testDeadNodeDetectionInBackground" failure 
appears in other Jiras too, and the other two test case failures are 
out-of-memory errors. +1.

[~cheersyang], since this is related to CSI dependencies, would you like to 
take a look at this?

> Upgrade grpc-xxx dependencies to 1.26.0
> --
>
> Key: YARN-10042
> URL: https://issues.apache.org/jira/browse/YARN-10042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: liusheng
>Priority: Major
> Attachments: YARN-10042.001.patch, 
> hadoop_build_aarch64_grpc_1.26.0.log, hadoop_build_x86_64_grpc_1.26.0.log, 
> yarn_csi_tests_aarch64_grpc_1.26.0.log, yarn_csi_tests_x86_64_grpc_1.26.0.log
>
>
> For now, Hadoop YARN uses grpc-context, grpc-core, grpc-netty, grpc-protobuf, 
> grpc-protobuf-lite, grpc-stub and protoc-gen-grpc-java of version 1.15.1, but 
> "protoc-gen-grpc-java" is not supported on the aarch64 platform. Now the 
> grpc-java repo supports the aarch64 platform, with a release in 1.26.0 in 
> Maven Central.
> see:
> [https://github.com/grpc/grpc-java/pull/6496]
> [https://search.maven.org/search?q=g:io.grpc]
>  It is better to upgrade the grpc-xxx dependencies to version 1.26.0. Both 
> x86_64 and aarch64 servers build OK according to my testing; please see the 
> attachments: log of building on aarch64, log of building on x86_64, log of 
> running the yarn csi tests on aarch64, and log of running the yarn csi tests 
> on x86_64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10041) Should not use AbstractPath to create unix domain socket

2019-12-18 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998975#comment-16998975
 ] 

Zhankun Tang commented on YARN-10041:
-

[~bzhaoopenstack], thanks for catching this. Would you like to provide a patch 
for this?

> Should not use AbstractPath to create unix domain socket
> 
>
> Key: YARN-10041
> URL: https://issues.apache.org/jira/browse/YARN-10041
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: X86/ARM
> OS: ubuntu 1804
> java: java8
>Reporter: zhao bo
>Priority: Major
>
> This issue is hit in a very coincidental scenario. It happened when we tested 
> on ARM.
> The test case is:
> org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService
>  
> The steps are:
> If we put the hadoop source code in a very deep directory path, this case 
> passes the first time it runs, but always fails in the following tries.
> The official Jenkins doesn't cover this, because it runs in a Docker container 
> and runs the test only once. So it always looks like it passes.
>  
> The key point is that the UNIX domain socket path exceeds the limit of 
> UNIX_PATH_MAX (108). Please see [1]
>  
> This issue is very difficult to locate, as the test will always report a bind 
> failure when we execute it.
>  
> Also, I saw that the hadoop code in the trunk branch uses the absolute path to 
> create the UNIX domain socket file. The source code is [2]. So that cannot 
> prevent hitting this issue. It would be good to provide a second way to set 
> the socket path to '/tmp' or any other place when executing this test.
> [1] 
> [https://serverfault.com/questions/641347/check-if-a-path-exceeds-maximum-for-unix-domain-socket]
> [2] 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-csi/src/test/java/org/apache/hadoop/yarn/csi/client/TestCsiClient.java#L48]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA

2019-11-05 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968014#comment-16968014
 ] 

Zhankun Tang commented on YARN-9605:


[~cane], I triggered a new build; let's see.

> Add ZkConfiguredFailoverProxyProvider for RM HA
> ---
>
> Key: YARN-9605
> URL: https://issues.apache.org/jira/browse/YARN-9605
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-9605.001.patch, YARN-9605.002.patch
>
>
> In this issue, I will track a new feature to support 
> ZkConfiguredFailoverProxyProvider for RM HA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-10-29 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961935#comment-16961935
 ] 

Zhankun Tang edited comment on YARN-9011 at 10/29/19 12:15 PM:
---

[~pbacsko], Thanks for the explanation. After the offline sync-up, this 
"lazyLoaded" approach seems a good way to go without locking the hostDetails. 
+1 from me. Thoughts, [~bibinchundatt]?


was (Author: tangzhankun):
[~pbacsko], Thanks for the explanation. After the offline sync-up, this seems 
a good way to go without locking the hostDetails. +1 from me. Thoughts, 
[~bibinchundatt]?

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-10-29 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961935#comment-16961935
 ] 

Zhankun Tang edited comment on YARN-9011 at 10/29/19 12:14 PM:
---

[~pbacsko], Thanks for the explanation. After the offline sync-up, this seems 
a good way to go without locking the hostDetails. +1 from me. Thoughts, 
[~bibinchundatt]?


was (Author: tangzhankun):
[~pbacsko], Thanks for the explanation. After the offline sync-up, this seems 
a good lock-free way to go. +1 from me.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-10-29 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961935#comment-16961935
 ] 

Zhankun Tang commented on YARN-9011:


[~pbacsko], Thanks for the explanation. After the offline sync-up, this seems 
a good lock-free way to go. +1 from me.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-10-29 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961621#comment-16961621
 ] 

Zhankun Tang edited comment on YARN-9011 at 10/29/19 11:54 AM:
---

[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a "lazyLoaded"? I don't see "hostDetails" differences between 
"getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in method "isNodeInDecommissioning"? Because 
The "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation. So it will always be scanned when heartbeat which seems not 
necessary. 


was (Author: tangzhankun):
[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be scanned on every heartbeat, which seems 
unnecessary.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961621#comment-16961621
 ] 

Zhankun Tang edited comment on YARN-9011 at 10/29/19 2:49 AM:
--

[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be scanned on every heartbeat, which seems 
unnecessary.


was (Author: tangzhankun):
[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be scanned, which seems unnecessary.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-9011) Race condition during decommissioning

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961621#comment-16961621
 ] 

Zhankun Tang edited comment on YARN-9011 at 10/29/19 2:49 AM:
--

[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be scanned, which seems unnecessary.


was (Author: tangzhankun):
[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be executed, which seems unnecessary.

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961621#comment-16961621
 ] 

Zhankun Tang commented on YARN-9011:


[~pbacsko], Thanks for the new patch. The idea looks good to me. Several 
comments:

1. Why do we need a lazy update? I don't see any "hostDetails" difference 
between "getLazyLoadedHostDetails" and "getHostDetails".
2. Could we check the "Decommissioning" status before 
"isGracefullyDecommissionableNode" in the method "isNodeInDecommissioning"? 
Because "gracefulDecommissionableNodes" will only be cleared after the refresh 
operation, it will always be executed, which seems unnecessary.
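
To make point 2 concrete, a minimal sketch of the suggested ordering. The 
method and set names come from the patch under discussion, but the surrounding 
logic here is simplified and hypothetical, not the patch itself:

{code:java}
// Sketch only: with &&, Java short-circuits, so putting the cheap node-state
// check first means the gracefulDecommissionableNodes set is not consulted on
// every heartbeat from nodes that are not decommissioning at all.
private boolean isNodeInDecommissioning(RMNode rmNode) {
  return rmNode != null
      && rmNode.getState() == NodeState.DECOMMISSIONING  // cheap state check first
      && isGracefullyDecommissionableNode(rmNode);       // set lookup only if needed
}
{code}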

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch, 
> YARN-9011-006.patch, YARN-9011-007.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9931) Support run script before kill container

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960938#comment-16960938
 ] 

Zhankun Tang edited comment on YARN-9931 at 10/28/19 11:08 AM:
---

[~cane], IIUC, this is for debugging why a container got killed? Do you have a 
sample patch?


was (Author: tangzhankun):
[~cane], do you have a sample patch?

> Support run script before kill container
> 
>
> Key: YARN-9931
> URL: https://issues.apache.org/jira/browse/YARN-9931
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> Like the node health check script, we can add a pre-kill script which runs 
> before killing a container.
> For example, we can save a thread dump before killing the container, which is 
> helpful for troubleshooting.
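
A hedged sketch of what such a hook could look like on the NodeManager side 
(the configuration key, helper name, and call site are hypothetical; nothing 
like this exists in YARN today): run an admin-supplied script with the 
container PID, e.g. to capture a jstack dump, before the kill signal is sent.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper; the config key below is made up for illustration.
static void runPreKillScript(Configuration conf, String containerPid)
    throws Exception {
  String script = conf.get("yarn.nodemanager.container.pre-kill-script.path");
  if (script == null || script.isEmpty()) {
    return; // feature not configured
  }
  // Pass the container PID so the script can run e.g. "jstack <pid>".
  Process p = new ProcessBuilder(script, containerPid).inheritIO().start();
  // Bound the wait so a slow or hung script cannot delay the kill indefinitely.
  if (!p.waitFor(30, TimeUnit.SECONDS)) {
    p.destroyForcibly();
  }
}
{code}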



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9931) Support run script before kill container

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960938#comment-16960938
 ] 

Zhankun Tang commented on YARN-9931:


[~cane], do you have a sample patch?

> Support run script before kill container
> 
>
> Key: YARN-9931
> URL: https://issues.apache.org/jira/browse/YARN-9931
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> Like the node health check script, we can add a pre-kill script which runs 
> before killing a container.
> For example, we can save a thread dump before killing the container, which is 
> helpful for troubleshooting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9748) Allow capacity-scheduler configuration on HDFS and support reload from HDFS

2019-10-28 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960934#comment-16960934
 ] 

Zhankun Tang commented on YARN-9748:


[~cane], could you please clarify your requirement? 

> Allow capacity-scheduler configuration on HDFS and support reload from HDFS
> ---
>
> Key: YARN-9748
> URL: https://issues.apache.org/jira/browse/YARN-9748
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> Improvement:
> Support auto reload from hdfs
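
For context, a rough sketch of what reading the scheduler configuration from an 
HDFS location could look like (the path and the reload wiring are assumptions; 
this is not how CapacityScheduler loads its configuration today):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical: load capacity-scheduler.xml from HDFS instead of the classpath.
static Configuration loadCsConfFromHdfs(Configuration clusterConf) throws Exception {
  Path remote = new Path("hdfs://namenode:8020/yarn/conf/capacity-scheduler.xml");
  FileSystem fs = remote.getFileSystem(clusterConf);
  Configuration csConf = new Configuration(false);
  try (FSDataInputStream in = fs.open(remote)) {
    csConf.addResource(in);
    csConf.size(); // force the stream to be parsed before it is closed
  }
  return csConf;
}
{code}

An "auto reload" would then re-run something like this on a timer or on a 
detected file change and feed the result into the scheduler's reinitialize 
path; that wiring is left out of the sketch.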



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-23 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9921:
---
Fix Version/s: 3.1.4
   3.3.0

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.3.0, 3.1.4
>
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid, 
> with everything the same except numAllocations, as expected. But still, the 
> equals check below in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-23 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958465#comment-16958465
 ] 

Zhankun Tang commented on YARN-9921:


[~prabhujoseph], Thanks for the review.

[~tarunparimi], Thanks for the patch. Committed to trunk and branch-3.1.

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid, 
> with everything the same except numAllocations, as expected. But still, the 
> equals check below in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-23 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957693#comment-16957693
 ] 

Zhankun Tang commented on YARN-9921:


[~Prabhu Joseph], [~sunilg], if there are no more comments, I'll commit it soon.

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid, 
> with everything the same except numAllocations, as expected. But still, the 
> equals check below in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-21 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955769#comment-16955769
 ] 

Zhankun Tang commented on YARN-9921:


[~tarunparimi], Thanks for reproducing it and finding the root cause! The patch 
looks good to me. +1

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid, 
> with everything the same except numAllocations, as expected. But still, the 
> equals check below in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9861) The ResourceManager log reports an error "Too many open files", the analysis is related to the service

2019-09-27 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939320#comment-16939320
 ] 

Zhankun Tang commented on YARN-9861:


[~billie.rinaldi], if you get a chance, could you please take a look at this?

The issue happens when running Submarine, per offline discussion. It seems to 
be caused by the YARN native service leaking socket/HDFS file handles. Thoughts?

> The ResourceManager log reports an error "Too many open files", the analysis 
> is related to the service
> --
>
> Key: YARN-9861
> URL: https://issues.apache.org/jira/browse/YARN-9861
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.3.0
> Environment: yarn version:3.3.0-SNAPSHOT
> hdfs version:2.7.1
>Reporter: jason
>Priority: Major
> Attachments: picture1.png, picture2.png, picture3.png, picture4.png, 
> picture5.png, submarine_kerasgesv2date20190807.json
>
>
> The ResourceManager log outputs "Too many open files" and new tasks cannot be 
> submitted.
> 1. First, the error in picture 1.
> 2. Then check the file handles opened by the RM (lsof -p PID); see picture 2.
> 3. Also read the NameNode audit log (picture 3).
> 4. Confirm the service involved according to the path in the service 
> configuration (picture 4).
> 5. File handle count growth trend (picture 5).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9861) The ResourceManager log reports an error "Too many open files", the analysis is related to the service

2019-09-27 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9861:
---
Attachment: submarine_kerasgesv2date20190807.json

> The ResourceManager log reports an error "Too many open files", the analysis 
> is related to the service
> --
>
> Key: YARN-9861
> URL: https://issues.apache.org/jira/browse/YARN-9861
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.3.0
> Environment: yarn version:3.3.0-SNAPSHOT
> hdfs version:2.7.1
>Reporter: jason
>Priority: Major
> Attachments: picture1.png, picture2.png, picture3.png, picture4.png, 
> picture5.png, submarine_kerasgesv2date20190807.json
>
>
> The ResourceManager log outputs "Too many open files" and new tasks cannot be 
> submitted.
> 1. First, the error in picture 1.
> 2. Then check the file handles opened by the RM (lsof -p PID); see picture 2.
> 3. Also read the NameNode audit log (picture 3).
> 4. Confirm the service involved according to the path in the service 
> configuration (picture 4).
> 5. File handle count growth trend (picture 5).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-24 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936635#comment-16936635
 ] 

Zhankun Tang commented on YARN-9011:


[~pbacsko], I see. I may be missing something important.

What about adding more check conditions? Would it be ugly? Just a simple 
pseudo-code:

{code:java}
if !isValidNode && (!isNodeDecommissioning && !isNodeRunning)
{code}




> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-24 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936518#comment-16936518
 ] 

Zhankun Tang commented on YARN-9847:


[~suxingfate], Thanks for the clarification. It looks good to me. +1

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to 
> write huge data into a znode. This makes ZooKeeper report a Len error and then 
> causes the ZK session to lose its connection, and eventually the RM would crash 
> due to the ZK connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing because of this, the fix 
> limits the size of the per-attempt data by truncating the diagnostic info when 
> writing ApplicationAttemptStateData into the znode. The size is regulated by 
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the 
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-24 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936476#comment-16936476
 ] 

Zhankun Tang commented on YARN-9847:


[~suxingfate], I see. Thanks! One question on the patch: in the test case, the 
maximum size is 100K, and the truncation will change the original size to 100K. 
Why are the two not equal here?
{code:java}
 assertNotEquals("", attempt1.getDiagnostics(),
   attemptStateData1.getDiagnostics()); 
{code}
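
For context, here is a minimal hedged sketch of the kind of truncation being 
discussed, assuming a hypothetical helper and a byte limit derived from 
jute.maxbuffer (the names are illustrative, not the actual patch). It also shows 
why the stored copy can differ from the attempt's in-memory diagnostics:

{code:java}
// Hypothetical sketch: cap the diagnostics so the serialized znode stays under the
// configured limit. Only the copy written to the state store is truncated; the
// in-memory attempt keeps the full diagnostics, so the two strings can differ.
static String truncateDiagnostics(String diagnostics, int limitBytes) {
  if (diagnostics == null) {
    return null;
  }
  byte[] bytes = diagnostics.getBytes(java.nio.charset.StandardCharsets.UTF_8);
  if (bytes.length <= limitBytes) {
    return diagnostics;
  }
  // Keep the tail: the most recent diagnostic lines are usually the most useful.
  return new String(bytes, bytes.length - limitBytes, limitBytes,
      java.nio.charset.StandardCharsets.UTF_8);
}
{code}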

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to 
> write huge data into a znode. This makes ZooKeeper report a Len error and then 
> causes the ZK session to lose its connection, and eventually the RM would crash 
> due to the ZK connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing because of this, the fix 
> limits the size of the per-attempt data by truncating the diagnostic info when 
> writing ApplicationAttemptStateData into the znode. The size is regulated by 
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the 
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> 

[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-23 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936333#comment-16936333
 ] 

Zhankun Tang commented on YARN-9847:


[~suxingfate], thanks for the clarification! Is this a duplicate of YARN-5006?

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch, YARN-9847.002.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to 
> write huge data into a znode. This makes ZooKeeper report a Len error and then 
> causes the ZK session to lose its connection, and eventually the RM would crash 
> due to the ZK connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing because of this, the fix 
> limits the size of the per-attempt data by truncating the diagnostic info when 
> writing ApplicationAttemptStateData into the znode. The size is regulated by 
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the 
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at 

[jira] [Commented] (YARN-9011) Race condition during decommissioning

2019-09-23 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936329#comment-16936329
 ] 

Zhankun Tang commented on YARN-9011:


[~pbacsko], Thanks for the elaboration. I'm not sure I understand this clearly.
 Is the unexpected "Disallowed NodeManager nodeId .." caused by the *false* 
value returned by "isNodeInDecommissioning(nodeId)"? If this node is not in the 
"decommissioning" state, what is its state at that point? Is it in the middle of 
the transition from the RUNNING state to the DECOMMISSIONING state?
{code:java}
if (!this.nodesListManager.isValidNode(nodeId.getHost())
&& !isNodeInDecommissioning(nodeId)) {
{code}
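
To make the suspected window concrete, here is a hedged guess at the shape of 
such a state-based check (assumed for illustration, not copied from the RM 
source):

{code:java}
// Assumed shape: this returns true only once the RMNode state machine has already
// moved to DECOMMISSIONING. Between the refreshNodes call and that transition the
// node is still RUNNING, so the check returns false and the heartbeat is answered
// with SHUTDOWN -- which would explain the race described in this issue.
private boolean isNodeInDecommissioning(NodeId nodeId) {
  RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);
  return rmNode != null && rmNode.getState() == NodeState.DECOMMISSIONING;
}
{code}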

> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.1
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch, 
> YARN-9011-003.patch, YARN-9011-004.patch
>
>
> During internal testing, we found a nasty race condition which occurs during 
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:00:17,634 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received 
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting 
> down.
> 2018-06-18 21:07:37,377 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
> ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be 
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs 
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an 
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader 
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
>  exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully 
> decommission node node-6.hostname.com:8041 with state RUNNING
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: 
> Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: 
> node-6.hostname.com
> 2018-06-18 21:00:17,576 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node 
> node-6.hostname.com:8041 in DECOMMISSIONING.
> 2018-06-18 21:00:17,575 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
> IP=172.26.22.115OPERATION=refreshNodes  TARGET=AdminService 
> RESULT=SUCCESS
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve 
> original total capability: 
> 2018-06-18 21:00:17,577 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 
> node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
> {noformat}
> When the decommissioning succeeds, there is no output logged from 
> {{ResourceTrackerService}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9847) ZKRMStateStore will cause zk connection loss when writing huge data into znode

2019-09-20 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934125#comment-16934125
 ] 

Zhankun Tang commented on YARN-9847:


[~suxingfate], thanks for reporting this. This is interesting. One question: 
will this truncation affect the state recovery?

> ZKRMStateStore will cause zk connection loss when writing huge data into znode
> --
>
> Key: YARN-9847
> URL: https://issues.apache.org/jira/browse/YARN-9847
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wang, Xinglong
>Assignee: Wang, Xinglong
>Priority: Minor
> Attachments: YARN-9847.001.patch
>
>
> Recently, we encountered an RM ZK connection issue because the RM was trying to 
> write huge data into a znode. This makes ZooKeeper report a Len error and then 
> causes the ZK session to lose its connection, and eventually the RM would crash 
> due to the ZK connection issue.
> *The fix*
> In order to protect the ResourceManager from crashing because of this, the fix 
> limits the size of the per-attempt data by truncating the diagnostic info when 
> writing ApplicationAttemptStateData into the znode. The size is regulated by 
> -Djute.maxbuffer set in yarn-env.sh; the same value is also used by the 
> ZooKeeper server.
> *The story*
> ResourceManager Log
> {code:java}
> 2019-07-29 02:14:59,638 WARN org.apache.zookeeper.ClientCnxn: Session 
> 0x36ab902369100a0 for serverabc-zk-5.vip.ebay.com/10.210.82.29:2181, 
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> 2019-07-29 04:27:35,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:998)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$5.run(ZKRMStateStore.java:995)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1174)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1001)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:1009)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:1050)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:699)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:299)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:955)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1036)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1031)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> 

[jira] [Commented] (YARN-9612) Support using ip to register NodeID

2019-09-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925321#comment-16925321
 ] 

Zhankun Tang commented on YARN-9612:


[~cane], the background and the motivation are still not clear to me. :)

> Support using ip to register NodeID
> ---
>
> Key: YARN-9612
> URL: https://issues.apache.org/jira/browse/YARN-9612
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Priority: Major
>
> In an environment like k8s, we should support using the IP when registering the 
> NodeID with the RM, since the hostname will be the pod name, which cannot be 
> resolved by the k8s DNS.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9605) Add ZkConfiguredFailoverProxyProvider for RM HA

2019-09-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925317#comment-16925317
 ] 

Zhankun Tang commented on YARN-9605:


[~cane], Thanks for contributing this. I saw there are failures in the Jenkins 
results. Could you please try to fix them?

> Add ZkConfiguredFailoverProxyProvider for RM HA
> ---
>
> Key: YARN-9605
> URL: https://issues.apache.org/jira/browse/YARN-9605
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-9605.001.patch
>
>
> In this issue, I will track a new feature to support 
> ZkConfiguredFailoverProxyProvider for RM HA.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9739) appsTableData in AppsBlock may cause OOM

2019-09-08 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925308#comment-16925308
 ] 

Zhankun Tang commented on YARN-9739:


[~cane], Thanks for catching this point. Do you mean we should make this a 
cache to serve multiple users' requests?

> appsTableData in AppsBlock may cause OOM
> 
>
> Key: YARN-9739
> URL: https://issues.apache.org/jira/browse/YARN-9739
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhoukang
>Priority: Major
> Attachments: heap0.png, heap1.png, stack.png
>
>
> If we have many users listing the applications, it may cause an RM OOM.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-04 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9785:
---
Fix Version/s: 3.1.3

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9785-001.patch, YARN-9785-branch-3.1.001.patch, 
> YARN-9785.002.patch, YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure below property in resource-types.xml
> {quote}
>  yarn.resource-types
>  yarn.io/gpu
>  
> {quote}
> Submit applications even after the AM limit for a queue is reached. The 
> applications still get activated even though the limit has been reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-03 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9785:
---
Fix Version/s: 3.2.1
   3.3.0

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure below property in resource-types.xml
> {quote}
>  yarn.resource-types
>  yarn.io/gpu
>  
> {quote}
> Submit applications even after the AM limit for a queue is reached. The 
> applications still get activated even though the limit has been reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-03 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921207#comment-16921207
 ] 

Zhankun Tang commented on YARN-9785:


[~bibinchundatt], this has been committed to trunk and branch-3.2. But it 
doesn't apply to branch-3.1. Could you please update the patch for branch-3.1?

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure below property in resource-types.xml
> {quote}
>  yarn.resource-types
>  yarn.io/gpu
>  
> {quote}
> Submit applications even after the AM limit for a queue is reached. The 
> applications still get activated even though the limit has been reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-09-02 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921128#comment-16921128
 ] 

Zhankun Tang commented on YARN-9797:


Thanks, [~bibinchundatt], [~BilwaST].  +1 from me. cc [~sunilg]

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch, YARN-9797-002.patch, 
> YARN-9797-003.patch, YARN-9797-004.patch, YARN-9797-005.patch
>
>
> The dominant resource calculator's compare function checks whether the dominant 
> resource is lessThan.
> In the case of the AM limit, we should activate an application only when all of 
> the resourceValues are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Fix DominantResourceCalculator when one resource is zero

2019-09-02 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921125#comment-16921125
 ] 

Zhankun Tang commented on YARN-9785:


+1 as well. Will commit this soon.

> Fix DominantResourceCalculator when one resource is zero
> 
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9785-001.patch, YARN-9785.002.patch, 
> YARN-9785.003.patch, YARN-9785.wip.patch
>
>
> Configure below property in resource-types.xml
> {quote}
>  yarn.resource-types
>  yarn.io/gpu
>  
> {quote}
> Submit applications even after the AM limit for a queue is reached. The 
> applications still get activated even though the limit has been reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9797) LeafQueue#activateApplications should use resourceCalculator#fitsIn

2019-08-29 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918478#comment-16918478
 ] 

Zhankun Tang commented on YARN-9797:


[~BilwaST], Thanks for the patch and [~bibinchundatt] for the review.

One suggestion: would it be better to have a test case with the GPU resource 
enabled but the AM resource requesting only CPU and memory?
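
To sketch that scenario (a hedged illustration with made-up values; the resource 
name, setters, and assertion are assumptions, not the actual test):

{code:java}
// Illustration: the GPU resource type is registered, but the AM asks only for
// memory and vcores (gpu stays 0). fitsIn should still report that the AM
// resource fits within the AM limit, so the application gets activated.
Resource amUsed = Resource.newInstance(10 * 1024, 10);   // gpu defaults to 0
Resource amLimit = Resource.newInstance(20 * 1024, 20);
amLimit.setResourceValue("yarn.io/gpu", 8);
assertTrue(resourceCalculator.fitsIn(amUsed, amLimit));  // expect activation
{code}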

> LeafQueue#activateApplications should use resourceCalculator#fitsIn
> ---
>
> Key: YARN-9797
> URL: https://issues.apache.org/jira/browse/YARN-9797
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Attachments: YARN-9797-001.patch
>
>
> The dominant resource calculator's compare function checks whether the dominant 
> resource is lessThan.
> In the case of the AM limit, we should activate an application only when all of 
> the resourceValues are less than the AM limit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9785) Application gets activated even when AM memory has reached

2019-08-26 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916269#comment-16916269
 ] 

Zhankun Tang commented on YARN-9785:


[~BilwaST], Thanks for reporting this. We're going to cut the branches for 3.1.3 
and 3.2.1. Do you have a patch for this blocker issue? We can get it reviewed 
and merged soon.

> Application gets activated even when AM memory has reached
> --
>
> Key: YARN-9785
> URL: https://issues.apache.org/jira/browse/YARN-9785
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Blocker
>
> Configure below property in resource-types.xml
> {quote}
>  yarn.resource-types
>  yarn.io/gpu
>  
> {quote}
> Submit applications even after the AM limit for a queue is reached. The 
> applications still get activated even though the limit has been reached.
> !queue.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-08-26 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915831#comment-16915831
 ] 

Zhankun Tang commented on YARN-9607:


Bulk update: Preparing for the 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4; please move back if it is a blocker for you.

> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9607.001.patch, YARN-9607.002.patch
>
>
> In YARN-9525, we made the IFile format compatible with remote folders using the 
> s3a scheme. In rolling-fashion log aggregation, IFile still fails with the 
> "append is not supported" error message, which is a known limitation of the 
> format by design. 
> There is a workaround though: by setting the rollover size in the configuration 
> of the IFile format, a new aggregated log file is created in each rolling cycle, 
> thus eliminating the append from the process. Setting this config globally would 
> cause performance problems in regular log aggregation, so I'm suggesting 
> enforcing this config to zero if the scheme of the URI is s3a (or any other 
> non-appendable filesystem).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9718) Yarn REST API, services endpoint remote command ejection

2019-08-26 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915826#comment-16915826
 ] 

Zhankun Tang commented on YARN-9718:


Bulk update: Preparing for the 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4; please move back if it is a blocker for you.

> Yarn REST API, services endpoint remote command ejection
> 
>
> Key: YARN-9718
> URL: https://issues.apache.org/jira/browse/YARN-9718
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9718.001.patch, YARN-9718.002.patch, 
> YARN-9718.003.patch, YARN-9718.004.patch
>
>
> Email from Oskars Vegeris:
>  
> During internal infrastructure testing it was discovered that the Hadoop Yarn 
> REST endpoint /app/v1/services contains a command injection vulnerability.
>  
> The services endpoint's normal use-case is for launching containers (e.g. 
> Docker images/apps), however by providing an argument with special shell 
> characters it is possible to execute arbitrary commands on the Host server - 
> this would allow escalating privileges and access. 
>  
> The command injection is possible in the parameter for JVM options - 
> "yarn.service.am.java.opts". It's possible to enter arbitrary shell commands 
> by using sub-shell syntax `cmd` or $(cmd). No shell character filtering is 
> performed. 
>  
> The "launch_command" which needs to be provided is meant for the container 
> and if it's not being run in privileged mode or with special options, host OS 
> should not be accessible.
>  
> I've attached a minimal request sample with an injected 'ping' command. The 
> endpoint can also be found via UI @ 
> [http://yarn-resource-manager:8088/ui2/#/yarn-services]
>  
> If no auth, or "simple auth" (username only), is enabled, commands can be executed 
> on the host OS. I know commands can also be run by the "new-application" 
> feature; however, this is clearly not meant to be a way to touch the host OS.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9718) Yarn REST API, services endpoint remote command ejection

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9718:
---
Target Version/s: 3.3.0, 3.2.1, 3.1.4  (was: 3.3.0, 3.2.1, 3.1.3)

> Yarn REST API, services endpoint remote command ejection
> 
>
> Key: YARN-9718
> URL: https://issues.apache.org/jira/browse/YARN-9718
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9718.001.patch, YARN-9718.002.patch, 
> YARN-9718.003.patch, YARN-9718.004.patch
>
>
> Email from Oskars Vegeris:
>  
> During internal infrastructure testing it was discovered that the Hadoop Yarn 
> REST endpoint /app/v1/services contains a command injection vulnerability.
>  
> The services endpoint's normal use-case is for launching containers (e.g. 
> Docker images/apps), however by providing an argument with special shell 
> characters it is possible to execute arbitrary commands on the Host server - 
> this would allow escalating privileges and access. 
>  
> The command injection is possible in the parameter for JVM options - 
> "yarn.service.am.java.opts". It's possible to enter arbitrary shell commands 
> by using sub-shell syntax `cmd` or $(cmd). No shell character filtering is 
> performed. 
>  
> The "launch_command" which needs to be provided is meant for the container 
> and if it's not being run in privileged mode or with special options, host OS 
> should not be accessible.
>  
> I've attached a minimal request sample with an injected 'ping' command. The 
> endpoint can also be found via UI @ 
> [http://yarn-resource-manager:8088/ui2/#/yarn-services]
>  
> If no auth, or "simple auth" (username only), is enabled, commands can be executed 
> on the host OS. I know commands can also be run by the "new-application" 
> feature; however, this is clearly not meant to be a way to touch the host OS.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9607:
---
Target Version/s: 3.3.0, 3.2.1, 3.1.4  (was: 3.3.0, 3.2.1, 3.1.3)

> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: YARN-9607.001.patch, YARN-9607.002.patch
>
>
> In YARN-9525, we made the IFile format compatible with remote folders using the 
> s3a scheme. In rolling-fashion log aggregation, IFile still fails with the 
> "append is not supported" error message, which is a known limitation of the 
> format by design. 
> There is a workaround though: by setting the rollover size in the configuration 
> of the IFile format, a new aggregated log file is created in each rolling cycle, 
> thus eliminating the append from the process. Setting this config globally would 
> cause performance problems in regular log aggregation, so I'm suggesting 
> enforcing this config to zero if the scheme of the URI is s3a (or any other 
> non-appendable filesystem).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8453:
---
Target Version/s: 3.0.4, 3.1.4  (was: 3.0.4, 3.1.3)

> Additional Unit  tests to verify queue limit and max-limit with multiple 
> resource types
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8453.001.patch
>
>
> With the support of additional resource types other than CPU and memory, it is 
> possible that one such new resource has exhausted its quota on a given 
> queue while other resources such as memory / CPU are still available beyond the 
> guaranteed limit (under max-limit). Adding more unit tests to ensure we are 
> not starving such allocation requests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2019-08-26 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915818#comment-16915818
 ] 

Zhankun Tang commented on YARN-8453:


Bulk update: Preparing for the 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4; please move back if it is a blocker for you.

> Additional Unit  tests to verify queue limit and max-limit with multiple 
> resource types
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8453.001.patch
>
>
> With the support of additional resource types other than CPU and memory, it is 
> possible that one such new resource has exhausted its quota on a given 
> queue while other resources such as memory / CPU are still available beyond the 
> guaranteed limit (under max-limit). Adding more unit tests to ensure we are 
> not starving such allocation requests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9642) AbstractYarnScheduler#clearPendingContainerCache could run even after transitiontostandby

2019-08-26 Thread Zhankun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915799#comment-16915799
 ] 

Zhankun Tang commented on YARN-9642:


Triggered a rebuild just now. Let's see the result if it finishes before the 
Jenkins shutdown.

> AbstractYarnScheduler#clearPendingContainerCache could run even after 
> transitiontostandby
> -
>
> Key: YARN-9642
> URL: https://issues.apache.org/jira/browse/YARN-9642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Blocker
>  Labels: memory-leak
> Attachments: YARN-9642.001.patch, YARN-9642.002.patch, 
> YARN-9642.003.patch, image-2019-06-22-16-05-24-114.png
>
>
> The TimerTask could hold a reference to the Scheduler in the case of a fast 
> switchover too.
>  AbstractYarnScheduler should make sure the scheduled Timer is cancelled on 
> serviceStop.
> This causes a memory leak too.
> !image-2019-06-22-16-05-24-114.png!
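
For illustration, a hedged sketch of the kind of cleanup being suggested (the 
field name is an assumption, not necessarily the actual AbstractYarnScheduler 
code):

{code:java}
// Sketch only: cancel the pending-container Timer when the scheduler service stops,
// so its TimerTask cannot keep the old scheduler instance alive after a failover.
@Override
protected void serviceStop() throws Exception {
  if (releaseCache != null) {   // assumed name of the java.util.Timer field
    releaseCache.cancel();
    releaseCache = null;
  }
  super.serviceStop();
}
{code}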



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8257:
---
Target Version/s: 3.1.4  (was: 3.1.3)

Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4, please move back if it is a blocker.

> Native service should automatically adding escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for environment / launch command contains chars like 
> ", /, `: it needs to be escaped twice.
> The first escape comes from the JSON spec: because JSON accepts double quotes 
> only, the string needs an escape.
> The second comes from the container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> And example of environment:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should auto-escape the input string once. 
> (For example, if the user specified 
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> We will automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8417:
---
Target Version/s: 3.1.4  (was: 3.1.3)

Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4, please move back if it is a blocker.

> Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker 
> container.
> 
>
> Key: YARN-8417
> URL: https://issues.apache.org/jira/browse/YARN-8417
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, and CLASSPATH environment 
> variables before launching a Docker container, regardless of whether ENTRY_POINT 
> is used. This overwrites environment variables defined inside the Dockerfile (via 
> \{{ENV}}). For a Docker container it actually doesn't make sense to pass 
> JAVA_HOME, HDFS_HOME, etc., because inside the Docker image we have a separate 
> Java/Hadoop installation, or one mounted at exactly the same directory as on the 
> host machine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8052) Move overwriting of service definition during flex to service master

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8052:
---
Target Version/s: 3.1.4  (was: 3.1.3)

Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4, please move back if it is a blocker.

> Move overwriting of service definition during flex to service master
> 
>
> Key: YARN-8052
> URL: https://issues.apache.org/jira/browse/YARN-8052
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> The overwrite of the service definition during flex is done from the 
> ServiceClient. 
> During auto-finalization of an upgrade, the current service definition also gets 
> overwritten by the service master. This creates a potential conflict. 
> We need to move the overwrite of the service definition during flex to the 
> service master. 
> Discussed on YARN-8018.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8552) [DS] Container report fails for distributed containers

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8552:
---
Target Version/s: 3.1.4  (was: 3.1.3)

Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4, please move back if it is a blocker.

> [DS]  Container report fails for distributed containers
> ---
>
> Key: YARN-8552
> URL: https://issues.apache.org/jira/browse/YARN-8552
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> 2018-07-19 19:15:02,281 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1531994217928_0003_01_1099511627753 Container Transitioned from 
> ACQUIRED to RUNNING
> 2018-07-19 19:15:02,384 ERROR 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock: Failed to read the 
> container container_1531994217928_0003_01_1099511627773.
> Container reports fail for Distributed Scheduler containers. Currently all 
> the containers are fetched from the central RM, so we need to find an 
> alternative for these.
> {code}
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.exceptions.ContainerNotFoundException: 
> Container with id 'container_1531994217928_0003_01_1099511627773' doesn't 
> exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:499)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMContainerBlock.getContainerReport(RMContainerBlock.java:44)
> at 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:82)
> at 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:79)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
> ... 70 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8234:
---
Target Version/s: 3.1.4  (was: 3.1.3)

Bulk update: Preparing for 3.1.3 release. Moved all 3.1.3 non-blocker issues to 
3.1.4, please move back if it is a blocker.

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its REST API. If the cluster load is heavy, many events 
> are sent to the timeline server and the timeline server's event handler 
> thread gets locked. YARN-7266 discusses the details of this problem. Because 
> of the lock, the timeline server can't receive events as fast as the RM 
> generates them, and lots of timeline events stay in the RM's memory. 
> Eventually those events consume all of the RM's memory and the RM starts a 
> full GC (which causes a JVM stop-the-world pause and a timeout from the RM to 
> ZooKeeper) or even hits an OOM.
> The main problem is that the timeline server can't receive events as fast as 
> the RM generates them. Today the RM system metrics publisher puts only one 
> event in each request, so most of the time on the timeline side is spent on 
> HTTP headers and connection handling, and only a small fraction is spent on 
> the timeline event itself, which is the truly valuable work.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via one request. With 
> the batch size set to 1000, in our experiment the speed at which the timeline 
> server receives events improved by 100x. We have implemented this function in 
> our production environment, which accepts 2 apps per hour, and it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. The default value 
> is 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to 
> fill up and holding events in the buffer for a long time, so another thread 
> sends the events in the buffer periodically. This config sets the interval of 
> that periodic sending thread. The default value is 60s.
>  
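> As a rough illustration of the batching idea above, the sketch below buffers 
> events and flushes them either when the batch size is reached or when the 
> interval expires. The class and member names are illustrative only and not 
> taken from the actual patch; the real publisher would wrap the existing 
> single-event REST call and wire the three new properties into bufferSize, 
> batchSize and intervalSeconds.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.*;
> import java.util.function.Consumer;
>
> /** Illustrative sketch only: buffers events and sends them in batches. */
> public class BatchingPublisher<E> implements AutoCloseable {
>   private final BlockingQueue<E> buffer;
>   private final int batchSize;
>   private final Consumer<List<E>> sender;          // e.g. one REST call per batch
>   private final ScheduledExecutorService flusher =
>       Executors.newSingleThreadScheduledExecutor();
>
>   public BatchingPublisher(int bufferSize, int batchSize,
>       long intervalSeconds, Consumer<List<E>> sender) {
>     this.buffer = new LinkedBlockingQueue<>(bufferSize);
>     this.batchSize = batchSize;
>     this.sender = sender;
>     // Periodic flush so events are never held in the buffer for too long.
>     flusher.scheduleAtFixedRate(this::flush, intervalSeconds,
>         intervalSeconds, TimeUnit.SECONDS);
>   }
>
>   public void publish(E event) throws InterruptedException {
>     buffer.put(event);                              // blocks if the buffer is full
>     if (buffer.size() >= batchSize) {
>       flush();                                      // batch is full, send now
>     }
>   }
>
>   private synchronized void flush() {
>     List<E> batch = new ArrayList<>();
>     buffer.drainTo(batch, batchSize);
>     if (!batch.isEmpty()) {
>       sender.accept(batch);                         // one request for the whole batch
>     }
>   }
>
>   @Override
>   public void close() {
>     flusher.shutdown();
>     while (!buffer.isEmpty()) {
>       flush();                                      // send whatever is left
>     }
>   }
> }
> {code}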



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9376) too many ContainerIdComparator instances are not necessary

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9376:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> too many ContainerIdComparator instances are not necessary
> --
>
> Key: YARN-9376
> URL: https://issues.apache.org/jira/browse/YARN-9376
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: lindongdong
>Assignee: lindongdong
>Priority: Minor
> Attachments: YARN-9376.000.patch
>
>
>  Each RMNodeImpl creates a new ContainerIdComparator instance, but this is 
> not necessary.
> We can keep a single static ContainerIdComparator instance, and that is enough.
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl#containersToClean
> {code:java}
> /* set of containers that need to be cleaned */
> private final Set<ContainerId> containersToClean = new TreeSet<ContainerId>(
>     new ContainerIdComparator());
> {code}
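> A minimal sketch of what the proposed change could look like: because the 
> comparator is stateless, a single static instance can be shared by every 
> RMNodeImpl. The constant name below is illustrative, not from the patch.
> {code:java}
> /* Shared, stateless comparator: one instance for all RMNodeImpl objects. */
> private static final ContainerIdComparator CONTAINER_ID_COMPARATOR =
>     new ContainerIdComparator();
>
> /* set of containers that need to be cleaned */
> private final Set<ContainerId> containersToClean =
>     new TreeSet<ContainerId>(CONTAINER_ID_COMPARATOR);
> {code}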



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9330) Add support to query scheduler endpoint filtered via queue (/scheduler/queue=abc)

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9330:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> Add support to query scheduler endpoint filtered via queue 
> (/scheduler/queue=abc)
> -
>
> Key: YARN-9330
> URL: https://issues.apache.org/jira/browse/YARN-9330
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp
>Affects Versions: 3.1.2
>Reporter: Prashant Golash
>Assignee: Prashant Golash
>Priority: Minor
>  Labels: newbie, patch
> Attachments: YARN-9330.001.patch, YARN-9330.002.patch, 
> YARN-9330.003.patch, YARN-9330.004.patch
>
>
> Currently, the endpoint */ws/v1/cluster/scheduler* returns all the 
> queues as part of the REST contract.
> The intention of this JIRA is to be able to pass an additional queue 
> PathParam to return just that queue. For example:
> */ws/v1/cluster/scheduler/queue=testParentQueue*
> */ws/v1/cluster/scheduler/queue=testChildQueue*
> This will make it easy for REST clients to query just the desired queue 
> and parse the response.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9674) Max AM Resource calculation is wrong

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9674:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> Max AM Resource calculation is wrong
> 
>
> Key: YARN-9674
> URL: https://issues.apache.org/jira/browse/YARN-9674
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.2
>Reporter: ANANDA G B
>Priority: Major
> Attachments: RM_Issue.png
>
>
> 'Max AM Resource' is calculated for the default partition using 'Effective 
> Max Capacity', while for other partitions it is calculated using 'Effective 
> Capacity'.
> Which one is the correct implementation?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8657:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> User limit calculation should be read-lock-protected within LeafQueue
> -
>
> Key: YARN-8657
> URL: https://issues.apache.org/jira/browse/YARN-8657
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8657.001.patch, YARN-8657.002.patch
>
>
> When async scheduling is enabled, the user limit calculation could be wrong: 
> it is possible that the scheduler calculated a user_limit, but inside 
> {{canAssignToUser}} that value has become stale. 
> We need to protect the user limit calculation.
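> A hedged sketch of the kind of protection described above, using a standard 
> ReentrantReadWriteLock; in the real LeafQueue the queue's existing locks 
> would be reused, and computeUserLimit() below is just a placeholder for the 
> existing calculation logic.
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> // Illustrative only: compute the user limit under the read lock so that
> // concurrent (async) scheduling threads see a consistent snapshot.
> class UserLimitExample {
>   private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>
>   long getUserLimitSafely(String user, String partition) {
>     lock.readLock().lock();
>     try {
>       return computeUserLimit(user, partition);
>     } finally {
>       lock.readLock().unlock();
>     }
>   }
>
>   private long computeUserLimit(String user, String partition) {
>     return 0L; // placeholder for the existing LeafQueue calculation
>   }
> }
> {code}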



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9720) MR job submitted to a queue with default partition accessing the non-exclusive label resources

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9720:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> MR job submitted to a queue with default partition accessing the 
> non-exclusive label resources
> --
>
> Key: YARN-9720
> URL: https://issues.apache.org/jira/browse/YARN-9720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
> Attachments: Issue.png
>
>
> When an MR job is submitted to queue1 with the default partition, it 
> accesses non-exclusive partition resources. Please find the attachments.
> MR Job command:
> ./yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.0201.jar 
> pi -Dmapreduce.job.queuename=queue1 -Dmapreduce.job.node-label-expression= 10 
> 10
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9681) AM resource limit is incorrect for queue

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9681:
---

Bulk update: Preparing for 3.1.3 release. Moved the incorrect "3.1.2" 
non-blocker issues to 3.1.4, please move back to 3.1.3 if it is a blocker.

> AM resource limit is incorrect for queue
> 
>
> Key: YARN-9681
> URL: https://issues.apache.org/jira/browse/YARN-9681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
>  Labels: patch
> Attachments: After running job on queue1.png, Before running job on 
> queue1.png, YARN-9681.0001.patch, YARN-9681.0002.patch, YARN-9681.0003.patch, 
> YARN-9681.0004.patch, YARN-9681.0005.patch
>
>
> After running a job on Queue1 of Partition1, Queue1 of DEFAULT_PARTITION's 
> 'Max Application Master Resources' is calculated wrongly. Please find the 
> attachment.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9674) Max AM Resource calculation is wrong

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9674:
---
Target Version/s: 3.1.4  (was: 3.1.2)

> Max AM Resource calculation is wrong
> 
>
> Key: YARN-9674
> URL: https://issues.apache.org/jira/browse/YARN-9674
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.2
>Reporter: ANANDA G B
>Priority: Major
> Attachments: RM_Issue.png
>
>
> 'Max AM Resource' is calculated for the default partition using 'Effective 
> Max Capacity', while for other partitions it is calculated using 'Effective 
> Capacity'.
> Which one is the correct implementation?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9376) too many ContainerIdComparator instances are not necessary

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9376:
---
Target Version/s: 3.1.4  (was: 3.1.2)

> too many ContainerIdComparator instances are not necessary
> --
>
> Key: YARN-9376
> URL: https://issues.apache.org/jira/browse/YARN-9376
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: lindongdong
>Assignee: lindongdong
>Priority: Minor
> Attachments: YARN-9376.000.patch
>
>
>  Each RMNodeImpl creates a new ContainerIdComparator instance, but this is 
> not necessary.
> We can keep a single static ContainerIdComparator instance, and that is enough.
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl#containersToClean
> {code:java}
> /* set of containers that need to be cleaned */
> private final Set<ContainerId> containersToClean = new TreeSet<ContainerId>(
>     new ContainerIdComparator());
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9720) MR job submitted to a queue with default partition accessing the non-exclusive label resources

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9720:
---
Target Version/s: 3.1.4  (was: 3.1.2)

> MR job submitted to a queue with default partition accessing the 
> non-exclusive label resources
> --
>
> Key: YARN-9720
> URL: https://issues.apache.org/jira/browse/YARN-9720
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
> Attachments: Issue.png
>
>
> When an MR job is submitted to queue1 with the default partition, it 
> accesses non-exclusive partition resources. Please find the attachments.
> MR Job command:
> ./yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.0201.jar 
> pi -Dmapreduce.job.queuename=queue1 -Dmapreduce.job.node-label-expression= 10 
> 10
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9681) AM resource limit is incorrect for queue

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9681:
---
Target Version/s: 3.1.4  (was: 3.1.2)

> AM resource limit is incorrect for queue
> 
>
> Key: YARN-9681
> URL: https://issues.apache.org/jira/browse/YARN-9681
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1, 3.1.2
>Reporter: ANANDA G B
>Assignee: ANANDA G B
>Priority: Major
>  Labels: patch
> Attachments: After running job on queue1.png, Before running job on 
> queue1.png, YARN-9681.0001.patch, YARN-9681.0002.patch, YARN-9681.0003.patch, 
> YARN-9681.0004.patch, YARN-9681.0005.patch
>
>
> After running a job on Queue1 of Partition1, Queue1 of DEFAULT_PARTITION's 
> 'Max Application Master Resources' is calculated wrongly. Please find the 
> attachment.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9330) Add support to query scheduler endpoint filtered via queue (/scheduler/queue=abc)

2019-08-26 Thread Zhankun Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9330:
---
Target Version/s: 3.1.4  (was: 3.1.2)

> Add support to query scheduler endpoint filtered via queue 
> (/scheduler/queue=abc)
> -
>
> Key: YARN-9330
> URL: https://issues.apache.org/jira/browse/YARN-9330
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: webapp
>Affects Versions: 3.1.2
>Reporter: Prashant Golash
>Assignee: Prashant Golash
>Priority: Minor
>  Labels: newbie, patch
> Attachments: YARN-9330.001.patch, YARN-9330.002.patch, 
> YARN-9330.003.patch, YARN-9330.004.patch
>
>
> Currently, the endpoint */ws/v1/cluster/scheduler* returns all the 
> queues as part of the REST contract.
> The intention of this JIRA is to be able to pass an additional queue 
> PathParam to return just that queue. For example:
> */ws/v1/cluster/scheduler/queue=testParentQueue*
> */ws/v1/cluster/scheduler/queue=testChildQueue*
> This will make it easy for REST clients to query just the desired queue 
> and parse the response.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9106) Add option to graceful decommission to not wait for applications

2019-08-13 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9106:
---
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-914

> Add option to graceful decommission to not wait for applications
> 
>
> Key: YARN-9106
> URL: https://issues.apache.org/jira/browse/YARN-9106
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Mikayla Konst
>Assignee: Mikayla Konst
>Priority: Major
> Attachments: YARN-9106.patch
>
>
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications.
> If true (the default), the resource manager waits for all containers, as well 
> as all applications associated with those containers, to finish before 
> gracefully decommissioning a node.
> If false, the resource manager only waits for containers, but not 
> applications, to finish. For map-only jobs or other jobs in which mappers do 
> not need to serve shuffle data, this allows nodes to be decommissioned as 
> soon as their containers are finished as opposed to when the job is done.
> Add property 
> yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-app-masters.
> If false, during graceful decommission, when the resource manager waits for 
> all containers on a node to finish, it will not wait for app master 
> containers to finish. Defaults to true. This property should only be set to 
> false if app master failure is recoverable.
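> A possible way to set the two proposed properties is sketched below using 
> Hadoop's Configuration API; the property names and defaults come from the 
> description above, and the class name is purely illustrative.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class DecommissionConfigExample {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Decommission nodes as soon as their containers finish (e.g. map-only jobs).
>     conf.setBoolean(
>         "yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications",
>         false);
>     // Keep waiting for AM containers unless AM failure is known to be recoverable.
>     conf.setBoolean(
>         "yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-app-masters",
>         true);
>   }
> }
> {code}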



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-06 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901656#comment-16901656
 ] 

Zhankun Tang edited comment on YARN-9721 at 8/7/19 3:20 AM:


[~yuan_zac], Thanks for raising this issue! This is very helpful in a hybrid 
elastic environment.

I'm checking this story to get a more clear understanding. BTW, which solution 
do you prefer?


was (Author: tangzhankun):
[~yuan_zac], Thanks for raising this issue! This is very helpful in a hybrid 
environment.

I'm checking this story to get a more clear understanding. BTW, which solution 
do you prefer?

> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, nodes.exclude-path
>  and the "rmadmin -refreshNodes" command are used to decommission the server.
>  But this method cannot clean up the node completely. Nodemanager servers 
> still appear under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up untracked nodes.
>  But the logic of the isUntrackedNode method is too restrictive. If 
> include-path is not used, no servers can meet the criteria, and using an 
> include file would introduce a potential maintenance risk.
> If the yarn cluster is installed on cloud infrastructure, nodemanager servers 
> are created and deleted frequently. We need a way to exclude a nodemanager 
> from the yarn cluster cleanly. Otherwise, the map of 
> rmContext.getInactiveRMNodes() would keep growing, which would cause a memory 
> issue in the RM.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-06 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901656#comment-16901656
 ] 

Zhankun Tang commented on YARN-9721:


[~yuan_zac], Thanks for raising this issue! This is very helpful in a hybrid 
environment.

I'm checking this story to get a more clear understanding. BTW, which solution 
do you prefer?

> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, nodes.exclude-path
>  and the "rmadmin -refreshNodes" command are used to decommission the server.
>  But this method cannot clean up the node completely. Nodemanager servers 
> still appear under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up untracked nodes.
>  But the logic of the isUntrackedNode method is too restrictive. If 
> include-path is not used, no servers can meet the criteria, and using an 
> include file would introduce a potential maintenance risk.
> If the yarn cluster is installed on cloud infrastructure, nodemanager servers 
> are created and deleted frequently. We need a way to exclude a nodemanager 
> from the yarn cluster cleanly. Otherwise, the map of 
> rmContext.getInactiveRMNodes() would keep growing, which would cause a memory 
> issue in the RM.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9584) Should put initializeProcessTrees method call before get pid

2019-07-05 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9584:
---
Fix Version/s: (was: 3.1.2)
   (was: 3.0.3)
   (was: 3.0.0)
   3.1.3
   3.2.1

> Should put initializeProcessTrees method call before get pid
> 
>
> Key: YARN-9584
> URL: https://issues.apache.org/jira/browse/YARN-9584
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0, 3.0.3, 3.1.2
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9584.001.patch
>
>
> The ContainerMonitorImpl#MonitoringThread.run method has a logical error: it 
> gets the pid first and only then initializes the uninitialized process trees. 
> {code:java}
> String pId = ptInfo.getPID();
> // Initialize uninitialized process trees
> initializeProcessTrees(entry);
> if (pId == null || !isResourceCalculatorAvailable()) {
>   continue; // processTree cannot be tracked
> }
> {code}
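> Roughly, the fix would reorder the snippet above so the process tree is 
> initialized before the pid is read from it:
> {code:java}
> // Initialize uninitialized process trees before asking for the pid,
> // otherwise getPID() can return null for a tree that is not set up yet.
> initializeProcessTrees(entry);
> String pId = ptInfo.getPID();
> if (pId == null || !isResourceCalculatorAvailable()) {
>   continue; // processTree cannot be tracked
> }
> {code}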



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9584) Should put initializeProcessTrees method call before get pid

2019-07-05 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9584:
---
Fix Version/s: (was: 3.2.0)

> Should put initializeProcessTrees method call before get pid
> 
>
> Key: YARN-9584
> URL: https://issues.apache.org/jira/browse/YARN-9584
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0, 3.0.3, 3.1.2
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Critical
> Fix For: 3.0.0, 3.0.3, 3.1.2, 3.3.0
>
> Attachments: YARN-9584.001.patch
>
>
> The ContainerMonitorImpl#MonitoringThread.run method has a logical error: it 
> gets the pid first and only then initializes the uninitialized process trees. 
> {code:java}
> String pId = ptInfo.getPID();
> // Initialize uninitialized process trees
> initializeProcessTrees(entry);
> if (pId == null || !isResourceCalculatorAvailable()) {
>   continue; // processTree cannot be tracked
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9584) Should put initializeProcessTrees method call before get pid

2019-07-05 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9584:
---
Fix Version/s: (was: 3.1.0)

> Should put initializeProcessTrees method call before get pid
> 
>
> Key: YARN-9584
> URL: https://issues.apache.org/jira/browse/YARN-9584
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0, 3.0.3, 3.1.2
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Critical
> Fix For: 3.0.0, 3.2.0, 3.0.3, 3.1.2, 3.3.0
>
> Attachments: YARN-9584.001.patch
>
>
> The ContainerMonitorImpl#MonitoringThread.run method has a logical error: it 
> gets the pid first and only then initializes the uninitialized process trees. 
> {code:java}
> String pId = ptInfo.getPID();
> // Initialize uninitialized process trees
> initializeProcessTrees(entry);
> if (pId == null || !isResourceCalculatorAvailable()) {
>   continue; // processTree cannot be tracked
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9480) createAppDir() in LogAggregationService shouldn't block dispatcher thread of ContainerManagerImpl

2019-07-01 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876210#comment-16876210
 ] 

Zhankun Tang commented on YARN-9480:


[~yoelee], added [~Yunyao Zhang]. Thanks [~Weiwei Yang] !

> createAppDir() in LogAggregationService shouldn't block dispatcher thread of 
> ContainerManagerImpl
> -
>
> Key: YARN-9480
> URL: https://issues.apache.org/jira/browse/YARN-9480
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: liyakun
>Assignee: liyakun
>Priority: Major
>
> At present, during startContainers(), if the NM does not already contain the 
> application, it enters the INIT_APPLICATION step. In the application init 
> step, createAppDir() is executed, and it is a blocking operation.
> createAppDir() needs to interact with an external file system, so it is 
> affected by the SLA of that file system. Once the external file system has 
> high latency, the NM dispatcher thread of ContainerManagerImpl gets stuck. 
> (In fact, I have seen a case where the NM was stuck here for more than an 
> hour.)
> I think it would be more reasonable to move createAppDir() to the actual time 
> of uploading logs (in other threads). Also, according to the 
> logRetentionPolicy, many of the containers may never reach this step, which 
> would save a lot of interactions with the external file system.
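> One way to keep the dispatcher responsive is sketched below: hand the 
> remote-filesystem call to a separate executor instead of blocking the 
> ContainerManagerImpl dispatcher thread. The class, executor and method names 
> are assumptions for illustration; the description's alternative of deferring 
> the call until log-upload time would avoid the work entirely for containers 
> that never upload logs.
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Illustrative only: keep the dispatcher thread free of remote-filesystem calls.
> class AsyncAppDirExample {
>   private final ExecutorService appDirExecutor =
>       Executors.newSingleThreadExecutor();
>
>   void onApplicationInit(String appId) {
>     // The dispatcher returns immediately; createAppDir() runs in the background
>     // (or could be deferred until log-upload time, as the description suggests).
>     appDirExecutor.submit(() -> createAppDir(appId));
>   }
>
>   private void createAppDir(String appId) {
>     // Placeholder for the real interaction with the external file system.
>   }
> }
> {code}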



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9640) Slow event processing could cause too many attempt unregister events

2019-06-28 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874723#comment-16874723
 ] 

Zhankun Tang commented on YARN-9640:


[~bibinchundatt], yeah. agree.

> Slow event processing could cause too many attempt unregister events
> 
>
> Key: YARN-9640
> URL: https://issues.apache.org/jira/browse/YARN-9640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: scalability
> Attachments: YARN-9640.001.patch, YARN-9640.002.patch, 
> YARN-9640.003.patch
>
>
> During verification on one of our test clusters, we found that the number of 
> attempt unregister events was about 300k+.
>  # All of the AM's containers completed.
>  # AMRMClientImpl sends finishApplicationMaster.
>  # AMRMClient checks the finish status every 100ms using the 
> finishApplicationMaster request.
>  # AMRMClientImpl#unregisterApplicationMaster
> {code:java}
>   while (true) {
> FinishApplicationMasterResponse response =
> rmClient.finishApplicationMaster(request);
> if (response.getIsUnregistered()) {
>   break;
> }
> LOG.info("Waiting for application to be successfully unregistered.");
> Thread.sleep(100);
>   }
> {code}
>  # The ApplicationMasterService finishApplicationMaster interface sends an 
> unregister event on every status update.
> We should send the unregister event only once and cache that it was sent; 
> later requests should be ignored and a "not yet unregistered" response sent 
> back to the AM, so we do not overload the event queue.
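> A rough sketch of the "send the event only once" idea, using a per-attempt 
> flag so repeated finishApplicationMaster calls do not enqueue duplicate 
> unregister events. The map, class and helper names are hypothetical, not from 
> the patch.
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
>
> // Illustrative only: dispatch the unregister event once per attempt and let
> // retries simply receive a "not yet unregistered" response.
> class UnregisterOnceExample {
>   private final ConcurrentMap<String, Boolean> unregisterSent =
>       new ConcurrentHashMap<>();
>
>   boolean maybeSendUnregisterEvent(String attemptId) {
>     // putIfAbsent returns null only for the first caller for this attempt.
>     boolean firstCall =
>         unregisterSent.putIfAbsent(attemptId, Boolean.TRUE) == null;
>     if (firstCall) {
>       dispatchUnregisterEvent(attemptId);
>     }
>     return firstCall;
>   }
>
>   private void dispatchUnregisterEvent(String attemptId) {
>     // Placeholder for posting the event to the RM dispatcher.
>   }
> }
> {code}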



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9477) Implement VE discovery using libudev

2019-06-26 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873460#comment-16873460
 ] 

Zhankun Tang commented on YARN-9477:


[~snemeth], thanks for the review. [~pbacsko], Thanks for the patch! +1 
Committed to trunk.

> Implement VE discovery using libudev
> 
>
> Key: YARN-9477
> URL: https://issues.apache.org/jira/browse/YARN-9477
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9477-001.patch, YARN-9477-002.patch, 
> YARN-9477-003.patch, YARN-9477-004.patch, YARN-9477-005.patch, 
> YARN-9477-006.patch, YARN-9477-007.patch, YARN-9477-POC.patch, 
> YARN-9477-POC2.patch, YARN-9477-POC3.patch
>
>
> Right now we have a Python script which is able to discover VE cards using 
> pyudev: https://pyudev.readthedocs.io/en/latest/
> Java does not officially support libudev. There are some projects on Github 
> (example: https://github.com/Zubnix/udev-java-bindings) but they're not 
> available as Maven artifacts.
> However it's not that difficult to create a minimal layer around libudev 
> using JNA. We don't have to wrap every function, we need to call 4-5 methods.
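> A minimal JNA sketch along those lines, wrapping just the handful of libudev 
> calls needed to enumerate devices. The "ve" subsystem name and the 
> VeDeviceDiscovery class are assumptions for illustration, not the actual 
> patch; error handling and unref calls are omitted.
> {code:java}
> import com.sun.jna.Library;
> import com.sun.jna.Native;
> import com.sun.jna.Pointer;
>
> public class VeDeviceDiscovery {
>   /** Only the libudev functions we actually need are declared. */
>   interface Udev extends Library {
>     Udev INSTANCE = Native.load("udev", Udev.class);
>     Pointer udev_new();
>     Pointer udev_enumerate_new(Pointer udev);
>     int udev_enumerate_add_match_subsystem(Pointer enumerate, String subsystem);
>     int udev_enumerate_scan_devices(Pointer enumerate);
>     Pointer udev_enumerate_get_list_entry(Pointer enumerate);
>     String udev_list_entry_get_name(Pointer entry);
>     Pointer udev_list_entry_get_next(Pointer entry);
>   }
>
>   public static void main(String[] args) {
>     Udev udev = Udev.INSTANCE;
>     Pointer ctx = udev.udev_new();
>     Pointer enumerate = udev.udev_enumerate_new(ctx);
>     // "ve" is an assumed sysfs subsystem name for the Vector Engine cards.
>     udev.udev_enumerate_add_match_subsystem(enumerate, "ve");
>     udev.udev_enumerate_scan_devices(enumerate);
>     for (Pointer entry = udev.udev_enumerate_get_list_entry(enumerate);
>          entry != null; entry = udev.udev_list_entry_get_next(entry)) {
>       System.out.println(udev.udev_list_entry_get_name(entry)); // sysfs path
>     }
>   }
> }
> {code}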



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9640) Slow event processing could cause too many attempt unregister events

2019-06-23 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870582#comment-16870582
 ] 

Zhankun Tang commented on YARN-9640:


[~bibinchundatt], Thanks for the patch! One question: how about we avoid these 
unnecessary events on the client side?
Not quite sure if this would cause much overhead or incompatibility for 
existing production workloads.

> Slow event processing could cause too many attempt unregister events
> 
>
> Key: YARN-9640
> URL: https://issues.apache.org/jira/browse/YARN-9640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
>  Labels: scalability
> Attachments: YARN-9640.001.patch, YARN-9640.002.patch, 
> YARN-9640.003.patch
>
>
> During verification on one of our test clusters, we found that the number of 
> attempt unregister events was about 300k+.
>  # All of the AM's containers completed.
>  # AMRMClientImpl sends finishApplicationMaster.
>  # AMRMClient checks the finish status every 100ms using the 
> finishApplicationMaster request.
>  # AMRMClientImpl#unregisterApplicationMaster
> {code:java}
>   while (true) {
> FinishApplicationMasterResponse response =
> rmClient.finishApplicationMaster(request);
> if (response.getIsUnregistered()) {
>   break;
> }
> LOG.info("Waiting for application to be successfully unregistered.");
> Thread.sleep(100);
>   }
> {code}
>  # The ApplicationMasterService finishApplicationMaster interface sends an 
> unregister event on every status update.
> We should send the unregister event only once and cache that it was sent; 
> later requests should be ignored and a "not yet unregistered" response sent 
> back to the AM, so we do not overload the event queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


