[
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382513#comment-15382513
]
Sunil G commented on YARN-4091:
-------------------------------
Thanks [~ChenGe] and [~leftnoteasy]. As discussed offline, we will now track
only one heartbeat activity per request, so my points 1 and 2 are fine. I have
also gone through the code a little.
*A few additional comments:*
1. I think it is better to synchronize {{startNodeUpdateRecording}}, as it can
be invoked from multiple node heartbeats at the same time. Also, some variables
such as {{recordNextAvailableNode}} are not atomic.
2. {{recordingNodesAllocation}} itself might be enough to track the active
nodes that are doing scheduling activity. Since it is a concurrent hash map,
we can work with its key set. If so, we can drop {{activeRecordedNodes}}.
3. If a node dies while we are recording its activity, we never flush out that
data. A timer mechanism may be needed just for that case.
4. I think all the new activity monitor classes can be packed under a new
package:
{{org/apache/hadoop/yarn/server/resourcemanager/scheduler/activitymonitor}}
5. Could we place activityManager in {{CapacitySchedulerContext}}? Then we can
use a getter method rather than threading it through every method signature in
{{AbstractCSQueue}} etc.
6. I think we should not have activity start/stop/update code in various places
in the scheduler code. In {{LeafQueue}} we now have separate methods such as
recordActivity, finishAppAllocationRecording etc., and the allocate code is
similar. I think all such methods should live inside {{activityManager}}, and
those APIs need to be public (with proper javadoc). It would be better to have a
clearer interface from {{activityManager}}. A new intermediate util/helper
class that wraps {{activityManager}} is also fine. Either way, we can pull all
this extra code out of the scheduler.
7. Instead of using {{Date}} in various places in code, I think we can use
{{SystemClock}} or {{MonotonicClock}}.
8. In a few places, I can see code like the following.
{code}
@@ -92,9 +973,24 @@ public synchronized CSAssignment assignContainers(Resource
clusterResource,
application, node.getPartition(), currentResourceLimits)) {
application.updateAMContainerDiagnostics(AMState.ACTIVATED,
"User capacity has reached its maximum limit.");
+ recordActivity(node, getQueueName(),
+ application.getApplicationId().toString(),
+ application.getPriority().toString(), ActivityState.REJECTED,
+ ActivityDiagnosticConstant.USER_CAPACITY_MAXIMUM_LIMIT,
+ AllocationActivityType.app);
+ updateActivityState(node, ActivityState.SKIPPED,
+ ActivityDiagnosticConstant.EMPTY);
+ finishAppAllocationRecording(application.getApplicationId(),
+ ActivityState.REJECTED);
continue;
}
{code}
This is basically an error-handling piece of code, but it takes 3 steps to
record the activity. I think we need to optimize this into a single API call to
{{activityManager}}. If multiple apps/nodes are interested, {{activityManager}}
needs to fan the call out to each of them. On the scheduler side, it is better
to keep this simple and clean.
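To make point 8 concrete, here is a rough sketch of what a single entry point could look like. All names here ({{ActivityRecorder}}, {{recordAppRejected}}, the string-based event list) are hypothetical and only for illustration; they are not the patch's API. The idea is just that the scheduler makes one call and the recorder performs the three steps internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the API from the patch.
enum ActivityState { ACCEPTED, SKIPPED, REJECTED }

final class ActivityRecorder {
    // Events are kept as plain strings only to keep the sketch small.
    private final List<String> events = new ArrayList<>();

    // Single entry point for a rejected application. It records the
    // app-level activity, marks the node as skipped, and closes the
    // application's allocation record, so callers need only one call.
    synchronized void recordAppRejected(String nodeId, String queueName,
            String appId, String priority, String diagnostic) {
        events.add("app " + appId + " queue " + queueName + " priority "
                + priority + " " + ActivityState.REJECTED + ": " + diagnostic);
        events.add("node " + nodeId + " " + ActivityState.SKIPPED);
        events.add("finish " + appId + " " + ActivityState.REJECTED);
    }

    // Returns a copy so callers cannot mutate the internal list.
    synchronized List<String> events() {
        return new ArrayList<>(events);
    }
}
```

With something like this, the {{LeafQueue}} branch above would shrink to one {{recordAppRejected(...)}} call followed by {{continue}}.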
*Minor nits:*
turnOffAppUpdate -> turnOffActivityMonitoringForApp
addActivity -> addSchedulingActivityForNode
addAppActivity -> addSchedulingActivityForApp
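For points 1 and 2 above, a minimal sketch of the idea, assuming the map is keyed by node id. The class {{NodeRecordingTracker}} and the methods other than {{startNodeUpdateRecording}} are made up for illustration, and the map value (a start timestamp) is a placeholder. Note that this uses an atomic compare-and-set on the flag instead of a synchronized method; either approach should let only one heartbeat claim a pending request.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical class; field names mirror the ones discussed above.
final class NodeRecordingTracker {
    // Node id -> time recording started (placeholder value type).
    private final ConcurrentHashMap<String, Long> recordingNodesAllocation =
            new ConcurrentHashMap<>();
    // Atomic flag, so concurrent heartbeats cannot both observe "true".
    private final AtomicBoolean recordNextAvailableNode =
            new AtomicBoolean(false);

    // Called when a user asks for the next heartbeat to be recorded.
    void requestRecording() {
        recordNextAvailableNode.set(true);
    }

    // Returns true only for the single heartbeat that claims the request.
    boolean startNodeUpdateRecording(String nodeId) {
        if (recordNextAvailableNode.compareAndSet(true, false)) {
            recordingNodesAllocation.putIfAbsent(nodeId, System.nanoTime());
            return true;
        }
        return false;
    }

    // The key set serves as the set of active nodes, so no second
    // collection has to be kept in sync with the map.
    Set<String> activeNodes() {
        return recordingNodesAllocation.keySet();
    }

    void finishRecording(String nodeId) {
        recordingNodesAllocation.remove(nodeId);
    }
}
```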
I will try to look further into the scheduler code where we record activities
and will share more comments, if any. Thank you.
> Improvement: Introduce more debug/diagnostics information to detail out
> scheduler activity
> ------------------------------------------------------------------------------------------
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, resourcemanager
> Affects Versions: 2.7.0
> Reporter: Sunil G
> Assignee: Chen Ge
> Attachments: Improvement on debugdiagnostic information - YARN.pdf,
> YARN-4091-design-doc-v1.pdf, YARN-4091.preliminary.1.patch
>
>
> As schedulers are improved with various new capabilities, more configurations
> that tune the schedulers start to take actions such as limiting container
> assignment to an application or introducing delay in allocating containers.
> No clear information is passed down from the scheduler to the outside world
> in these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more defined states at the various
> points in the scheduler where it skips/rejects container assignment, activates
> an application, etc. Such information will help users know what is happening
> in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve
> on this as we discuss.