[ 
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382513#comment-15382513
 ] 

Sunil G commented on YARN-4091:
-------------------------------

Thanks [~ChenGe] and [~leftnoteasy]. As discussed offline, we will now track 
only one heartbeat activity per request, so my points 1 and 2 are fine. I have 
also gone through the code a little.

*A few additional comments:*
1. I think it is better to synchronize {{startNodeUpdateRecording}}, as it can be 
invoked from multiple node heartbeats at the same time. Also, some variables, 
such as {{recordNextAvailableNode}}, are not atomic.
2. {{recordingNodesAllocation}} itself might be enough to track the active 
nodes that are doing scheduling activity. Since it is a concurrent hash map, 
we can work with its key set. If so, we can drop {{activeRecordedNodes}}.
3. If a node dies while we were recording activity for it, that data is never 
flushed out. A timer mechanism may be needed just for that case.
4. I think all the new activity monitor classes can be placed under a new package, 
{{org/apache/hadoop/yarn/server/resourcemanager/scheduler/activitymonitor}}.
5. Could we place {{activityManager}} in {{CapacitySchedulerContext}}? Then we 
can use a getter method rather than threading it through every method signature 
in {{AbstractCSQueue}} etc.
6. I think we should not have activity start/stop/update code scattered across 
the scheduler code. In {{LeafQueue}} we now have separate methods such as 
{{recordActivity}}, {{finishAppAllocationRecording}} etc., and the allocate code 
is similar. I think all such methods should live inside {{activityManager}}, and 
those APIs need to be public (with proper javadoc). It is better to have a 
clearer interface from {{activityManager}}. Even a new intermediate util/helper 
class that works as a wrapper around {{activityManager}} is fine. That way we 
can pull all this extra code out of the scheduler.
7. Instead of using {{Date}} in various places in code, I think we can use 
{{SystemClock}} or {{MonotonicClock}}.
8. In a few places, I can see the code below.
{code}
@@ -92,9 +973,24 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
           application, node.getPartition(), currentResourceLimits)) {
         application.updateAMContainerDiagnostics(AMState.ACTIVATED,
             "User capacity has reached its maximum limit.");
+        recordActivity(node, getQueueName(),
+            application.getApplicationId().toString(),
+            application.getPriority().toString(), ActivityState.REJECTED,
+            ActivityDiagnosticConstant.USER_CAPACITY_MAXIMUM_LIMIT,
+            AllocationActivityType.app);
+        updateActivityState(node, ActivityState.SKIPPED,
+            ActivityDiagnosticConstant.EMPTY);
+        finishAppAllocationRecording(application.getApplicationId(),
+            ActivityState.REJECTED);
         continue;
       }
{code}
This is basically an error-handling path, but it takes three steps to record 
the activity. I think we should optimize this into a single API call to 
{{activityManager}}. If multiple apps/nodes are interested, {{activityManager}} 
needs to fan out and make the separate calls itself. From the scheduler side, 
it is better to keep this simple and clean.
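
A minimal sketch of how points 1, 2, 3, 7 and 8 could fit together. Everything here is hypothetical and is not taken from the YARN-4091 patch: the class name {{ActivityRecorder}}, its methods, and the injected millisecond clock (a stand-in for something like {{MonotonicClock}}) are assumptions made for illustration only. The idea is that the scheduler makes one call per event and the recorder owns the concurrency, the bookkeeping, and the cleanup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongSupplier;

/** Hypothetical facade; names and shape are assumptions, not the patch's API. */
final class ActivityRecorder {
  enum State { REJECTED, SKIPPED }

  /** Per-node record: start time plus the activity entries collected so far. */
  static final class NodeRecord {
    final long startedAtMs;
    final List<String> entries = new CopyOnWriteArrayList<>();
    NodeRecord(long startedAtMs) { this.startedAtMs = startedAtMs; }
  }

  // Point 2: a single concurrent map; its key set is the set of active nodes.
  private final ConcurrentHashMap<String, NodeRecord> recordingNodes =
      new ConcurrentHashMap<>();
  // Point 7: an injected monotonic clock (milliseconds) instead of Date.
  private final LongSupplier monotonicMs;

  ActivityRecorder(LongSupplier monotonicMs) { this.monotonicMs = monotonicMs; }

  /** Convenience factory backed by System.nanoTime(), which is monotonic. */
  static ActivityRecorder withSystemClock() {
    return new ActivityRecorder(() -> System.nanoTime() / 1_000_000L);
  }

  /** Point 1: putIfAbsent lets exactly one of several racing heartbeats win. */
  boolean startNodeUpdateRecording(String nodeId) {
    NodeRecord fresh = new NodeRecord(monotonicMs.getAsLong());
    return recordingNodes.putIfAbsent(nodeId, fresh) == null;
  }

  Set<String> activeNodes() { return recordingNodes.keySet(); }

  /** Point 8: one call from the scheduler replaces the three separate steps. */
  void recordAppRejected(String nodeId, String queue, String appId, String diag) {
    recordingNodes.computeIfPresent(nodeId, (id, rec) -> {
      rec.entries.add("app " + appId + " in " + queue + ": "
          + State.REJECTED + " (" + diag + ")");
      rec.entries.add("node " + id + ": " + State.SKIPPED);
      rec.entries.add("app " + appId + " allocation finished: " + State.REJECTED);
      return rec;
    });
  }

  List<String> entriesFor(String nodeId) {
    NodeRecord rec = recordingNodes.get(nodeId);
    return rec == null ? Collections.<String>emptyList() : new ArrayList<>(rec.entries);
  }

  /** Point 3: drop records older than maxAgeMs, e.g. from a periodic timer. */
  int flushOlderThan(long maxAgeMs) {
    long now = monotonicMs.getAsLong();
    int removed = 0;
    for (Map.Entry<String, NodeRecord> e : recordingNodes.entrySet()) {
      boolean stale = now - e.getValue().startedAtMs > maxAgeMs;
      if (stale && recordingNodes.remove(e.getKey(), e.getValue())) {
        removed++;
      }
    }
    return removed;
  }
}
```

With something along these lines, the {{LeafQueue}} hunk above would shrink to a single {{recordAppRejected(...)}} call before the {{continue}}, and a periodic task calling {{flushOlderThan(...)}} would cover nodes that stop heartbeating.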


*Minor nits:*

turnOffAppUpdate -> turnOffActivityMonitoringForApp
addActivity -> addSchedulingActivityForNode
addAppActivity -> addSchedulingActivityForApp

I will try to look further into the scheduler code where we record activities 
and will share more comments, if any. Thank you.

> Improvement: Introduce more debug/diagnostics information to detail out 
> scheduler activity
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4091
>                 URL: https://issues.apache.org/jira/browse/YARN-4091
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Chen Ge
>         Attachments: Improvement on debugdiagnostic information - YARN.pdf, 
> YARN-4091-design-doc-v1.pdf, YARN-4091.preliminary.1.patch
>
>
> As schedulers are improved with various new capabilities, more configurations 
> that tune the schedulers start to take actions such as limiting container 
> assignment to an application or introducing a delay before allocating a container. 
> No clear information is passed from the scheduler to the outside world in 
> these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more well-defined states at the points in 
> the scheduler where it skips or rejects container assignment, activates an 
> application, etc. Such information will help users know what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve 
> on this as we discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

