[
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810691#comment-16810691
]
Adam Antal commented on YARN-9050:
----------------------------------
Hi,
I just came across this jira and looked into the design doc you attached.
Though points like 1) and 2) are closely related to the capacity scheduler,
usability improvements like 3) seem to be greatly beneficial in the fair
scheduler as well. I lack the technical depth to have a definite opinion about
this, but do you see a rationale in this?
I think this was out of scope previously, but as long as the improvements are
not tightly coupled to the CS, it would make sense to me. Would you mind if some
parallel jiras are filed for integration with the FS?
> [Umbrella] Usability improvements for scheduler activities
> ----------------------------------------------------------
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements to scheduler activities based on
> YARN 3.1 in our cluster, as follows:
> 1. Not available for multi-thread asynchronous scheduling. App and node
> activities may be confused when multiple scheduling threads record activities
> of different allocation processes in the same variables, like appsAllocation
> and recordingNodesAllocation in ActivitiesManager. I think these variables
> should be thread-local to keep activities separate among multiple threads.
> 2. Incomplete activities for the multi-node lookup mechanism, since
> ActivitiesLogger skips recording through {{if (node == null ||
> activitiesManager == null)}} when node is null, which indicates that this
> allocation is for multiple nodes. We need to support recording activities
> for the multi-node lookup mechanism.
> 3. Current app activities cannot meet the requirements of diagnostics. For
> example, we can know that a node doesn't match a request, but it is hard to
> know why. Especially when using placement constraints, it's difficult to
> make a detailed diagnosis manually. So I propose to improve the diagnoses of
> activities: add a diagnosis for the placement constraints check, update the
> insufficient resource diagnosis with detailed info (like 'insufficient
> resource names:[memory-mb]'), and so on.
> 4. Add more useful fields for app activities. In some scenarios we need to
> distinguish different requests but can't locate requests based on app
> activities info. Some other fields, such as allocation tags, can help to
> filter what we want. We have added containerPriority, allocationRequestId
> and allocationTags fields in AppAllocation.
> 5. Filter app activities by key fields. Sometimes the results of app
> activities are massive and it's hard to find what we want. We have added
> support for filtering by allocation-tags to meet requirements from some
> apps. Moreover, we can take container-priority and allocation-request-id as
> candidates if necessary.
> 6. Aggregate app activities by diagnoses. For a single allocation process,
> activities can still be massive in a large cluster. We frequently want to
> know why a request can't be allocated in the cluster, and it's hard to check
> every node manually in a large cluster, so aggregation of app activities by
> diagnoses is necessary to solve this problem. We have added a groupingType
> parameter to the app-activities REST API for this, which supports grouping
> by diagnostics.
> I think we can have a discussion about these points. Useful improvements
> that are accepted will be added to the patch. Thanks.
> The running design doc is attached
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].
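
Point 1 in the description above proposes making the recording variables
thread-local. A minimal sketch of that idea follows, assuming a hypothetical
recorder class: ThreadLocalActivitiesSketch, record() and finishAllocation()
are illustrative names only and are not the actual ActivitiesManager fields
or API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual ActivitiesManager code. */
public class ThreadLocalActivitiesSketch {

  // Each scheduling thread gets its own buffer, so concurrent
  // allocation passes cannot interleave their records.
  private static final ThreadLocal<List<String>> RECORDING =
      ThreadLocal.withInitial(ArrayList::new);

  /** Record one activity for the allocation pass running on this thread. */
  public static void record(String activity) {
    RECORDING.get().add(activity);
  }

  /** Return this thread's activities and reset its buffer for the next pass. */
  public static List<String> finishAllocation() {
    List<String> done = new ArrayList<>(RECORDING.get());
    RECORDING.remove();
    return done;
  }

  public static void main(String[] args) throws InterruptedException {
    Runnable pass = () -> {
      String name = Thread.currentThread().getName();
      record(name + ": node skipped");
      record(name + ": container allocated");
      // Each thread prints only the activities it recorded itself.
      System.out.println(finishAllocation());
    };
    Thread t1 = new Thread(pass, "scheduler-1");
    Thread t2 = new Thread(pass, "scheduler-2");
    t1.start();
    t2.start();
    t1.join();
    t2.join();
  }
}
```

With a single shared list, the records of the two threads could end up mixed
in one allocation; with a per-thread buffer, each pass only sees its own
records.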
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)