[
https://issues.apache.org/jira/browse/YARN-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402492#comment-15402492
]
Eric Payne commented on YARN-4091:
----------------------------------
[~ChenGe], thank you for your work on this feature. I am sorry for the delay in
my response.
{quote}
If running in previous patch without changes, one node heartbeat costs 0.2ms
approximately. If we only record application activities, the difference of
running time is unnoticeable, less than 0.01 ms. But if we record a complete
node heartbeat activities, the running time for each node heartbeat is 0.6ms,
which is about 3X compared to the baseline. However, in practice, only a few
nodes' activities will be recorded at the same time. For example, suppose 30
nodes' activities are being recorded at the same time (which is already a huge
number to me). Compared to the time cost by 2000 node heartbeats, the time to
record activities is small (around 3% more overhead), so it is negligible and
acceptable.
{quote}
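For reference, the roughly 3% figure in the quote appears to follow from simple arithmetic on the reported timings. The sketch below is only my reconstruction, not code from the patch: the class and method names are hypothetical, and it assumes the extra cost of a recorded heartbeat is the difference between the 0.6ms and 0.2ms measurements.

```java
// Hypothetical helper, not part of YARN: reproduces the overhead estimate
// from the per-heartbeat timings reported in the quoted comment.
final class HeartbeatOverhead {

    /**
     * Extra scheduling time caused by recording, relative to the baseline
     * cost of processing one heartbeat from every node.
     */
    static double relativeOverhead(double baselineMs, double recordedMs,
                                   int recordedNodes, int totalNodes) {
        // Assumption: a recorded heartbeat costs (recordedMs - baselineMs) extra.
        double extraMs = recordedNodes * (recordedMs - baselineMs);
        double baselineTotalMs = totalNodes * baselineMs;
        return extraMs / baselineTotalMs;
    }

    public static void main(String[] args) {
        // Numbers from the quoted comment: 0.2 ms baseline, 0.6 ms when
        // recording, 30 recorded nodes out of 2000.
        double partial = relativeOverhead(0.2, 0.6, 30, 2000);
        // Assumed worst case: every node is recorded on every heartbeat.
        double all = relativeOverhead(0.2, 0.6, 2000, 2000);
        System.out.printf("30 of 2000 nodes: %.1f%%%n", partial * 100);
        System.out.printf("all 2000 nodes:   %.1f%%%n", all * 100);
    }
}
```

With 30 of 2000 nodes recorded, this gives about 3%, which matches the quote. If every node is recorded on every heartbeat, the same formula gives about 200% extra time, which is the 3X figure applied to the whole cluster.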
I would be interested to know how you gathered this information. Also, how are
you limiting the number of nodes whose state is being logged?
I am concerned about the performance load this feature will add to the resource
manager. I have analyzed the code and experimented with the feature on a 3-node
cluster. It appears that the state is being recorded for every node on every
heartbeat:
{code}
case NODE_UPDATE:
{
  ...
  if (!scheduleAsynchronously) {
    ActivitiesLogger.NODE.startNodeUpdateRecording(activitiesManager,
        node.getNodeID());
    allocateContainersToNode(getNode(node.getNodeID()));
    ActivitiesLogger.NODE.finishNodeUpdateRecording(activitiesManager,
        node.getNodeID());
    ...
{code}
And, from my experimentation,
{{ActivitiesLogger.NODE.startNodeUpdateRecording}} is always called, and it is
almost always followed by a call to one of the
{{ActivitiesLogger.NODE.finish*}} methods. If this is happening every
heartbeat, I am afraid that it will put a great strain on the resource manager.
Can you please comment?
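To make the question about limiting recorded nodes concrete, here is a minimal sketch of the kind of per-node gating I would expect. It is purely illustrative: {{NodeRecordingGate}} and its methods are hypothetical names, node IDs are plain strings for simplicity, and I am not claiming the patch works (or should work) exactly this way.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the patch's ActivitiesManager: recording is
// enabled per node on request and consumed by that node's next heartbeat.
final class NodeRecordingGate {
    // Node IDs with a pending recording request.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    // Counts heartbeats that were actually recorded (for illustration only).
    private final AtomicInteger recordedHeartbeats = new AtomicInteger();

    /** Called when someone asks for one node's activities, e.g. via REST. */
    void requestRecording(String nodeId) {
        pending.add(nodeId);
    }

    /** Called on every heartbeat; does recording work only when requested. */
    void onNodeUpdate(String nodeId, Runnable allocate) {
        // remove() returns true only if a request was pending, so each
        // request records exactly one heartbeat for that node.
        boolean record = pending.remove(nodeId);
        if (record) {
            // Stand-in for the start/finish recording calls.
            recordedHeartbeats.incrementAndGet();
        }
        allocate.run();
    }

    int recordedHeartbeats() {
        return recordedHeartbeats.get();
    }

    public static void main(String[] args) {
        NodeRecordingGate gate = new NodeRecordingGate();
        gate.requestRecording("node-1");
        gate.onNodeUpdate("node-1", () -> { });  // recorded
        gate.onNodeUpdate("node-1", () -> { });  // request already consumed
        gate.onNodeUpdate("node-2", () -> { });  // never requested
        System.out.println("recorded heartbeats: " + gate.recordedHeartbeats());
    }
}
```

In this sketch only one of the three heartbeats is recorded, so the per-heartbeat cost for nodes nobody asked about is a single set lookup.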
> Add REST API to retrieve scheduler activity
> -------------------------------------------
>
> Key: YARN-4091
> URL: https://issues.apache.org/jira/browse/YARN-4091
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler, resourcemanager
> Affects Versions: 2.7.0
> Reporter: Sunil G
> Assignee: Chen Ge
> Attachments: Improvement on debugdiagnostic information - YARN.pdf,
> SchedulerActivityManager-TestReport v2.pdf,
> SchedulerActivityManager-TestReport.pdf, YARN-4091-design-doc-v1.pdf,
> YARN-4091.1.patch, YARN-4091.2.patch, YARN-4091.3.patch, YARN-4091.4.patch,
> YARN-4091.5.patch, YARN-4091.5.patch, YARN-4091.preliminary.1.patch,
> app_activities.json, node_activities.json
>
>
> As schedulers are improved with various new capabilities, more of the
> configurations that tune them start to take actions such as limiting the
> assignment of containers to an application or delaying container allocation.
> No clear information is passed from the scheduler to the outside world in
> these scenarios, which makes debugging much harder.
> This ticket is an effort to introduce more clearly defined states at the
> points in the scheduler where it skips or rejects a container assignment,
> activates an application, etc. Such information will help users understand
> what is happening in the scheduler.
> Attaching a short proposal for initial discussion. We would like to improve
> on this as we discuss.