[
https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310242#comment-14310242
]
Sangjin Lee commented on YARN-2928:
-----------------------------------
[~hitesh], continuing that discussion,
{quote}
[~vinodkv] Should have probably added more context from the design doc:
"We assume that the failure semantics of the ATS writer companion is the same
as the AM. If the ATS writer companion fails for any reason, we try to bring it
back up, up to a specified number of times. If the maximum retries are
exhausted, we consider it a fatal failure, and fail the application."
{quote}
Yes, I definitely could add more color to that point. I'm going to update the
design doc, as a number of clarifications have been made; hopefully some time
next week.
In the per-app timeline aggregator (a.k.a. ATS writer companion) model, the
aggregator runs as a special container. We need to be able to allocate both the
timeline aggregator and the AM, or neither, and we also want to be able to
co-locate the AM and the aggregator on the same node. The RM therefore needs to
negotiate that combined capacity atomically. In other words, we don't want a
situation where we were able to allocate the aggregator but not the AM, or vice
versa. If the AM needs 2 GB and the timeline aggregator needs 1 GB, then this
pair needs to go to a node on which 3 GB can be allocated at that time.
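To make the arithmetic concrete, here is a minimal, illustrative sketch (not
actual RM/scheduler code) of how the combined ask could be computed with the
existing {{Resource}} / {{Resources}} helpers; the 2 GB / 1 GB capacities are
the hypothetical values from the example above:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class CombinedAllocationSketch {
  public static void main(String[] args) {
    // Hypothetical capacities: 2 GB / 1 vcore for the AM,
    // 1 GB / 1 vcore for the per-app timeline aggregator.
    Resource amCapability = Resource.newInstance(2048, 1);
    Resource aggregatorCapability = Resource.newInstance(1024, 1);

    // The RM would have to find a single node that can satisfy the sum of the
    // two capabilities at the same time, rather than two independent asks.
    Resource combined = Resources.add(amCapability, aggregatorCapability);

    System.out.println("Combined ask to negotiate atomically: " + combined);
  }
}
{code}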
In terms of the failure scenarios, we may need to hash out some more details.
Since allocation is considered as a pair, it is also natural to consider their
failure semantics in the same manner. But a deeper question is: if the AM came
up but the timeline aggregator didn't (for resource reasons or otherwise), do
we consider that an acceptable situation? If the timeline aggregator for that
app cannot come up, should that be considered fatal? Or, if apps are running
but not logging critical lifecycle events, etc., because the timeline
aggregator went down, do we consider that acceptable? The discussion was that
it is probably not: if it were a common occurrence, it would leave a large hole
in the collected timeline data, and the overall value of the timeline data
would go down significantly.
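To illustrate the retry-then-fatal semantics from the design doc quoted above,
here is a rough sketch; the retry limit, class, and method names are all
hypothetical placeholders for discussion, not an actual YARN API:
{code:java}
// Sketch only: per-app aggregator is restarted up to a configured maximum
// number of times; once retries are exhausted the failure is treated as
// fatal and the application itself is failed.
public class AggregatorRetrySketch {
  // Hypothetical limit; in practice this would be a YARN configuration knob.
  private static final int MAX_AGGREGATOR_RETRIES = 3;

  private int retries = 0;

  /** Called when the per-app timeline aggregator container exits abnormally. */
  boolean onAggregatorFailure() {
    retries++;
    if (retries <= MAX_AGGREGATOR_RETRIES) {
      relaunchAggregator();   // try to bring the companion back up
      return true;            // the application keeps running
    }
    failApplication();        // retries exhausted: consider it a fatal failure
    return false;
  }

  private void relaunchAggregator() { /* placeholder */ }
  private void failApplication() { /* placeholder */ }
}
{code}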
That said, this point is somewhat deferred because initially we're starting out
with a per-node aggregator option, which sidesteps (though not completely) this
issue.
> Application Timeline Server (ATS) next gen: phase 1
> ---------------------------------------------------
>
> Key: YARN-2928
> URL: https://issues.apache.org/jira/browse/YARN-2928
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: timelineserver
> Reporter: Sangjin Lee
> Priority: Critical
> Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal v1.pdf
>
>
> We have the application timeline server implemented in yarn per YARN-1530 and
> YARN-321. Although it is a great feature, we have recognized several critical
> issues and features that need to be addressed.
> This JIRA proposes the design and implementation changes to address those.
> This is phase 1 of this effort.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)