[ 
https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310242#comment-14310242
 ] 

Sangjin Lee commented on YARN-2928:
-----------------------------------

[~hitesh], continuing that discussion,

{quote}
[~vinodkv] Should have probably added more context from the design doc:
"We assume that the failure semantics of the ATS writer companion is the same 
as the AM. If the ATS writer companion fails for any reason, we try to bring it 
back up up to a specified number of times. If the maximum retries are 
exhausted, we consider it a fatal failure, and fail the application."
{quote}

Yes, I definitely could add more color to that point. I plan to update the 
design doc, hopefully some time next week, as a number of clarifications have 
been made.

In the per-app timeline aggregator (a.k.a. ATS writer companion) model, the 
aggregator is a special container, and we need to be able to allocate both the 
timeline aggregator and the AM, or neither. We also want to be able to 
co-locate the AM and the aggregator on the same node, so the RM needs to 
negotiate that combined capacity atomically. In other words, we don't want to 
end up in a situation where we were able to allocate the aggregator but not the 
AM, or vice versa. If the AM needs 2 GB and the timeline aggregator needs 1 GB, 
then the pair needs to go to a node on which 3 GB can be allocated at that time.
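
To make the all-or-nothing requirement concrete, here is a minimal sketch of 
pair allocation on a single node. It is only an illustration under stated 
assumptions: PairAllocator, addNode, freeMb and allocatePair are hypothetical 
names, not YARN scheduler APIs, and the model tracks nothing but free memory 
per node.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of atomic AM + aggregator placement; not YARN code. */
final class PairAllocator {
    // Free memory (in MB) per node, in insertion order.
    private final Map<String, Integer> freeMbByNode = new LinkedHashMap<>();

    void addNode(String node, int freeMb) {
        freeMbByNode.put(node, freeMb);
    }

    int freeMb(String node) {
        return freeMbByNode.getOrDefault(node, 0);
    }

    /**
     * Reserves amMb + aggregatorMb on one node, or reserves nothing at all.
     * Returns the chosen node, or empty if no single node can hold the pair.
     */
    Optional<String> allocatePair(int amMb, int aggregatorMb) {
        int combinedMb = amMb + aggregatorMb;
        for (Map.Entry<String, Integer> entry : freeMbByNode.entrySet()) {
            if (entry.getValue() >= combinedMb) {
                // Both containers are charged to the same node in one step.
                entry.setValue(entry.getValue() - combinedMb);
                return Optional.of(entry.getKey());
            }
        }
        // No partial allocation: neither the AM nor the aggregator is placed.
        return Optional.empty();
    }
}
```

With a 2048 MB AM and a 1024 MB aggregator, a node with only 2048 MB free is 
skipped even though the AM alone would fit there, because the pair is only 
ever placed as a unit.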

In terms of the failure scenarios, we may need to hash out some more details. 
Since allocation is considered as a pair, it is also natural to consider their 
failure semantics in the same manner. But a deeper question is, if the AM came 
up but the timeline aggregator didn't come up (for resource reasons or 
otherwise), do we consider that an acceptable situation? If the timeline 
aggregator for that app cannot come up, should that be considered fatal? Or, if 
apps are running but are not logging critical lifecycle events and the like 
because the timeline aggregator went down, do we consider that situation 
acceptable? The sense of the discussion was that it is probably not acceptable: 
if it were a common occurrence, it would leave a large hole in the collected 
timeline data and significantly reduce its overall value.
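
For reference, the retry semantics quoted from the design doc above could be 
sketched roughly as follows. This is only an illustration: AggregatorSupervisor 
and launchWithRetries are hypothetical names, the launch attempt is abstracted 
as a BooleanSupplier, and it assumes the "fail the application" policy from the 
quote, which is exactly the part still open for discussion.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the quoted retry policy; not YARN code. */
final class AggregatorSupervisor {
    enum Outcome { RUNNING, APPLICATION_FAILED }

    /**
     * Tries to bring the aggregator up once, then retries up to maxRetries
     * more times. If every attempt fails, the failure is treated as fatal
     * for the application.
     */
    static Outcome launchWithRetries(BooleanSupplier launch, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (launch.getAsBoolean()) {
                return Outcome.RUNNING;
            }
        }
        // Maximum retries exhausted: fail the application.
        return Outcome.APPLICATION_FAILED;
    }
}
```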

That said, this point is deferred somewhat because we're initially starting out 
with a per-node aggregator option, which somewhat (though not completely) 
sidesteps this issue.

> Application Timeline Server (ATS) next gen: phase 1
> ---------------------------------------------------
>
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal 
> v1.pdf
>
>
> We have the application timeline server implemented in yarn per YARN-1530 and 
> YARN-321. Although it is a great feature, we have recognized several critical 
> issues and features that need to be addressed.
> This JIRA proposes the design and implementation changes to address those. 
> This is phase 1 of this effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
