[
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203096#comment-16203096
]
Vinod Kumar Vavilapalli commented on YARN-7272:
-----------------------------------------------
bq. In the 1st case, there will be outstanding unflushed entities in the app
collector buffer. If the NM is restarted, it loses all the outstanding entities
in the app collector buffer. So the scope of fault tolerance is restricted to
NM JVM restart only.
bq. In the 2nd case, the NM machine itself is down, so all the running master
containers are lost. The RM will launch these master containers on a different
machine as a second attempt.
This assumes that the collector lives inside the NM. One of the design goals
for large-scale apps is to fork the collector into its own container. Once that
is implemented, the above assumptions no longer hold. We will have new fault
scenarios: the collector and the AM may run on different machines, only the
collector may die and restart on a different machine, etc.
bq. Since it is a fresh attempt, the old attempt's data is not very important
to the end user. Considering this behavior, the 2nd case can be left out when
considering fault tolerance of app collectors.
If our goal is to take care of entity/event data in transit for 1 min (assuming
the collector flush interval is 1 min), we should be equally concerned about
data loss due to NM failure, machine failure, or HBase failure.
Granted, an HBase client buffer solution is faster/cheaper than a LevelDB
solution, which is in turn faster/cheaper than writing a JobHistory-like WAL to
HDFS. But the last one will encompass all those faults collectively, no?
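To make the WAL option concrete, here is a rough sketch of the
log-then-buffer pattern, not a proposal for the actual collector code. The
class name WalBackedBuffer and its methods are hypothetical, entities are
assumed to be single-line strings, and a local file stands in for the log only
so the example runs on its own. In the design discussed above the log would
have to live on HDFS to survive the loss of the machine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of a WAL-backed buffer; not a YARN API. */
public class WalBackedBuffer {
    private final Path wal;
    private final List<String> buffer = new ArrayList<>();

    public WalBackedBuffer(Path wal) throws IOException {
        this.wal = wal;
        // On (re)start, replay entities that were logged but never flushed.
        if (Files.exists(wal)) {
            buffer.addAll(Files.readAllLines(wal, StandardCharsets.UTF_8));
        }
    }

    /** Log the entity durably first, then keep it in the in-memory buffer. */
    public synchronized void append(String entity) throws IOException {
        Files.write(wal, (entity + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
        buffer.add(entity);
    }

    /** Hand the buffered entities to the backend, then truncate the log. */
    public synchronized void flush(Consumer<List<String>> sink) throws IOException {
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        // With no options, Files.write creates or truncates the file.
        Files.write(wal, new byte[0]);
    }

    /** Number of entities waiting for the next flush. */
    public synchronized int pending() {
        return buffer.size();
    }
}
```

A restarted collector that opens the same log sees the unflushed entities
again, whichever of the faults above caused the restart. If the sink throws,
the log is left untouched, so a later restart may replay entities that were
already partly written; the backend write would need to be idempotent.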
> Enable timeline collector fault tolerance
> -----------------------------------------
>
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineclient, timelinereader, timelineserver
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.