[
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194127#comment-16194127
]
Rohith Sharma K S commented on YARN-7272:
-----------------------------------------
Some thoughts on collector fault tolerance. The scenarios to consider for
fault tolerance are:
* NodeManager JVM restart.
** NM is up and running but the HBase cluster is down.
** The TimelineClient async API puts entities into the app collector buffer,
which is prone to losing data within the short span of the flush interval.
* The NM machine is lost, either due to a network outage or a split-brain
issue.
In the 1st case, there will be outstanding unflushed entities in the app
collector buffer. If the NM is restarted, it loses all the outstanding
entities in the app collector buffer. So the scope of fault tolerance is
restricted to NM JVM restart only.
In the 2nd case, the NM machine itself is down, so all master containers
running on it are lost. The RM will launch these master containers on a
different machine as a second attempt. Since it is a fresh attempt, the old
attempt's data is not very important to the end user. Given this behavior, the
2nd case can be excluded from the scope of app collector fault tolerance.
The approach is to provide a WAL (write-ahead log) in the app collector. The
WAL will contain only entries for unflushed entities; entities are removed
from the WAL once they are flushed. After a flush, we rely on the back end's
fault tolerance. This keeps the WAL very small, i.e. at most the last 1 minute
of data (1 minute is the flush interval in the app collector). I plan to use
LocalFS to store the WALs.
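To make the idea concrete, here is a minimal sketch of such a WAL on the
local file system. It is not the proposed patch: the class name CollectorWal,
its methods, the one-record-per-line file format, and the assumption that
entities are already serialized to newline-free strings are all hypothetical
and chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical per-application WAL that holds only entities which have not
 * yet been flushed to the back end. One serialized entity per line.
 */
final class CollectorWal {
  private final Path walFile;

  CollectorWal(Path localDir, String appId) {
    try {
      Files.createDirectories(localDir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    this.walFile = localDir.resolve(appId + ".wal");
  }

  /** Append an entity to the WAL before it enters the in-memory buffer. */
  synchronized void append(String serializedEntity) {
    byte[] record = (serializedEntity + "\n").getBytes(StandardCharsets.UTF_8);
    try {
      // SYNC asks the OS to persist each record before the call returns.
      Files.write(walFile, record, StandardOpenOption.CREATE,
          StandardOpenOption.APPEND, StandardOpenOption.SYNC);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Drop all records once the buffer has been flushed to the back end. */
  synchronized void truncateAfterFlush() {
    try {
      Files.deleteIfExists(walFile);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** After an NM JVM restart, return the entities that were never flushed. */
  synchronized List<String> recover() {
    if (!Files.exists(walFile)) {
      return Collections.emptyList();
    }
    try {
      return Files.readAllLines(walFile, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch the collector would call append() on every put, call
truncateAfterFlush() after each successful flush, and call recover() on NM
restart to re-buffer whatever is left. It deliberately ignores entities that
arrive while a flush is in progress; a real implementation would need
something like segment rotation so that those entries are not dropped along
with the flushed ones.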
> Enable timeline collector fault tolerance
> -----------------------------------------
>
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineclient, timelinereader, timelineserver
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.