[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240801#comment-16240801 ] Jason Lowe commented on YARN-7272:
--
bq. Another possible case to handle is the case where storage is down i.e. instead of waiting for sync entity call to wait, it can be potentially committed to WAL till backend is unavailable. We can potentially explore this option.

My guess is that this is going to be problematic because:
# By the time you get a robust, performant WAL implemented on HDFS, you've practically reinvented the core of HBase.
# The point of having a synchronous call is to tell the client, "yes, I promise this has been persisted to the ATS database," yet with only a WAL write it hasn't been. If the AM side-band signals another client to start reading from ATS, that other client will not see those writes despite the AM's synchronous call to the collector returning success. The synchronous call cannot return until HBase says it has the data.

In that sense, I don't see the WAL as so much a fault tolerance tool. Instead I see it as a performance enhancement tool: it can buffer more asynchronous events before blocking the caller, or potentially recover more asynchronous events in the case of a collector crash. The latter requires a lot of work, to the point where I can see us essentially requiring or reinventing systems like Apache BookKeeper. I don't see how the WAL helps in the synchronous call scenario, since the whole point of the synchronous call is to guarantee the result appears in the ATSv2 database.
> Enable timeline collector fault tolerance
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineclient, timelinereader, timelineserver
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
> Attachments: YARN-7272-wip.patch
>
> If a NM goes down and along with it the timeline collector aux service for a running yarn app, we would like that yarn app to re-establish connection with a new timeline collector.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240754#comment-16240754 ] Varun Saxena commented on YARN-7272:
Sorry for coming in a little late on this discussion, although we did discuss it during the call.

The primary objective of fault tolerance is to ensure that the entities which are guaranteed to be written by Timeline Service v2 are not lost. But writing every entity to some sort of WAL implementation would be expensive.

Now, we have 2 kinds of entity writes: sync and async. Sync entities are guaranteed to be written to the backend via the collector, or an exception is returned, even for server-side failures; i.e. we indicate to the client that an entity could not be written all the way to the backend so that it can retry or take some other suitable action. Async entities, as the name suggests, are written asynchronously. They are not guaranteed to be written to the backend, by design. We initially cache them on the client side for some time, or until a sync entity arrives, then combine them and send them to the collector. Moreover, if any exception occurs while writing to the backend, the result is not propagated back to the client; we only throw exceptions for client-side failures. Async entities are later cached in the HBase writer implementation too, inside the collector, before being flushed to HBase.

Sync writes should hence be used for publishing important events, while async writes should be used for less important events whose loss would not be a big deal in case of a failure. For instance, publishing metric values every N seconds can be an asynchronous write, unless the metric is very important, say, used for billing. Keeping this in mind, a client can do synchronous writes if it cares about the durability of entity data.

Furthermore, asynchronous writes can have other points of failure too. For instance, the collector can crash while writing an async entity to the WAL.
In this case, we currently do not propagate the error to the timeline client, i.e. the client would not know which entity writes have failed. Another possible case to handle is storage being down: instead of making the sync entity call block, the entity could potentially be committed to the WAL while the backend is unavailable. We can explore this option, say, for cases where the HBase cluster runs separately from the cluster where ATS is running. For HBase, would HBaseAdmin#checkHBaseAvailable be sufficient to check whether HBase storage is down? Thoughts?
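The "commit to WAL while the backend is unavailable" option can be sketched roughly as follows. This is a hypothetical illustration, not ATSv2 code: all names are invented, a plain list stands in for a durable WAL, and `backend_available` is assumed to wrap some availability probe such as the HBaseAdmin#checkHBaseAvailable call mentioned above. Note it only changes when the caller blocks; it does not by itself preserve the sync call's durability guarantee.

```python
class WalBufferedWriter:
    """While storage is down, writes go to a WAL instead of blocking the
    caller; the WAL is drained in arrival order once storage returns."""

    def __init__(self, backend_available, backend_write):
        self.backend_available = backend_available  # callable -> bool
        self.backend_write = backend_write          # writes one entity
        self.wal = []  # stand-in for a durable write-ahead log

    def put_entity(self, entity):
        if self.backend_available():
            self.drain()                # replay anything logged while down
            self.backend_write(entity)
        else:
            self.wal.append(entity)     # held in WAL until backend returns

    def drain(self):
        while self.wal:
            self.backend_write(self.wal.pop(0))
```

Whether a WAL entry may count as "written" for a sync call is exactly the semantic question debated in this thread; the sketch only shows the buffering mechanics.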
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240021#comment-16240021 ] Rohith Sharma K S commented on YARN-7272:
-
thanks [~vrushalic] for putting up the summary. Adding to the above points, some of the pros and cons discussed in the call:

Pros:
# An additional WAL layer would help recover async entities, ensuring that no entities sent by TimelineV2Clients to collectors are lost. This JIRA primarily tries to address two kinds of downtime: the collector JVM going down, and the collector machine itself going down.
# The WAL layer is an independent service running in the collector and is not tightly bound to the backend storage. This enables recovery of async entities regardless of which backend storage is plugged in.

Cons:
# Ensuring all async entities are written into the WAL would be a costly operation, because multiple client requests will be waiting to write into HDFS. This creates contention on the WAL to ensure atomicity, which slows down request processing from timeline clients.
# Storing entities in the WAL duplicates effort, since they are also stored in the backend storage!
# Since we keep only the last 1 minute of data, every collector flush would also require renaming the file in HDFS. This leads to entity files spread across the cluster and slower writes, since a local write is always faster than a remote one. We probably need to think about how to use a single file for the whole collector lifetime while keeping track of only the last 1 minute of entities. I see a *truncate* API in HDFS; we need to check what this API does.

I think _if the cost of flushing into the WAL for every async API call is greater than or equal to the cost of flushing into HBase (as of now), then it is better to flush into HBase directly_. But that trade-off is tightly coupled to the backend storage cost!
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238427#comment-16238427 ] Vrushali C commented on YARN-7272:
--
This is the JIRA for fault tolerance of the timeline collector, as discussed in the call. cc [~jlowe] [~jrottinghuis] [~djp]
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236796#comment-16236796 ] Vrushali C commented on YARN-7272:
--
Sharing some thoughts. Collector fault tolerance helps deal with two things:
- when the collector itself goes down
- when the data in the memory of the buffered mutator that has NOT yet been flushed to HBase is lost

The fault tolerance solution should have the ability to be turned on/off, and should be off by default. It should be a cluster-wide default as well as a client-specific setting. For example, some super critical application might require zero tolerance for timeline data loss, in which case it can be turned on for that specific app. For some other app, slightly different tuning may be preferable. And for all other apps, it should be possible to turn off writing to offline storage.
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220935#comment-16220935 ] Hadoop QA commented on YARN-7272:
-
-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 18m 25s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 4 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 0m 11s | Maven dependency ordering for branch |
| +1 | mvninstall | 13m 44s | trunk passed |
| +1 | compile | 17m 42s | trunk passed |
| +1 | checkstyle | 1m 5s | trunk passed |
| +1 | mvnsite | 1m 3s | trunk passed |
| +1 | shadedclient | 11m 57s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 45s | trunk passed |
| +1 | javadoc | 0m 47s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 11s | Maven dependency ordering for patch |
| +1 | mvninstall | 0m 44s | the patch passed |
| +1 | compile | 6m 55s | the patch passed |
| +1 | javac | 6m 55s | the patch passed |
| -0 | checkstyle | 1m 4s | hadoop-yarn-project/hadoop-yarn: The patch generated 34 new + 211 unchanged - 1 fixed = 245 total (was 212) |
| +1 | mvnsite | 1m 5s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 10m 28s | patch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 0m 49s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | javadoc | 0m 49s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 37s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 1m 2s | hadoop-yarn-server-timelineservice in the patch passed. |
| -1 | asflicense | 0m 30s | The patch generated 4 ASF License warnings. |
| | | 97m 57s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice |
| | Inconsistent synchronization of org.apache.hadoop.yarn.server.timelineservice.recovery.FileSystemWALstore.deleteLogPathRoot; locked 50% of time. Unsynchronized access at FileSystemWALstore.java:[line 345] |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:5b98639 |
| JIRA Issue | YARN-7272 |
| JIRA
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203129#comment-16203129 ] Rohith Sharma K S commented on YARN-7272:
-
Update: I had an offline discussion with Vinod, and his concern is that the scope of this JIRA is limited to the auxiliary service that runs on the NodeManager. Given that launching app collectors as separate containers is a long-term goal (though not supported yet), the fault tolerance design should consider those use cases as well; otherwise we will end up redesigning the fault tolerance solution later. Thinking in terms of recovery for container-based app collectors, which also holds for auxiliary service recovery, storing the WAL in HDFS seems more appropriate.
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203096#comment-16203096 ] Vinod Kumar Vavilapalli commented on YARN-7272:
---
bq. In the 1st case, there will be outstanding unflushed entities in the app collector buffer. If the NM is restarted then it loses all the outstanding entities from the app collector buffer. So the scope of fault tolerance is restricted to NM JVM restart only.
bq. In the 2nd case, since the NM machine itself is down, all the running master containers are lost. The RM will launch these master containers on a different machine as a second attempt.

This assumes that the collector lives inside the NM. One of the design goals for large-scale apps is to fork the collector into its own container. When that is implemented, the above assumptions will be invalidated. We will have new fault scenarios where the collector and AM may run on different machines, only the collector dies and restarts on a different machine, etc.

bq. Since it is a fresh attempt, old attempt data is not very important to the end user. Considering this behavior, the 2nd case can be eliminated from fault tolerance consideration for app collectors.

If our goal is to take care of entity/event data in transit for 1 minute (assuming the collector flush interval is 1 minute), we should be equally concerned about data loss due to NM failure, machine failure, or HBase failures. Granted, an HBase client buffer solution is faster/cheaper than a leveldb solution, which is in turn faster/cheaper than writing a JobHistory-like WAL to HDFS. But the last one will encompass all those faults collectively, no?
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198201#comment-16198201 ] Rohith Sharma K S commented on YARN-7272:
-
thanks for clarifying my doubts!
bq. Is there a specific concern about using leveldb to implement the WAL for transient persistence?
We don't have any concerns about using leveldb. Given that delete operations can be performed, I would also highly recommend using leveldb.
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196949#comment-16196949 ] Jason Lowe commented on YARN-7272:
--
I'm not proposing we use leveldb for persisting the entities long-term, rather only for the duration between receipt from the client and the point where the ATSv2 backend acknowledges receipt. At that point the entries would be deleted from leveldb. Routine background compaction would prevent the database from growing to a point where recovery performance would be a concern. The NM state store already does this today, deleting container, resource, and application entries when we no longer need to recover them.

Is there a specific concern about using leveldb to implement the WAL for transient persistence? I just want to make sure we're not going to invent yet another WAL solution here, as there are many to choose from already.
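The receipt-to-acknowledgment lifecycle described here can be sketched with a plain dict standing in for the leveldb instance. This is an illustrative analogue only, with invented names; a real implementation would presumably use the same leveldbjni store the NM state store already depends on.

```python
class TransientWal:
    """Entries live only from client receipt until the backend acknowledges
    the write, mirroring the NM state store's delete-when-done pattern."""

    def __init__(self):
        self.db = {}  # stand-in for a leveldb instance keyed by entity id

    def on_receive(self, key, entity):
        self.db[key] = entity      # persist before attempting the backend write

    def on_backend_ack(self, key):
        self.db.pop(key, None)     # delete once the ATSv2 backend has it

    def recover(self):
        # after a collector restart, only unacknowledged entities remain,
        # so recovery replays just this small delta
        return dict(self.db)
```

In real leveldb the deletes become tombstones, which is why the background compaction mentioned above matters for keeping recovery fast.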
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196504#comment-16196504 ] Rohith Sharma K S commented on YARN-7272:
-
thanks Jason for your inputs! I am looking at not only read/write operations but also delete operations. The reason for deletes is that if we keep all entities in leveldb, it would become too heavy, as in ATSv1. Only outstanding entities that have not yet been flushed to the backend need to be kept in leveldb; after a flush succeeds, they are deleted from the WAL, which reduces its size. The advantage we get from this is that recovery will be faster with only the delta entities.
bq. Leveldb is already a dependency used in multiple places, and I'd hate to see us add yet another dependency or reinvent the wheel here.
Sorry, I didn't get the consensus to be taken: shouldn't ATSv2 use leveldb for WAL writers?
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194763#comment-16194763 ] Jason Lowe commented on YARN-7272:
--
Leveldb seems like a great fit for this, IMO. It has high write performance and works quite well in the nodemanager use case. This case seems identical in that the collector would write to the database and only read upon recovery. Leveldb is already a dependency used in multiple places, and I'd hate to see us add yet another dependency or reinvent the wheel here.
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194138#comment-16194138 ] Rohith Sharma K S commented on YARN-7272:
-
This proposal was discussed in the ATS weekly call, and one concern from [~varun_saxena] is the impact on performance if we use FileSystem. This needs to be validated before and after the WAL implementation. As part of this discussion, we also had thoughts on using leveldb for storing buffered entities; this also needs to be validated. Probably we can provide an interface for the WAL writer so that any efficient library can be plugged in, either local FS or leveldb!
[jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194127#comment-16194127 ] Rohith Sharma K S commented on YARN-7272:
-
thoughts on collector fault tolerance! Scenarios to consider for fault tolerance:
* NodeManager JVM restart:
** The NM is up and running but the HBase cluster is down.
** TimelineClient async APIs put entities into the app collector buffer, which is prone to losing data within the short span of the flush interval.
* The NM machine is lost, whether due to a network outage or split-brain issues.

In the 1st case, there will be outstanding unflushed entities in the app collector buffer. If the NM is restarted then it loses all the outstanding entities from the app collector buffer. So the scope of fault tolerance is restricted to NM JVM restart only. In the 2nd case, since the NM machine itself is down, all the running master containers are lost. The RM will launch these master containers on a different machine as a second attempt. Since it is a fresh attempt, old attempt data is not very important to the end user. Considering this behavior, the 2nd case can be eliminated from fault tolerance consideration for app collectors.

The approach is to provide a WAL in the app collector. The WAL contains only unflushed entity entries; any entities that have been flushed are removed from it. Once data is flushed, we rely on the backend's fault tolerance functionality. This keeps the WAL very small, i.e. at most the last 1 minute of data (1 minute is the flush interval in the app collector). I have planned to use the local FS to store the WALs.
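That flush-driven lifecycle can be simulated in a few lines. A toy sketch with invented names: one list stands in for the local-FS WAL and another for HBase, showing that the WAL holds at most one flush interval of data, which is exactly what recovery would replay after a crash.

```python
class AppCollectorWal:
    def __init__(self):
        self.backend = []   # stand-in for HBase
        self.wal = []       # stand-in for the local-FS WAL

    def put_async(self, entity):
        self.wal.append(entity)        # logged on receipt, before buffering

    def flush(self):
        # periodic flush (every ~1 minute in the collector); once the
        # backend has the data, the entries leave the WAL
        self.backend.extend(self.wal)
        self.wal.clear()

    def recover(self):
        # after a collector crash: replay only what the backend never saw
        return list(self.wal)
```

The WAL's size is thus bounded by the arrival rate times the flush interval, which is what makes recovery with only the delta entities fast.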