[ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240021#comment-16240021 ]
Rohith Sharma K S commented on YARN-7272:
-----------------------------------------

Thanks [~vrushalic] for putting up the summary. Adding to the above points, some of the pros and cons discussed in the call are:

Pros:
# An additional WAL layer would help recover async entities. This ensures that no entities sent by TimelineV2Clients to collectors are lost. This JIRA primarily tries to address two kinds of downtime: the collector JVM going down, or the collector machine itself going down.
# The WAL layer is an independent service that runs on the collector. It is not tightly bound to the back-end storage. This enables recovery of async entities regardless of which back-end storage is plugged in.

Cons:
# Ensuring that all async entities are written into the WAL would be a costly operation, because requests from multiple clients will be waiting on the write into HDFS. This creates contention among requests writing into the WAL to ensure atomicity, which slows down request processing for TimelineClients.
# Storing entities in the WAL in addition to the back-end storage would be duplicated effort.
# Since we keep only the last 1 minute of data, every collector flush also requires renaming the file in HDFS. This leads to entity files being spread across the cluster, which makes writes slower, since a local write is always faster than a remote write. We probably need to think about how to work with a single file over the whole collector lifetime that keeps track of only the last 1 minute of entities. I see a *truncate* API in HDFS; we need to check what this API actually does.

I think _if the cost of flushing into the WAL for every async API call is greater than or equal to the cost of flushing into HBase (as of now), then it is better to flush into HBase directly_. But this approach is tightly coupled to the cost of the back-end storage.
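To make the single-file idea from the last con more concrete, here is a rough sketch of a WAL that lives in one file for the whole collector lifetime and is truncated after each successful back-end flush. It is only an illustration under several assumptions: it uses a local file through java.nio rather than HDFS, the class name {{EntityWal}} and its methods are hypothetical and not part of YARN or the attached patch, and each entity is assumed to be serialized as a single line with no embedded newlines.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

/** Hypothetical single-file WAL for async entities (illustration only, not YARN code). */
final class EntityWal implements AutoCloseable {
    private final FileChannel channel;

    EntityWal(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        // Continue after any entries left over from a previous run.
        channel.position(channel.size());
    }

    /**
     * Appends one entity and forces it to disk before returning.
     * The method is synchronized, so concurrent client requests
     * queue up here: this is the contention described in the cons.
     */
    synchronized void append(String serializedEntity) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(
                (serializedEntity + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(false);
    }

    /**
     * Called after the collector has flushed to the back-end storage.
     * Truncates the same file instead of renaming or rolling it.
     */
    synchronized void markFlushed() throws IOException {
        channel.truncate(0);
        channel.force(false);
    }

    /** Returns the entities that were logged but not yet marked as flushed. */
    static List<String> recover(Path file) throws IOException {
        if (!Files.exists(file)) {
            return Collections.emptyList();
        }
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    @Override
    public synchronized void close() throws IOException {
        channel.close();
    }
}
```

On restart, a new collector would call {{EntityWal.recover(path)}} and replay the returned entities into whatever back-end storage is plugged in. Whether the HDFS *truncate* API can play the role of {{markFlushed()}} here, and at what cost, is exactly what still needs to be checked.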
> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)