[ 
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240021#comment-16240021
 ] 

Rohith Sharma K S commented on YARN-7272:
-----------------------------------------

Thanks [~vrushalic] for putting up the summary. 
Adding to the above points, some of the pros and cons discussed in the call 
are:
Pros:
# An additional WAL layer would help recover async entities. This ensures that 
no entities sent by TimelineV2Clients to collectors are lost. This JIRA 
primarily tries to address two major downtime scenarios: the collector JVM 
going down, or the collector machine itself going down. 
# The WAL layer is an independent service that runs on the collector. It is not 
tightly bound to the back-end storage. This enables recovery of async entities 
regardless of which back-end storage is plugged in. 
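
To make the second point concrete, here is a minimal in-memory sketch of the 
log-first flow, assuming a collector that appends to the WAL before buffering 
and clears the WAL once the back end confirms a flush. The names 
{{BackendStore}}, {{WriteAheadLog}} and {{CollectorSketch}} are hypothetical 
and not part of YARN or the attached patch; a real WAL would be durable 
(for example a file on HDFS), not a list in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical back-end abstraction; HBase or any other store could sit behind it. */
interface BackendStore {
    void write(String entity);
}

/** Hypothetical in-memory stand-in for a durable write-ahead log. */
class WriteAheadLog {
    private final List<String> pending = new ArrayList<>();

    // Record the entity before it is handed to the back end.
    synchronized void append(String entity) {
        pending.add(entity);
    }

    // Called once the back end has confirmed a flush; logged entries can be dropped.
    synchronized void markFlushed() {
        pending.clear();
    }

    // Entities that were logged but never confirmed by the back end.
    synchronized List<String> unflushed() {
        return new ArrayList<>(pending);
    }
}

/** Hypothetical collector that only knows the two abstractions above. */
class CollectorSketch {
    private final WriteAheadLog wal;
    private final BackendStore store;
    private final List<String> buffer = new ArrayList<>();

    CollectorSketch(WriteAheadLog wal, BackendStore store) {
        this.wal = wal;
        this.store = store;
    }

    // Async path: log first, then buffer in memory until the next flush.
    void putEntityAsync(String entity) {
        wal.append(entity);
        buffer.add(entity);
    }

    // Push buffered entities to the back end, then release the WAL entries.
    void flush() {
        for (String entity : buffer) {
            store.write(entity);
        }
        buffer.clear();
        wal.markFlushed();
    }

    // A replacement collector reloads whatever the WAL still holds.
    static CollectorSketch recover(WriteAheadLog wal, BackendStore store) {
        CollectorSketch collector = new CollectorSketch(wal, store);
        collector.buffer.addAll(wal.unflushed());
        return collector;
    }
}
```

Because recovery goes only through {{WriteAheadLog}} and {{BackendStore}}, the 
same replay path would apply whichever store is plugged in.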

Cons:
# Ensuring that all async entities are written into the WAL would be a costly 
operation, because requests from multiple clients will wait on the write into 
HDFS. Writes into the WAL have to be serialized to ensure atomicity, which 
creates request contention and slows down request processing from 
TimelineClients. 
# Storing entities in the WAL in addition to the back-end storage duplicates 
effort!
# Since we keep only the last 1 minute of data, every collector flush also 
requires renaming the file in HDFS. This leads to entity files being spread 
across the cluster, which slows down writes, since a local write is always 
faster than a remote write! We probably need to think about how to work with a 
single file over the whole collector lifetime that keeps track of only the last 
1 minute of entities. I see a *truncate* API in HDFS; we need to check what 
this API actually does.
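
As a rough illustration of the single-log idea in the third point, the sketch 
below keeps one log for the collector lifetime and drops entries older than a 
retention window instead of rolling to a new file. It is an in-memory stand-in 
only: {{RollingWal}} is a hypothetical name, timestamps are passed in 
explicitly, and no HDFS API is called. As far as I know, 
{{FileSystem.truncate(Path, long)}} shortens a file to a given length, that is, 
it drops the tail rather than the head, so whether it fits this pattern still 
needs to be checked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical single log that retains only entries within a time window. */
class RollingWal {
    private static final class Entry {
        final long timestampMs;
        final String entity;

        Entry(long timestampMs, String entity) {
            this.timestampMs = timestampMs;
            this.entity = entity;
        }
    }

    private final long retentionMs;
    private final Deque<Entry> entries = new ArrayDeque<>();

    RollingWal(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    // Appends are serialized; this lock is where concurrent client requests would queue up.
    synchronized void append(long nowMs, String entity) {
        prune(nowMs);
        entries.addLast(new Entry(nowMs, entity));
    }

    // Entities still inside the retention window at time nowMs, oldest first.
    synchronized List<String> snapshot(long nowMs) {
        prune(nowMs);
        List<String> result = new ArrayList<>();
        for (Entry entry : entries) {
            result.add(entry.entity);
        }
        return result;
    }

    // Drop entries from the head that are strictly older than the retention window.
    private void prune(long nowMs) {
        while (!entries.isEmpty()
                && nowMs - entries.peekFirst().timestampMs > retentionMs) {
            entries.removeFirst();
        }
    }
}
```

With a retention of 60000 ms, an entry appended at time 0 is no longer returned 
by a snapshot taken at 70000 ms, while entries appended at 30000 ms and 
70000 ms are.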

I think _if the cost of flushing into the WAL for every async API call is 
greater than or equal to the cost of flushing into HBase (as of now), then it 
is better to flush into HBase directly_. But this approach is tightly coupled 
to the cost of the back-end storage!

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a 
> running yarn app, we would like that yarn app to re-establish connection with 
> a new timeline collector. 


