[
https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240021#comment-16240021
]
Rohith Sharma K S commented on YARN-7272:
-----------------------------------------
Thanks [~vrushalic] for putting up the summary.
Adding to the above points, some of the pros and cons discussed in the call
are:
Pros :
# An additional WAL layer would help recover async entities. This ensures that
no entities sent by TimelineV2Clients to collectors are lost.
This JIRA primarily tries to address 2 major downtime scenarios, i.e. the
Collector JVM going down or the Collector machine itself going down.
# The WAL layer is an independent service that runs on the collector. It is not
tightly bound to the back end storage. This enables recovery of async entities
regardless of which back end storage is plugged in.
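To make the WAL idea concrete, here is a minimal sketch of the append-then-replay
pattern. It is an illustration only, not code from the attached patch: the class
name {{EntityWal}} is hypothetical, a local file stands in for HDFS, and each
entity is assumed to be already serialized into a single line of text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not a YARN class): a line-oriented write-ahead log. */
class EntityWal {
    private final Path logFile;

    EntityWal(Path logFile) {
        this.logFile = logFile;
    }

    /**
     * Appends one serialized entity and returns only after the write call
     * completes. Assumes the serialized form contains no newline.
     * The method is synchronized, so concurrent callers queue up here.
     */
    synchronized void append(String serializedEntity) throws IOException {
        Files.write(
            logFile,
            (serializedEntity + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.APPEND,
            StandardOpenOption.SYNC);
    }

    /** Returns every entity recorded so far, e.g. after a collector restart. */
    synchronized List<String> replay() throws IOException {
        if (!Files.exists(logFile)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Files.readAllLines(logFile, StandardCharsets.UTF_8));
    }
}
```

The log knows nothing about the back end: a restarted collector could call
{{replay()}} and hand the entities to whatever storage is plugged in. The single
lock around {{append()}} is the simplest way to keep records from interleaving.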
Cons :
# Ensuring all async entities are written into the WAL would be a costly
operation because requests from multiple clients will be waiting on the write
into HDFS. This creates contention among requests writing into the WAL to ensure
atomicity, which slows down request processing from TimelineClients.
# Storing entities into the WAL in addition to the back end storage would be
duplicated effort!
# Since we keep only the last 1 minute of data, every collector flush also
requires renaming the file in HDFS. This operation leads to entity files being
created across the cluster, which makes writes slower since a local write is
always faster than a remote write! We probably need to think about how to work
with a single file for the overall collector lifetime that keeps track of only
the last 1 minute of entities. I see a *truncate* API in HDFS; we need to check
what this API does.
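One possible reading of the "keep only the last 1 minute" requirement is a log
split into segments, where each collector flush rolls to a new segment and
deletes segments that closed before the retention window. The sketch below
illustrates that reading only. {{RollingWal}}, the {{wal-<start>.log}} naming and
the explicit {{nowMillis}} parameter are all hypothetical, local files stand in
for HDFS, and the segment index lives in memory for brevity (a real
implementation would rebuild it by listing the directory on restart). As far as
I understand, HDFS {{FileSystem.truncate(Path, long)}} shortens a file to a given
length, i.e. it drops the tail, so on its own it would not discard the oldest
records at the head of a single file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not a YARN class): a WAL with time-based retention. */
class RollingWal {
    private final Path dir;
    private final long retentionMillis;
    // Start times of live segments, oldest first; the last one is active.
    private final List<Long> segmentStarts = new ArrayList<>();

    RollingWal(Path dir, long retentionMillis, long nowMillis) {
        this.dir = dir;
        this.retentionMillis = retentionMillis;
        segmentStarts.add(nowMillis);
    }

    private Path segmentPath(long start) {
        return dir.resolve("wal-" + start + ".log");
    }

    /** Appends one single-line serialized entity to the active segment. */
    synchronized void append(String serializedEntity) throws IOException {
        long active = segmentStarts.get(segmentStarts.size() - 1);
        Files.write(
            segmentPath(active),
            (serializedEntity + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.APPEND);
    }

    /** Called on collector flush: start a new segment, drop expired ones. */
    synchronized void roll(long nowMillis) throws IOException {
        long active = segmentStarts.get(segmentStarts.size() - 1);
        if (nowMillis > active) {
            segmentStarts.add(nowMillis);
        }
        // A segment is closed at the start time of its successor, so it can be
        // deleted once that successor started before the retention window.
        while (segmentStarts.size() > 1
                && segmentStarts.get(1) <= nowMillis - retentionMillis) {
            Files.deleteIfExists(segmentPath(segmentStarts.remove(0)));
        }
    }

    /** Returns the retained entities, oldest segment first. */
    synchronized List<String> replay() throws IOException {
        List<String> entities = new ArrayList<>();
        for (long start : segmentStarts) {
            Path segment = segmentPath(start);
            if (Files.exists(segment)) {
                entities.addAll(Files.readAllLines(segment, StandardCharsets.UTF_8));
            }
        }
        return entities;
    }
}
```

Deleting whole segments avoids rewriting a file in place, at the cost of creating
one small file per flush, which is the file spread mentioned above.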
I think _if the cost of flushing into the WAL for every async API call is greater
than or equal to the cost of flushing into HBase (as of now), then it is better
to flush into HBase directly_. But this approach is tightly coupled to the cost
of the back end storage!
> Enable timeline collector fault tolerance
> -----------------------------------------
>
> Key: YARN-7272
> URL: https://issues.apache.org/jira/browse/YARN-7272
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineclient, timelinereader, timelineserver
> Reporter: Vrushali C
> Assignee: Rohith Sharma K S
> Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down and along with it the timeline collector aux service for a
> running yarn app, we would like that yarn app to re-establish connection with
> a new timeline collector.